Apache Iceberg has quickly become the de facto standard for building modern, high-performance data lakehouses. Designed to bring SQL warehouse-like features such as ACID transactions, time travel, and schema evolution to cloud object storage, Iceberg powers analytic workloads at massive scale.
But beyond its built-in capabilities, Iceberg also offers a powerful and underappreciated feature: extensibility. Whether you're implementing custom indexing, embedding semantic metadata, or capturing data quality metrics, Iceberg provides multiple extensibility points where metadata can be written by data producers and read by downstream consumers.
This blog explores how you can take advantage of Iceberg's extensibility mechanisms to store and retrieve custom metadata. We'll cover where to put custom data, how to read it safely and efficiently, and which mechanisms best fit different use cases.
Note: This kind of extension of Apache Iceberg is typically intended for platforms that build features on top of Apache Iceberg. For an individual Iceberg lakehouse, it is advisable to avoid extending Iceberg unless you are willing to build custom workflows to utilize that custom metadata. For an example of platform extension, watch this discussion of using Iceberg metadata for data quality from dbt's Amy Chen.
Why Extend Iceberg Metadata?
Out-of-the-box, Apache Iceberg provides rich metadata that supports partition pruning, schema evolution, and query planning. But in real-world data platforms, teams often need to attach additional context to their tables to support custom workflows, observability, or governance.
Here are a few examples of custom metadata scenarios:
Custom Indexing: Suppose you build a domain-specific index for faster lookups (e.g., geospatial indexes or ML embeddings). You'll need a way to register and share that index with query engines.
Data Quality Information: You may want to capture the outcome of a validation process, such as null counts, schema conformance status, or failed row percentages, for observability and alerting purposes.
Semantic Metadata: You could annotate tables with business classifications, ownership info, PII tags, or lineage annotations to support governance and auditing.
Processing Hints: Performance-tuning metadata, such as preferred sort columns, Z-order configurations, or execution statistics, can inform smarter query planning.
Pipeline Provenance: Capture what job or system wrote the data, and when, for traceability and reproducibility.
These aren't hypothetical needs; they reflect the types of metadata that engineers frequently invent ad hoc solutions for. Iceberg’s extensibility features offer a structured and portable way to support them natively.
Method 1: Using Table Properties for Lightweight Metadata
The simplest and most accessible way to add custom metadata to an Apache Iceberg table is via the properties map in the table metadata file. This map supports arbitrary key-value pairs and is designed to be both human-readable and machine-accessible. It is primarily intended for settings that affect how the table is read and written.
How It Works
The properties field lives in the table's top-level metadata JSON file. It's intended for configuration and annotations that are not strictly required for core Iceberg operations, but are useful for engines or external tools. You can set properties when creating a table or modify them later with ALTER TABLE statements or programmatic APIs.
Example Use Cases
Custom Indexes

```sql
ALTER TABLE customers SET TBLPROPERTIES (
  'index.geohash.path' = 's3://.../geohash.idx',
  'index.geohash.columns' = 'latitude,longitude'
);
```

Data Quality Annotations

```sql
ALTER TABLE orders SET TBLPROPERTIES (
  'dq.null_check.status' = 'passed',
  'dq.schema_check.version' = 'v1.2',
  'dq.metrics.null_rate' = '0.003'
);
```

Semantic Information

```sql
ALTER TABLE transactions SET TBLPROPERTIES (
  'owner' = 'finance_team',
  'pii' = 'true',
  'classification' = 'confidential'
);
```

Processing Metadata

```sql
ALTER TABLE products SET TBLPROPERTIES (
  'ingest.source' = 'airflow_dag_47',
  'last_updated' = '2025-06-01T03:21:00Z'
);
```
These key-value annotations can be easily retrieved via the Iceberg API, SQL interfaces, or REST catalog endpoints, making them useful for dashboards, orchestration systems, or governance platforms. Keep in mind that arbitrary properties will not be recognized by query engines automatically, but you can build custom workflows that retrieve and act on them.
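To illustrate the consumer side, here is a minimal sketch (using PyIceberg, with placeholder catalog settings, table names, and property keys) of how an orchestration or governance job might read and act on these annotations:

```python
from pyiceberg.catalog import load_catalog

# Placeholder REST catalog configuration; adjust for your environment.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# Table properties are plain string key-value pairs stored in the table metadata.
props = table.metadata.properties

# Act on custom, prefixed keys by convention (the 'dq.' prefix from the examples above).
dq_status = props.get("dq.null_check.status", "unknown")
null_rate = float(props.get("dq.metrics.null_rate", "0.0"))

if dq_status != "passed" or null_rate > 0.01:
    print(f"Data quality alert for sales.orders: status={dq_status}, null_rate={null_rate}")
```

In Spark SQL the same values are visible with SHOW TBLPROPERTIES, and they also appear in the REST catalog's table response covered later, which is what makes a simple naming convention so portable.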
Best Practices
Prefix keys for custom domains (e.g., dq. or index.) to avoid future collisions with reserved keys.
Avoid large values—this mechanism is not suitable for storing large blobs or complex data structures.
Use string values where possible. While some engines support typed values, others treat all values as strings.
Limitations
Properties are intended for simple metadata. If you need to store structured or binary data (like large indexes or metrics vectors), the properties map may not be appropriate.
There's no schema enforcement or typing—this is pure convention. Consistent naming and documentation are key to usability.
"Properties" was intended for properties that engines like Apache Spark will use for tuning read and write behavior. While you can store arbitrary metadata here, it may be better if it exists in other places. Since custom properties would require custom workflows to take advantage of anyway, you can also store them in custom files. Suppose you want the property to be used across engines. In that case, it's best to work with the community to have it added to the specification and to the list of recognized table properties in the documentation.
Method 2: Leveraging Puffin Files for Structured Metadata
While table properties offer a lightweight solution for annotations, more sophisticated metadata use cases, such as extensive indexes, detailed statistics, or semi-structured payloads, require a richer and more scalable mechanism. That’s where Puffin files come in.
Puffin is Iceberg’s extensible metadata file format designed to store structured binary metadata alongside a table, without interfering with core table operations or schema evolution.
What Is a Puffin File?
A Puffin file is a sidecar metadata artifact registered with a specific Iceberg table. It allows for:
Structured metadata (e.g., Avro or JSON records).
Versioned attachment to a specific table snapshot or commit.
Efficient retrieval through Iceberg’s metadata discovery layers.
These files are handy when you need to store non-trivial, typed, or binary metadata that is too large or complex for the properties map.
Example Use Cases
Custom Indexing
Store a serialized bloom filter, spatial index, or approximate distinct count sketch.
Reference the index in a properties key, while storing the actual data in a Puffin file.
Data Quality Metrics
Store per-column null counts, histograms, outlier flags, and anomaly scores.
Example: a pipeline writes column-wise metrics after validation and registers them as a Puffin attachment.
Semantic Layer Metadata
Capture user-defined tags, column descriptions, or lineage graphs in structured form.
Useful for data catalogs or governance tools.
ML Metadata
Store training statistics, model feature vectors, or hyperparameter configs tied to datasets.
Reproducibility and audit logs for ML pipelines.
Working with Puffin Files
While Puffin files are still an advanced feature (often used through the Java API), the general process is:
Write structured metadata using a compatible serialization format (e.g., Avro).
Save the Puffin file to a location in table storage.
Register the Puffin file with the table’s metadata using the Iceberg API.
Currently, reading and writing Puffin files requires integration with Iceberg internals, but work is underway to make them easier to use through standardized APIs.
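As a rough sketch of that three-step flow (write, save, register): PyIceberg does not currently expose a public Puffin writer, so the example below stands in a JSON sidecar for the Puffin blob and "registers" it through a table property. The catalog settings, paths, and key names are placeholders, not an official API.

```python
import json

from pyiceberg.catalog import load_catalog

# Placeholder catalog and table; assumes the table already has at least one snapshot.
catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")
snapshot = table.current_snapshot()

# 1. Build the structured metadata payload (column-level data quality metrics here).
metrics = {
    "snapshot_id": snapshot.snapshot_id,
    "columns": {"order_id": {"null_count": 0}, "amount": {"null_count": 12}},
}

# 2. Save it alongside the table's other metadata files.
#    (A real Puffin writer would emit a binary Puffin file instead of JSON.)
metrics_path = f"{table.location()}/metadata/dq-metrics-{snapshot.snapshot_id}.json"
with table.io.new_output(metrics_path).create(overwrite=True) as out:
    out.write(json.dumps(metrics).encode("utf-8"))

# 3. Register the artifact so downstream consumers can discover it. Here that means
#    recording its path and snapshot in table properties; native Puffin registration
#    currently goes through the Iceberg Java API.
with table.transaction() as tx:
    tx.set_properties({
        "dq.metrics.path": metrics_path,
        "dq.metrics.snapshot_id": str(snapshot.snapshot_id),
    })
```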
Best Practices
Use typed formats like Avro or Parquet for consistent schema enforcement.
Store only derived metadata, not source data.
Tie Puffin files to specific snapshots if they depend on data state (e.g., indexes).
Limitations
Not all engines expose Puffin functionality directly (e.g., through SQL), so usage often requires platform-level integration or custom readers.
Puffin files should be treated as auxiliary; clients must know how to interpret them. So, if you develop an index that you want to be usable across engines, contributing a spec for the Puffin blob, along with libraries and tools for reading those blobs, to the Iceberg repo will help make it more likely to be broadly usable.
Method 3: Using the REST Catalog API for Metadata Discovery
In addition to table-level metadata and Puffin files, Apache Iceberg also defines a REST catalog specification—a standardized interface that query engines use to interact with table catalogs over HTTP. This API isn’t just for basic operations like table creation and listing—it can also be a powerful vector for metadata extensibility and remote discovery.
Standard REST Catalog Endpoints
The REST catalog spec defines a set of core endpoints for interacting with catalogs, namespaces, and tables. These include:
GET /v1/namespaces/{namespace}: Retrieve metadata about a namespace, including custom properties.
GET /v1/namespaces/{namespace}/tables/{table}: Get full table metadata including properties.
GET /v1/tables/{identifier}/metadata: Retrieve raw metadata file location (often used by engines to bootstrap table state).
These endpoints surface the same custom properties discussed earlier, but now they are accessible via REST, meaning they can be queried and interpreted by services outside the processing engine (e.g., catalogs, data governance layers, UI tools).
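As a hedged sketch, here is how a service outside the query engine might pull those properties over HTTP using the table endpoint above (the base URL, auth token, and names are placeholders, and some catalogs insert a {prefix} segment into these paths):

```python
import requests

BASE_URL = "https://catalog.example.com/v1"    # placeholder REST catalog endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials

def get_table_properties(namespace: str, table: str) -> dict:
    """Load a table via the REST catalog and return its properties map."""
    resp = requests.get(
        f"{BASE_URL}/namespaces/{namespace}/tables/{table}",
        headers=HEADERS,
        timeout=10,
    )
    resp.raise_for_status()
    # The load-table response embeds the table metadata, including 'properties'.
    return resp.json().get("metadata", {}).get("properties", {})

props = get_table_properties("finance", "transactions")
if props.get("pii") == "true":
    print("finance.transactions contains PII; apply the restricted access policy")
```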
Custom Extensions Beyond the Spec
While the standardized REST endpoints offer a uniform and engine-compatible way to discover metadata, many catalog implementations extend beyond these endpoints to offer custom APIs that expose richer metadata and functionality. For example:
Nessie’s Branching and Tagging API
Nessie allows you to treat a catalog as a version-controlled repository, with branches and tags for tables.
APIs like GET /trees/{ref} or POST /commits provide advanced versioning and branching not covered by the Iceberg REST spec.
Engines need custom integrations to use these features fully—Apache Flink, Spark, and Dremio have developed Nessie connectors for this purpose.
Polaris Catalog Management API Endpoints
Polaris exposes additional metadata about service principals, access controls, and catalog hierarchies beyond the standard Iceberg catalog scope.
These APIs support organizational metadata like access policies, ownership, or lineage not defined in the base REST spec.
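As a small illustration of calling such a custom endpoint, the sketch below queries Nessie's reference API mentioned above to inspect the state of a branch (the base URL is a placeholder, and the exact path prefix depends on the Nessie version you run):

```python
import requests

NESSIE_BASE = "https://nessie.example.com/api/v2"  # placeholder; older servers expose /api/v1

# Fetch the current state of the 'main' branch (its head commit hash, type, etc.).
resp = requests.get(f"{NESSIE_BASE}/trees/main", timeout=10)
resp.raise_for_status()
print(resp.json())
```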
Trade-offs and Considerations
| Pros of Custom REST APIs | Cons / Trade-offs |
| --- | --- |
| Enables advanced features like branching | Requires engine-specific plugins or extensions |
| Allows centralized metadata management | Can break interoperability with vanilla engines |
| Better control over security and auditing | Risk of vendor lock-in or fragmented tooling |
Best Practices
Follow the standard REST catalog spec for widely consumed metadata like properties.
Use custom endpoints for platform-specific features, but document and version them carefully.
Collaborate with engine communities if proposing broader support (e.g., Iceberg community adoption of new features, indexing, etc.).
Choosing the Right Metadata Mechanism
With three extensibility points (table properties, Puffin files, and the REST catalog API), Iceberg gives you a versatile toolkit for embedding and discovering custom metadata. Here’s how they compare:
| Use Case | Recommended Mechanism | Why |
| --- | --- | --- |
| Simple annotations or flags | Table properties | Easy to use, accessible via SQL and REST |
| Structured or large metadata | Puffin files | Supports Avro, complex schemas, and large payloads |
| Organization-wide access and control | REST catalog (standard) | Enables remote discovery and orchestration |
| Platform-specific metadata or APIs | REST catalog (custom endpoints) | Full flexibility, but needs custom engine integration |
Decision Guide
If the metadata is simple, text-based, and tied to a table, use properties.
If the metadata is typed, structured, or snapshot-aware, use Puffin files.
If the metadata must be queryable by services or UIs across your organization, expose it via the REST catalog.
For metadata related to catalog-wide behavior (e.g., branching, access control), use or build custom REST APIs into the catalog.
Real-World Scenarios
Let’s revisit some of the earlier use cases and how these mechanisms could theoretically apply:
Custom Indexing
Use properties to register index paths or types.
Store the index itself as a Puffin file or external artifact.
Data Quality Checks
Write status flags (e.g., pass/fail) in properties.
Store column-level metrics in Puffin files.
Use REST APIs to expose results for dashboards.
Semantic Metadata
Tag fields or tables in properties.
Use REST APIs to let catalogs, governance tools, or UIs consume metadata globally.
ML Metadata
Annotate tables with model versioning info in properties.
Store model-specific statistics or feature maps in Puffin files.
Conclusion
Apache Iceberg’s modular architecture doesn’t just support scalable data operations—it also provides the scaffolding to embed meaningful metadata where it belongs: right alongside your data.
By using properties, Puffin files, and REST catalog APIs wisely, you can build richer, more introspective data systems. Whether you're developing an internal data quality pipeline or a multi-tenant ML feature store, Iceberg offers clean integration points that let metadata travel with the data.
Want to go further? Consider contributing your custom metadata tooling back to the Iceberg community—or proposing new specs that can benefit the broader ecosystem.