20 minute read · June 9, 2025

Extending Apache Iceberg: Best Practices for Storing and Discovering Custom Metadata


Alex Merced · Head of DevRel, Dremio


Introduction

Apache Iceberg has quickly become the de facto standard for building modern, high-performance data lakehouses. Designed to bring SQL warehouse features such as ACID transactions, time travel, and schema evolution to cloud object storage, Iceberg powers analytic workloads at massive scale.

But beyond its built-in capabilities, Iceberg also offers a powerful and underappreciated feature: extensibility. Whether you're implementing custom indexing, embedding semantic metadata, or capturing data quality metrics, Iceberg provides multiple extensibility points where metadata can be written by data producers and read by downstream consumers.

This blog explores how you can take advantage of Iceberg's extensibility mechanisms to store and retrieve custom metadata. We'll cover where to put custom data, how to read it safely and efficiently, and which mechanisms best fit different use cases.

Note: This kind of extension is typically intended for platforms that build features on top of Apache Iceberg. For individual Iceberg lakehouses, it is advisable to avoid extending Iceberg unless you are willing to build custom workflows to utilize that custom metadata. For an example of platform extension, watch this discussion of using Iceberg metadata for data quality from dbt's Amy Chen.

Why Extend Iceberg Metadata?

Out of the box, Apache Iceberg provides rich metadata that supports partition pruning, schema evolution, and query planning. But in real-world data platforms, teams often need to attach additional context to their tables to support custom workflows, observability, or governance.

Here are a few examples of custom metadata scenarios:

  • Custom Indexing: Suppose you build a domain-specific index for faster lookups (e.g., geospatial indexes or ML embeddings). You'll need a way to register and share that index with query engines.
  • Data Quality Information: You may want to capture the outcome of a validation process, such as null counts, schema conformance status, or failed row percentages, for observability and alerting purposes.
  • Semantic Metadata: You could annotate tables with business classifications, ownership info, PII tags, or lineage annotations to support governance and auditing.
  • Processing Hints: Performance-tuning metadata, such as preferred sort columns, Z-order configurations, or execution statistics, can inform smarter query planning.
  • Pipeline Provenance: Capture what job or system wrote the data, and when, for traceability and reproducibility.

These aren't hypothetical needs; they reflect the types of metadata that engineers frequently invent ad hoc solutions for. Iceberg’s extensibility features offer a structured and portable way to support them natively.

Method 1: Using Table Properties for Lightweight Metadata

The simplest and most accessible way to add custom metadata to an Apache Iceberg table is via the properties map in the table metadata file. This map supports arbitrary key-value pairs and is designed to be both human-readable and machine-accessible. It is primarily intended for properties that affect how the table is read and written, though it accepts arbitrary annotations as well.

How It Works

The properties field lives in the table's top-level metadata JSON file. It's intended for configuration and annotations that are not strictly required for core Iceberg operations, but are useful for engines or external tools. You can set properties when creating a table or modify them later with ALTER TABLE statements or programmatic APIs.

Example Use Cases

  • Custom Indexes
    ALTER TABLE customers SET TBLPROPERTIES ( 'index.geohash.path' = 's3://.../geohash.idx', 'index.geohash.columns' = 'latitude,longitude' );
  • Data Quality Annotations
    ALTER TABLE orders SET TBLPROPERTIES ( 'dq.null_check.status' = 'passed', 'dq.schema_check.version' = 'v1.2', 'dq.metrics.null_rate' = '0.003' );
  • Semantic Information
    ALTER TABLE transactions SET TBLPROPERTIES ( 'owner' = 'finance_team', 'pii' = 'true', 'classification' = 'confidential' );
  • Processing Metadata
    ALTER TABLE products SET TBLPROPERTIES ( 'ingest.source' = 'airflow_dag_47', 'last_updated' = '2025-06-01T03:21:00Z' );

These key-value annotations can be easily retrieved via the Iceberg API, SQL interfaces, or REST catalog endpoints, making them useful for dashboards, orchestration systems, or governance platforms. Keep in mind that arbitrary properties will not be recognized by query engines automatically, but you can build custom workflows that retrieve and act on them.
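
For programmatic access, the same key-value pairs are exposed through the Iceberg API. Here's a minimal sketch using the Java API, assuming a configured catalog and a hypothetical db.orders table; updateProperties() commits a new metadata version, and properties() returns a plain string-to-string map:

    import java.util.Map;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;

    public class PropertyExample {
      public static void annotate(Catalog catalog) {
        Table table = catalog.loadTable(TableIdentifier.of("db", "orders"));

        // Write custom annotations; keys use a "dq." prefix to avoid
        // collisions with reserved Iceberg properties.
        table.updateProperties()
            .set("dq.null_check.status", "passed")
            .set("dq.metrics.null_rate", "0.003")
            .commit();

        // Read them back as ordinary strings.
        Map<String, String> props = table.properties();
        System.out.println(props.get("dq.null_check.status"));
      }
    }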

Best Practices

  • Prefix keys for custom domains (e.g., dq. or index.) to avoid future collisions with reserved keys.
  • Avoid large values—this mechanism is not suitable for storing large blobs or complex data structures.
  • Use string values where possible. While some engines support typed values, others treat all values as strings.

Limitations

  • Properties are intended for simple metadata. If you need to store structured or binary data (like large indexes or metrics vectors), the properties map may not be appropriate.
  • There's no schema enforcement or typing—this is pure convention. Consistent naming and documentation are key to usability.
  • "Properties" was intended for properties that engines like Apache Spark will use for tuning read and write behavior. While you can store arbitrary metadata here, it may be better if it exists in other places. Since custom properties would require custom workflows to take advantage of anyway, you can also store them in custom files. Suppose you want the property to be used across engines. In that case, it's best to work with the community to have it added to the specification and to the list of recognized table properties in the documentation.

Method 2: Leveraging Puffin Files for Structured Metadata

While table properties offer a lightweight solution for annotations, more sophisticated metadata use cases, such as extensive indexes, detailed statistics, or semi-structured payloads, require a richer and more scalable mechanism. That’s where Puffin files come in.

Puffin is Iceberg’s extensible metadata file format designed to store structured binary metadata alongside a table, without interfering with core table operations or schema evolution.

What Is a Puffin File?

A Puffin file is a sidecar metadata artifact registered with a specific Iceberg table. It allows for:

  • Structured metadata (e.g., Avro or JSON records).
  • Versioned attachment to a specific table snapshot or commit.
  • Efficient retrieval through Iceberg’s metadata discovery layers.

These files are handy when you need to store non-trivial, typed, or binary metadata that is too large or complex for the properties map.

Example Use Cases

  • Custom Indexing
    • Store a serialized bloom filter, spatial index, or approximate distinct count sketch.
    • Reference the index in a properties key, while storing the actual data in a Puffin file.
  • Data Quality Metrics
    • Store per-column null counts, histograms, outlier flags, and anomaly scores.
    • Example: a pipeline writes column-wise metrics after validation and registers them as a Puffin attachment.
  • Semantic Layer Metadata
    • Capture user-defined tags, column descriptions, or lineage graphs in structured form.
    • Useful for data catalogs or governance tools.
  • ML Metadata
    • Store training statistics, model feature vectors, or hyperparameter configs tied to datasets.
    • Reproducibility and audit logs for ML pipelines.

Working with Puffin Files

While Puffin files are still an advanced feature (often used through the Java API), the general process is:

  1. Write structured metadata using a compatible serialization format (e.g., Avro).
  2. Save the Puffin file to a location in table storage.
  3. Register the Puffin file with the table’s metadata using the Iceberg API.

Currently, reading and writing Puffin files requires integration with Iceberg internals, but work is underway to make them easier to use through standardized APIs.
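
To make those steps concrete, here is a rough sketch using the Java classes in org.apache.iceberg.puffin. The blob type my-app-dq-metrics-v1, the file path, and the field IDs are hypothetical, and exact signatures may differ between Iceberg versions:

    import java.nio.ByteBuffer;
    import java.util.List;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.OutputFile;
    import org.apache.iceberg.puffin.Blob;
    import org.apache.iceberg.puffin.Puffin;
    import org.apache.iceberg.puffin.PuffinWriter;

    public class PuffinExample {
      public static void writeMetrics(Table table, byte[] serializedMetrics) throws Exception {
        long snapshotId = table.currentSnapshot().snapshotId();
        long sequenceNumber = table.currentSnapshot().sequenceNumber();

        // Step 2: save the Puffin file alongside the table's other metadata.
        OutputFile out = table.io().newOutputFile(
            table.location() + "/metadata/dq-metrics.puffin");

        // Step 1: write one blob of serialized metrics. The blob type is a
        // made-up identifier that only our own readers will understand.
        try (PuffinWriter writer = Puffin.write(out).createdBy("dq-pipeline").build()) {
          writer.add(new Blob(
              "my-app-dq-metrics-v1",
              List.of(1),          // field ID(s) of the source column(s)
              snapshotId,
              sequenceNumber,
              ByteBuffer.wrap(serializedMetrics)));
        }

        // Step 3: register the file with the table so it appears in table
        // metadata, e.g., via table.updateStatistics().
      }
    }

Because each blob records the snapshot ID and sequence number it was derived from, readers can check whether the stored metadata still matches the current state of the data.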

Best Practices

  • Use typed formats like Avro or Parquet for consistent schema enforcement.
  • Store only derived metadata, not source data.
  • Tie Puffin files to specific snapshots if they depend on data state (e.g., indexes).

Limitations

  • Not all engines expose Puffin functionality directly (e.g., through SQL), so usage often requires platform-level integration or custom readers.
  • Puffin files should be treated as auxiliary; clients must know how to interpret them. So, if you develop an index that you want used across engines, contributing a spec for the Puffin blob type, along with libraries and tools for reading those blobs, to the Iceberg repo will make it more likely to be broadly usable.

Method 3: Using the REST Catalog API for Metadata Discovery

In addition to table-level metadata and Puffin files, Apache Iceberg also defines a REST catalog specification—a standardized interface that query engines use to interact with table catalogs over HTTP. This API isn’t just for basic operations like table creation and listing—it can also be a powerful vector for metadata extensibility and remote discovery.

Standard REST Catalog Endpoints

The REST catalog spec defines a set of core endpoints for interacting with catalogs, namespaces, and tables. These include:

  • GET /v1/namespaces/{namespace}: Retrieve metadata about a namespace, including custom properties.
  • GET /v1/namespaces/{namespace}/tables/{table}: Load a table's full metadata, including its properties and the metadata-location that engines use to bootstrap table state.

These endpoints surface the same custom properties discussed earlier, but over REST, meaning they can be queried and interpreted by services outside the processing engine (e.g., catalogs, data governance layers, UI tools).
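
As an illustration, a governance service can read those properties with nothing more than an HTTP client. The sketch below uses Java's built-in HttpClient against a hypothetical catalog URL; real deployments may also require a URL prefix and OAuth token negotiation as described in the REST spec:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestCatalogExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical endpoint following the load-table route above.
        String url = "https://catalog.example.com/v1/namespaces/analytics/tables/orders";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Authorization", "Bearer " + System.getenv("CATALOG_TOKEN"))
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON body is a LoadTableResult: it contains "metadata-location"
        // plus a "metadata" object whose "properties" map holds the same
        // custom keys set earlier (e.g., "dq.null_check.status").
        System.out.println(response.body());
      }
    }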

Custom Extensions Beyond the Spec

While the standardized REST endpoints offer a uniform and engine-compatible way to discover metadata, many catalog implementations extend beyond these endpoints to offer custom APIs that expose richer metadata and functionality. For example:

  • Nessie’s Branching and Tagging API
    • Nessie allows you to treat a catalog as a version-controlled repository, with branches and tags for tables.
    • APIs like GET /trees/{ref} or POST /commits provide advanced versioning and branching not covered by the Iceberg REST spec.
    • Engines need custom integrations to use these features fully—Apache Flink, Spark, and Dremio have developed Nessie connectors for this purpose.
  • Polaris Catalog Management API Endpoints
    • Polaris exposes additional metadata about service principals, access controls, and catalog hierarchies beyond the standard Iceberg catalog scope.
    • These APIs support organizational metadata like access policies, ownership, or lineage not defined in the base REST spec.
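
For instance, a client can inspect a Nessie branch directly, something no standard Iceberg REST endpoint offers. This sketch assumes a hypothetical Nessie base URL and uses the versioned /api/v2/trees/{ref} form of the GET /trees/{ref} endpoint mentioned above; consult the Nessie API docs for the exact contract:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class NessieExample {
      public static void main(String[] args) throws Exception {
        // Fetch the current state of the "main" branch.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://nessie.example.com/api/v2/trees/main"))
            .GET()
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // The response describes the reference: its name, its type
        // (branch or tag), and the commit hash it currently points at.
        System.out.println(response.body());
      }
    }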

Trade-offs and Considerations

| Pros of Custom REST APIs | Cons / Trade-offs |
| --- | --- |
| Enables advanced features like branching | Requires engine-specific plugins or extensions |
| Allows centralized metadata management | Can break interoperability with vanilla engines |
| Better control over security and auditing | Risk of vendor lock-in or fragmented tooling |

Best Practices

  • Follow the standard REST catalog spec for widely consumed metadata like properties.
  • Use custom endpoints for platform-specific features, but document and version them carefully.
  • Collaborate with engine communities if proposing broader support (e.g., Iceberg community adoption of new features, indexing, etc.).

Choosing the Right Metadata Mechanism

With three extensibility points (table properties, Puffin files, and the REST catalog API), Iceberg gives you a versatile toolkit for embedding and discovering custom metadata. Here's how they compare:

| Use Case | Recommended Mechanism | Why |
| --- | --- | --- |
| Simple annotations or flags | Table properties | Easy to use, accessible via SQL and REST |
| Structured or large metadata | Puffin files | Supports Avro, complex schemas, and large payloads |
| Organization-wide access and control | REST catalog (standard) | Enables remote discovery and orchestration |
| Platform-specific metadata or APIs | REST catalog (custom endpoints) | Full flexibility, but needs custom engine integration |

Decision Guide

  • If the metadata is simple, text-based, and tied to a table, use properties.
  • If the metadata is typed, structured, or snapshot-aware, use Puffin files.
  • If the metadata must be queryable by services or UIs across your organization, expose it via the REST catalog.
  • For metadata related to catalog-wide behavior (e.g., branching, access control), use or build custom REST APIs into the catalog.

Real-World Scenarios

Let’s revisit some of the earlier use cases and how these mechanisms could theoretically apply:

  • Custom Indexing
    • Use properties to register index paths or types.
    • Store the index itself as a Puffin file or external artifact.
  • Data Quality Checks
    • Write status flags (e.g., pass/fail) in properties.
    • Store column-level metrics in Puffin files.
    • Use REST APIs to expose results for dashboards.
  • Semantic Metadata
    • Tag fields or tables in properties.
    • Use REST APIs to let catalogs, governance tools, or UIs consume metadata globally.
  • ML Metadata
    • Annotate tables with model versioning info in properties.
    • Store model-specific statistics or feature maps in Puffin files.

Conclusion

Apache Iceberg’s modular architecture doesn’t just support scalable data operations—it also provides the scaffolding to embed meaningful metadata where it belongs: right alongside your data.

By using properties, Puffin files, and REST catalog APIs wisely, you can build richer, more introspective data systems. Whether you're developing an internal data quality pipeline or a multi-tenant ML feature store, Iceberg offers clean integration points that let metadata travel with the data.

Want to go further? Consider contributing your custom metadata tooling back to the Iceberg community—or proposing new specs that can benefit the broader ecosystem.
