40 minute read · December 14, 2022

Apache Iceberg FAQ

Alex Merced · Senior Tech Evangelist, Dremio

Below you’ll find answers to many of the questions people may have about Apache Iceberg. Also be sure to check out our Apache Iceberg 101 article for a video series introducing you to many of the Apache Iceberg concepts.

The Basics

What is a data lakehouse?

A data lakehouse is an architectural pattern where most of the workloads often associated with a data warehouse take place on the data lake. This reduces the duplication of data and complexity of data pipelines for better regulatory compliance, consistency of data, and self-service delivery of that data.

What is a data lakehouse table format?

A table format lets tools look at the files in your data lake storage and recognize groups of those files as a single table, so you can query and transform the data performantly, directly on the data lake. This enables data warehouse-like workloads (and more) on the data lake, also known as a data lakehouse.

Table formats are not new or specific to data lakes — they’ve been around since System R, Multics, and Oracle first implemented Edgar Codd’s relational model, although “table format” wasn’t the term used at the time. A data lakehouse table format is different from those because multiple engines need to interact with it.

Additional resources:

What is Apache Iceberg?

Apache Iceberg is a data lakehouse table format that enables ACID transactions, time travel, schema evolution, partition evolution, and more. Apache Iceberg provides the table abstraction layer for your data lake to work like a data warehouse, otherwise known as a data lakehouse.

Additional resources:

Why does Apache Iceberg matter?

Apache Iceberg is the core foundational piece to enabling data lakehouse architecture. With Apache Iceberg you have performance and flexibility when working with your data lake. This simplifies your data architecture and results in quicker turnaround in making data available to consumers while reducing compute and storage costs as you reduce your data warehouse footprint.

Additional resources:

What is an open data lakehouse?

An open lakehouse is a type of data lakehouse that focuses on using open architecture to enable the most resilient version of a data lakehouse by:

  • Using open standard file formats that can be read and written by many tools
  • Using open standard table formats that can be read and written by many tools

This enables an architecture that gives you the benefits of a data lakehouse without the risk of vendor lock-in/lock-out, allowing for flexibility in tooling and sustainable cost savings.

Additional resources:

What Apache Iceberg is and isn’t.

Apache Iceberg is …

  • A table format specification
  • A set of APIs/libraries for engines to interact with tables

❌ Apache Iceberg is not …

  • An execution or storage engine
  • A service

How does Apache Iceberg compare to other table formats?

Apache Iceberg, like other table formats, provides features like ACID transactions and time travel. 

Apache Iceberg is different because:

  • It’s the only format with partition evolution
  • It’s the only format that completely decouples from the Hive table format’s directory-based file structure
  • It provides the easiest-to-use and truly transparent hidden partitioning (which makes partitioning much easier for data consumers)

Additional resources:

Apache Iceberg Table Structure

What is a catalog?

An Iceberg catalog tracks what tables exist along with a pointer/reference to the most recently created metadata file. Apache Hive, AWS Glue, relational databases (JDBC), DynamoDB, and Project Nessie can all be used as a catalog for your Iceberg tables. 
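
For illustration, the catalog is typically wired up through engine configuration. Below is a minimal sketch of Spark properties (in spark-defaults.conf or passed as --conf options) registering a Hive Metastore-backed Iceberg catalog; the catalog name my_catalog and the warehouse path are placeholders:

spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse

Other catalogs (Glue, Nessie, JDBC, DynamoDB) are configured similarly, generally by pointing spark.sql.catalog.my_catalog.catalog-impl (or type, for built-in catalog types) at the appropriate implementation.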

Additional resources:

What are metadata files?

Every time a table is created, or data is added to, deleted from, or updated in a table, a new metadata file is created. The job of the metadata file is to track the high-level definition of the table, such as current/past snapshots, current/past schemas, current/past partitioning schemes, and more.

Additional resources:

What are manifest lists?

Each snapshot of the table is tracked in a file called a manifest list. These files track which manifest files make up the table at that particular snapshot along with metadata on those files to allow query engines to filter out unnecessary files through techniques like partition pruning.

Additional resources:

What are manifests?

Each manifest file tracks a group of data files and delete files that make up the table at one or more snapshots. Manifests have metadata on the individual data files that allow the query engines to do further filtering of files not needed in the query scan (such as min/max filtering).

Additional resources:

What are data files?

Data files are the Parquet, Avro, or ORC files that hold the data in the table.

What are delete files?

Delete files track rows that have been deleted and should be ignored when reading certain data files. Delete files are used when a table sets its row-level operations (update, delete, merge) to Merge-on-Read (MOR). Delete files come in two styles: position deletes, which track deleted records by their position in a data file, and equality deletes, which track deleted rows by their column values.

Additional resources:

What is the Puffin format?

The Puffin format is a binary format for tracking additional table stats and metadata. It provides a very memory-efficient way to add additional stats and indexes to Iceberg tables, which engines can use to further enhance performance.

Additional resources:

What is the access pattern of these files when a query is executed?

Apache Iceberg writes always begin by projecting the filename/hash of the next snapshot, executing the write, and then committing the write only if no other writer has written the expected snapshot first. Reads begin by finding the reference to the newest metadata file in the Iceberg catalog, then use the metadata file, manifest list, and manifests to determine which files should be scanned, and then execute the scan.

Additional resources:

Features

What are ACID transactions?

ACID transactions are data transactions that can provide atomicity, consistency, isolation, and durability guarantees. Before Iceberg, many of these guarantees were difficult to provide when executing transactions on the data lake. With Iceberg, these guarantees are available to all your write transactions.

What is partition evolution?

In most table formats if you decide to change how you partition data to improve query times, you have to rewrite the entire table. This can be pretty expensive and time-consuming at scale. Apache Iceberg has the unique feature of partition evolution where you can change the partitioning scheme of the data going forward without having to rewrite all the data partitioned by the old partition scheme.
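
As a sketch, in engines that support Iceberg’s SQL extensions (such as Spark), partition evolution is a metadata-only ALTER TABLE; the catalog, table, and column names below are hypothetical:

ALTER TABLE my_catalog.db.events ADD PARTITION FIELD months(event_ts);
ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts);

Data already written under the old scheme stays where it is; only newly written data uses the new partition spec, and query planning handles both.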

Additional resources:

What is hidden partitioning?

Partition evolution works because instead of tracking partitions by columns, Iceberg tracks them based on two factors: a source column and a transform. This pattern also unlocks another feature unique to Iceberg: hidden partitioning. The result is that data consumers don’t have to know how the table is partitioned to take advantage of that partitioning, unlike other formats, which usually require creating extra “partition” columns that must be included in filters to benefit from the partitioning.
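
For example (hypothetical catalog, table, and column names), a table partitioned by a transform lets consumers filter on the original column and still get partition pruning:

CREATE TABLE my_catalog.db.events (id BIGINT, event_ts TIMESTAMP, payload STRING)
USING iceberg
PARTITIONED BY (days(event_ts));

-- no separate "event_day" partition column needed in the filter
SELECT * FROM my_catalog.db.events
WHERE event_ts >= TIMESTAMP '2022-12-01 00:00:00';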

Additional resources:

What is time travel?

Since Iceberg’s metadata tracks the snapshots of the table, you can query the table as it was at any particular time; this is known as time travel. This is great for testing machine learning algorithms on the same data that previous versions may have been tested against for better comparison or to run historical analytics.
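
For example, recent versions of Spark (3.3+) with Iceberg expose time travel directly in SQL; the table name, timestamp, and snapshot ID below are placeholders:

-- query the table as of a wall-clock time
SELECT * FROM my_catalog.db.events TIMESTAMP AS OF '2022-12-01 00:00:00';

-- query the table as of a specific snapshot ID
SELECT * FROM my_catalog.db.events VERSION AS OF 8109852963748261860;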

Additional resources:

What is schema evolution?

Schema evolution is the ability to change the schema of the table. Without a table format like Apache Iceberg, updating a table’s schema on the data lake may require a rewrite of the table. Apache Iceberg allows you to add columns, update column types, rename columns, and remove columns from a table without the need to rewrite it.
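
As a sketch (hypothetical catalog, table, and column names), these are metadata-only changes in Spark SQL:

ALTER TABLE my_catalog.db.events ADD COLUMN region STRING;
ALTER TABLE my_catalog.db.events RENAME COLUMN payload TO body;
ALTER TABLE my_catalog.db.events ALTER COLUMN measurement TYPE double;  -- safe type promotion, e.g., float to double
ALTER TABLE my_catalog.db.events DROP COLUMN region;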

Additional resources:

How does Iceberg handle multiple concurrent writes?

Iceberg uses “optimistic concurrency” to provide transactional guarantees during multiple concurrent writes. When multiple writers attempt to write to an Iceberg table, each projects the filename or hash for the next snapshot. Before committing, a writer checks that the projected filename/hash still doesn’t exist. If it doesn’t, the writer commits the table update; if it does, the writer creates a new projection and reprocesses the update until it succeeds or the maximum number of retries is used up.
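
The retry behavior is tunable through table properties; a minimal sketch, assuming a table named my_catalog.db.events:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '100'
);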

Additional resources:

What is Copy-on-Write and Merge-on-Read and when should I use one vs. the other?

Copy-on-Write (COW) and Merge-on-Read (MOR) are two strategies for handling row-level updates to a table (update, delete, merge). In COW, files containing updated records are rewritten, which is optimal for reads but slowest for writes. In MOR, files containing updated records are kept as-is, and a delete file is written instead to track the updated/deleted rows; the delete file is then merged with the data files when the table is read. This gives fast writes at a cost at read time.
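
The strategy is set per row-level operation via table properties; for example (hypothetical table name), switching a table to Merge-on-Read might look like:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read'
);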

Additional resources:

How does Apache Iceberg scale on object storage?

Apache Iceberg isn’t dependent on the physical layout of files, which means files from the same partition don’t have to be in the same directory/prefix. This allows Iceberg to avoid the throttling bottlenecks object stores impose when too many requests hit a single prefix.
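
Iceberg can also spread data files across many prefixes by injecting a hash into file paths; as a sketch (hypothetical table name), this location provider is enabled with a table property:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.object-storage.enabled' = 'true'
);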

Additional resources:

Apache Iceberg and Engines/Tools

Apache Iceberg and Dremio Sonar

Dremio Sonar is the lakehouse query engine that is part of the Dremio Cloud platform. Dremio Sonar supports querying Iceberg tables using S3, AWS Glue, or Nessie/Arctic as a catalog (Azure, Google Cloud, Hive, and Hadoop can be used with Dremio Community Edition). Dremio Sonar supports full Iceberg DML to query, insert, delete, and update.

Additional resources:

Apache Iceberg and Dremio Arctic

Dremio Arctic is the Nessie-based intelligent metastore service from the Dremio Cloud platform. It’s a cloud-managed Nessie catalog with a host of data management features to simplify managing Iceberg tables in the catalog. Arctic is free and the easiest way to have a Nessie-based catalog for your Iceberg tables to enable Git-like features like branching and merging. Arctic works with any engine that supports Nessie catalogs (Spark, Sonar, etc.). 

Additional resources:

Apache Iceberg and Apache Spark

Spark is one of the most ubiquitous engines in the industry. Maintenance and migration of Iceberg tables are largely done through call procedures and actions that are run through Spark. Spark supports full Iceberg DML.

Additional resources:

Apache Iceberg on AWS

AWS Glue is one of the possible Apache Iceberg catalogs that provides interoperability with many other tools in the AWS ecosystem for Iceberg tables.

Using Apache Iceberg’s programmatic APIs

To work with tables programmatically, Iceberg provides two language bindings: Java and Python. The Java API is the more mature of the two and allows you to manage table metadata, retrieve snapshots, create/drop/alter tables, and so on. The Python API does not yet have write capabilities.

Additional resources:

Apache Iceberg and Apache Flink

Apache Flink is a popular tool for stream processing and ingestion. Apache Iceberg has libraries for working with Flink.

Additional resources:

Apache Iceberg and Presto

Presto is an open source query engine that originated at Facebook. Presto supports working with Apache Iceberg tables.

Additional resources:

Apache Iceberg and Databricks

Databricks is a data lake analytics and machine learning platform. It uses its own version of Spark that can be configured to work with Apache Iceberg.

Additional resources:

Migration

How should I go about migrating to Apache Iceberg?

When it comes to migrating data to Apache Iceberg, you can either do a shadow migration, where you restate the data using an engine like Spark or Dremio with CTAS queries, or an in-place migration, where you register existing data from Hive tables or Parquet files with Iceberg using its migrate and add_files Spark procedures.

Additional resources:

What are Iceberg call procedures?

Call procedures are maintenance and migration routines that are part of Iceberg’s API. They can be used via any engine that implements support for them, or imperatively through Iceberg’s Java API. For example, Iceberg’s Spark SQL extensions use the following syntax:

CALL catalog.system.procedure_name(arguments);
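
For instance, arguments can be passed positionally or by name; a sketch using the rollback_to_snapshot procedure with a hypothetical catalog, table, and snapshot ID:

CALL my_catalog.system.rollback_to_snapshot('db.events', 8109852963748261860);
CALL my_catalog.system.rollback_to_snapshot(table => 'db.events', snapshot_id => 8109852963748261860);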

What is an in-place migration?

An in-place migration is one where you don’t rewrite your data. This can be done by converting a Hive table into an Iceberg table with the migrate procedure (you should use the snapshot procedure to test first), or by adding existing Parquet files to an existing Apache Iceberg table using the add_files procedure.

What is the migrate procedure?

The migrate procedure replaces an existing Hive table with an Iceberg table that uses the data files from the original Hive table. After completion, the Hive table no longer exists, which is why the snapshot procedure should be used to test before running the migrate procedure.
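
A minimal sketch of the call, assuming a catalog named my_catalog and a Hive table db.events:

CALL my_catalog.system.migrate('db.events');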

Additional resources:

What is the snapshot procedure?

The snapshot procedure creates a temporary Iceberg table based on an existing Hive table, without altering the original, so you can test and validate before undergoing a full migration of that Hive table.
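
A sketch of the call, assuming a Hive table db.events and a hypothetical name for the test table:

-- creates an Iceberg table over the files of db.events without modifying the original Hive table
CALL my_catalog.system.snapshot('db.events', 'db.events_iceberg_test');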

Additional resources:

What is the add_files procedure?

The add_files procedure takes the data files of a Hive table and adds them to the data files being used in an existing Iceberg table. A new snapshot of the Iceberg table is created that includes these files.
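
A sketch of the call with hypothetical names; the source can be a Hive/Spark table or a path-based table of raw Parquet files:

CALL my_catalog.system.add_files(
  table => 'db.events',
  source_table => 'hive_db.events'
);
-- or, for raw Parquet files: source_table => '`parquet`.`s3://my-bucket/path/to/files`'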

Additional resources:

Can I use CTAS statements for migration?

While migrate and add_files allow you to migrate tables without needing to restate the data, you may want to restate the data to update partitioning and do other auditing in the process. To create a new Iceberg table from a Hive table where the data is restated, CTAS (CREATE TABLE AS) statements should do the trick.
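
A sketch of such a CTAS in Spark SQL, with hypothetical source and destination names, restating the data under a new partition spec:

CREATE TABLE my_catalog.db.events_iceberg
USING iceberg
PARTITIONED BY (days(event_ts))
AS SELECT * FROM hive_db.events;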

Additional resources:

Maintenance

Apache Iceberg tables should undergo regular maintenance to stay performant and to avoid storing data you don’t want to store. The following sections cover many of the features available to maintain your Iceberg tables and solve common maintenance issues like:

  • How to designate what data should be deleted
  • How to eliminate performance hits from too many small files
  • How to cluster data for better performance
  • How to identify and clean up files from failed writes

Additional resources:

How do I ensure a record is hard deleted for regulatory purposes like GDPR?

Apache Iceberg data files are deleted when they are no longer associated with a valid snapshot.

When using Copy-on-Write, data files containing changed data are rewritten, so expiring all snapshots from before the deletion will result in the deletion of the data files that contained those deleted records.

When using Merge-on-Read, data files aren’t rewritten, so they may still be associated with snapshots created after the deletion occurred. To hard delete the records, you will want to run a compaction job and then expire all snapshots from before the compaction, which will delete all files that include those deleted records.
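
As a sketch of that Merge-on-Read sequence (hypothetical table name; pick a cutoff timestamp after the compaction commit):

-- rewrite data files so surviving files no longer contain the deleted rows
CALL my_catalog.system.rewrite_data_files(table => 'db.events');

-- then expire the snapshots from before the compaction so the old files are physically removed
CALL my_catalog.system.expire_snapshots(table => 'db.events', older_than => TIMESTAMP '2022-12-14 00:00:00');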

Additional resources:

How do I expire snapshots?

Every INSERT/UPDATE/DELETE results in the creation of a new snapshot, so expiring older snapshots regularly can help keep your table performant and your file storage in check. This can be done with the expire_snapshots procedure.
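
A minimal sketch with a hypothetical table name and cutoff timestamp:

CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2022-12-01 00:00:00',
  retain_last => 5
);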

Additional resources:

What is compaction and how do I do it?

Compaction allows you to take smaller, less efficient data files and rewrite them into fewer, larger data files for improved query performance. This can be done using the rewriteDataFiles action (exposed in Spark SQL as the rewrite_data_files procedure).
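
A minimal sketch of the Spark SQL form, assuming a table named db.events in a catalog named my_catalog:

CALL my_catalog.system.rewrite_data_files(table => 'db.events');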

Additional resources:

What are the rewrite/clustering strategies available for compacting your data files?

Iceberg allows you to perform rewrites with the rewrite_data_files procedure, which supports two strategies: bin-pack (the default) and sort. Using sort allows you to cluster your data based on one or more fields to group like data together for more efficient data scans. Iceberg also supports z-order clustering through the sort strategy.
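
A sketch of both clustering options with hypothetical table and column names:

-- sort strategy: cluster by one or more fields
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'event_ts DESC NULLS LAST'
);

-- z-order clustering across multiple fields
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'zorder(event_ts, region)'
);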

Additional resources:

What are orphan files and how do I clean them?

Failed jobs can leave unused files and artifacts polluting your data storage; these orphan files can be cleaned up using the deleteOrphanFiles action (the remove_orphan_files procedure in Spark SQL). This can be a time-consuming operation, so it should be done periodically rather than after every write.
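
A sketch of the procedure call, with a hypothetical table name and an optional cutoff:

CALL my_catalog.system.remove_orphan_files(table => 'db.events');

-- or only consider files older than a given timestamp
CALL my_catalog.system.remove_orphan_files(table => 'db.events', older_than => TIMESTAMP '2022-12-01 00:00:00');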

Additional resources:

Do I need to retain every metadata file?

A new metadata file is created on every CREATE/INSERT/UPDATE/DELETE, and these files can build up over time. You can set a maximum number of previous metadata files to keep, so the oldest metadata files are deleted whenever a new one is created.

Additional resources:

How do I automatically clean metadata files?

To automatically clean metadata files, set write.metadata.delete-after-commit.enabled=true in the table properties. This will keep some metadata files (up to write.metadata.previous-versions-max) and will delete the oldest metadata file after each new one is created.
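
A sketch of both properties together, assuming a table named my_catalog.db.events:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '10'
);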

But I Have Additional Questions You Haven’t Covered!

Feel free to reach out to us at [email protected] and we’ll answer your question and add it to this list to help others! We will also hold occasional Apache Iceberg office hours, which we will list below as they occur.

Get a Free Early Release Copy of "Apache Iceberg: The Definitive Guide".

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.