40 minute read · December 14, 2022

Apache Iceberg FAQ

Alex Merced · Senior Tech Evangelist, Dremio

Below you’ll find answers to many of the questions people may have about Apache Iceberg. Also be sure to check out our Apache Iceberg 101 article for a video series introducing you to many of the Apache Iceberg concepts.

The Basics

What is a data lakehouse?

A data lakehouse is an architectural pattern where most of the workloads often associated with a data warehouse take place on the data lake. This reduces the duplication of data and complexity of data pipelines for better regulatory compliance, consistency of data, and self-service delivery of that data.

What is a data lakehouse table format?

A table format lets tools look at the files in your data lake storage and recognize groups of those files as a single table, so you can query and transform the data performantly, directly on the data lake. This enables data warehouse-like workloads (and more) on the data lake, also known as a data lakehouse.

Table formats are not new or specific to data lakes — they’ve been around since System R, Multics, and Oracle first implemented Edgar Codd’s relational model, although “table format” wasn’t the term used at the time. A data lakehouse table format is different from those because multiple engines need to interact with it.

Additional resources:

What is Apache Iceberg?

Apache Iceberg is a data lakehouse table format that enables ACID transactions, time travel, schema evolution, partition evolution, and more. Apache Iceberg provides the table abstraction layer for your data lake to work like a data warehouse, otherwise known as a data lakehouse.

Additional resources:

Why does Apache Iceberg matter?

Apache Iceberg is the core foundational piece to enabling data lakehouse architecture. With Apache Iceberg you have performance and flexibility when working with your data lake. This simplifies your data architecture and results in quicker turnaround in making data available to consumers while reducing compute and storage costs as you reduce your data warehouse footprint.

Additional resources:

What is an open data lakehouse?

An open lakehouse is a type of data lakehouse that focuses on using open architecture to enable the most resilient version of a data lakehouse by:

  • Using open standard file formats that can be read and written by many tools
  • Using open standard table formats that can be read and written by many tools

This enables an architecture that gives you the benefits of a data lakehouse without the risk of vendor lock-in/lock-out, allowing for flexibility in tooling and sustainable cost savings.

Additional resources:

What Apache Iceberg is and isn’t.

Apache Iceberg is …

  • A table format specification
  • A set of APIs/libraries for engines to interact with tables

❌ Apache Iceberg is not …

  • An execution or storage engine
  • A service

How does Apache Iceberg compare to other table formats?

Apache Iceberg, like other table formats, provides features like ACID transactions and time travel. 

Apache Iceberg is different because:

  • It’s the only format with partition evolution
  • It’s the only format that completely decouples from the Hive table format’s directory-based file structure
  • It provides the easiest-to-use and truly transparent hidden partitioning (which makes partitioning much easier for data consumers)

Additional resources:

Apache Iceberg Table Structure

What is a catalog?

An Iceberg catalog tracks what tables exist along with a pointer/reference to the most recently created metadata file. Apache Hive, AWS Glue, relational databases (JDBC), DynamoDB, and Project Nessie can all be used as a catalog for your Iceberg tables. 
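
For illustration, the catalog is typically wired up through engine configuration. Below is a minimal sketch of Spark properties (in spark-defaults.conf or passed as --conf options) registering a Hive Metastore-backed Iceberg catalog; the catalog name my_catalog and the warehouse path are placeholders:

spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/warehouse

Other catalogs (Glue, Nessie, JDBC, DynamoDB) are configured similarly, generally by pointing spark.sql.catalog.my_catalog.catalog-impl (or type, for built-in catalog types) at the appropriate implementation.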

Additional resources:

What are metadata files?

Every time a table is created, or data is added to, deleted from, or updated in a table, a new metadata file is created. The job of the metadata file is to track the high-level definition of the table, such as current/past snapshots, current/past schemas, current/past partitioning schemes, and more.

Additional resources:

What are manifest lists?

Each snapshot of the table is tracked in a file called a manifest list. These files track which manifest files make up the table at that particular snapshot along with metadata on those files to allow query engines to filter out unnecessary files through techniques like partition pruning.

Additional resources:

What are manifests?

Each manifest file tracks a group of data files and delete files that make up the table at one or more snapshots. Manifests have metadata on the individual data files that allow the query engines to do further filtering of files not needed in the query scan (such as min/max filtering).

Additional resources:

What are data files?

Data files are the Parquet, Avro, or ORC files that hold the data in the table.

What are delete files?

Delete files track rows that have been deleted and should be ignored when reading certain data files. Delete files are used when a table sets its row-level operations (update, delete, merge) to Merge-on-Read (MOR). Delete files come in two styles: position deletes, which track deleted records by their position in a data file, and equality deletes, which track deleted rows by their column values.

Additional resources:

What is the Puffin format?

The Puffin format is a binary format for tracking additional table stats and metadata. It provides a very memory-efficient way to add additional stats and indexes to Iceberg tables, which engines can use to further enhance performance.

Additional resources:

What is the access pattern of these files when a query is executed?

Apache Iceberg writes always begin by projecting the filename/hash of the next snapshot, executing the write, and then committing the write only if no other writer has written the expected snapshot first. Reads begin by finding the reference to the newest metadata file in the Iceberg catalog, then use the metadata file, manifest list, and manifests to determine which files should be scanned, and then execute the scan.

Additional resources:

Features

What are ACID transactions?

ACID transactions are data transactions that can provide atomicity, consistency, isolation, and durability guarantees. Before Iceberg, many of these guarantees were difficult to provide when executing transactions on the data lake. With Iceberg, these guarantees are available to all your write transactions.

What is partition evolution?

In most table formats if you decide to change how you partition data to improve query times, you have to rewrite the entire table. This can be pretty expensive and time-consuming at scale. Apache Iceberg has the unique feature of partition evolution where you can change the partitioning scheme of the data going forward without having to rewrite all the data partitioned by the old partition scheme.
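
As a sketch, in engines that support Iceberg’s SQL extensions (such as Spark), partition evolution is a metadata-only ALTER TABLE; the catalog, table, and column names below are hypothetical:

ALTER TABLE my_catalog.db.events ADD PARTITION FIELD months(event_ts);
ALTER TABLE my_catalog.db.events DROP PARTITION FIELD days(event_ts);

Data already written under the old scheme stays where it is; only newly written data uses the new partition spec, and query planning handles both.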

Additional resources:

What is hidden partitioning?

Partition evolution works because instead of tracking partitions by columns, Iceberg tracks them based on two factors: a source column and a transform. This pattern also unlocks another feature unique to Iceberg: hidden partitioning. The result is that data consumers don’t have to know how the table is partitioned to take advantage of that partitioning, unlike other formats, which usually require creating extra “partition” columns that must be included in filters to benefit from the partitioning.
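
For example (hypothetical catalog, table, and column names), a table partitioned by a transform lets consumers filter on the original column and still get partition pruning:

CREATE TABLE my_catalog.db.events (id BIGINT, event_ts TIMESTAMP, payload STRING)
USING iceberg
PARTITIONED BY (days(event_ts));

-- no separate "event_day" partition column needed in the filter
SELECT * FROM my_catalog.db.events
WHERE event_ts >= TIMESTAMP '2022-12-01 00:00:00';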

Additional resources:

What is time travel?

Since Iceberg’s metadata tracks the snapshots of the table, you can query the table as it was at any particular time; this is known as time travel. This is great for testing machine learning algorithms on the same data that previous versions may have been tested against for better comparison or to run historical analytics.
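
For example, recent versions of Spark (3.3+) with Iceberg expose time travel directly in SQL; the table name, timestamp, and snapshot ID below are placeholders:

-- query the table as of a wall-clock time
SELECT * FROM my_catalog.db.events TIMESTAMP AS OF '2022-12-01 00:00:00';

-- query the table as of a specific snapshot ID
SELECT * FROM my_catalog.db.events VERSION AS OF 8109852963748261860;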

Additional resources:

What is schema evolution?

Schema evolution is the ability to change the schema of the table. Without a table format like Apache Iceberg, updating a table’s schema on the data lake may require a rewrite of the table. Apache Iceberg allows you to add columns, update column types, rename columns, and remove columns from a table without the need to rewrite it.
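
As a sketch (hypothetical catalog, table, and column names), these are metadata-only changes in Spark SQL:

ALTER TABLE my_catalog.db.events ADD COLUMN region STRING;
ALTER TABLE my_catalog.db.events RENAME COLUMN payload TO body;
ALTER TABLE my_catalog.db.events ALTER COLUMN measurement TYPE double;  -- safe type promotion, e.g., float to double
ALTER TABLE my_catalog.db.events DROP COLUMN region;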

Additional resources:

How does Iceberg handle multiple concurrent writes?

Iceberg uses “optimistic concurrency” to provide transactional guarantees during multiple concurrent writes. When multiple writers attempt to write to an Iceberg table, each projects the filename or hash for the next snapshot. Before committing, a writer checks that the projected filename/hash still doesn’t exist. If it doesn’t, the writer commits the table update; if it does, the writer creates a new projection and reprocesses the update until it succeeds or the maximum number of retries is used up.
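
The retry behavior is tunable through table properties; a minimal sketch, assuming a table named my_catalog.db.events:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '100'
);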

Additional resources:

What is Copy-on-Write and Merge-on-Read and when should I use one vs. the other?

Copy-on-Write (COW) and Merge-on-Read (MOR) are two strategies for handling row-level updates to a table (update, delete, merge). In COW, files containing updated records are rewritten, which is optimal for reads but slowest for writes. In MOR, files containing updated records are kept as-is, and a delete file is written instead to track the updated/deleted rows; the delete file is then merged with the data files when the table is read. This gives fast writes at a cost at read time.
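
The strategy is set per row-level operation via table properties; for example (hypothetical table name), switching a table to Merge-on-Read might look like:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read'
);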

Additional resources:

How does Apache Iceberg scale on object storage?

Apache Iceberg isn’t dependent on the physical layout of files, which means files from the same partition don’t have to be in the same directory/prefix. This allows Iceberg to avoid the throttling bottlenecks object stores impose when too many requests hit a single prefix.
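
Iceberg can also spread data files across many prefixes by injecting a hash into file paths; as a sketch (hypothetical table name), this location provider is enabled with a table property:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.object-storage.enabled' = 'true'
);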

Additional resources:

Apache Iceberg and Engines/Tools

Apache Iceberg and Dremio Sonar

Dremio Sonar is the lakehouse query engine that is part of the Dremio Cloud platform. Dremio Sonar supports querying Iceberg tables using S3, AWS Glue, or Nessie/Arctic as a catalog (Azure, Google Cloud, Hive, and Hadoop can be used with Dremio Community Edition). Dremio Sonar supports full Iceberg DML to query, insert, delete, and update.

Additional resources:

Apache Iceberg and Dremio Arctic

Dremio Arctic is the Nessie-based intelligent metastore service from the Dremio Cloud platform. It’s a cloud-managed Nessie catalog with a host of data management features to simplify managing Iceberg tables in the catalog. Arctic is free and the easiest way to have a Nessie-based catalog for your Iceberg tables to enable Git-like features like branching and merging. Arctic works with any engine that supports Nessie catalogs (Spark, Sonar, etc.). 

Additional resources:

Apache Iceberg and Apache Spark

Spark is one of the most ubiquitous engines in the industry. Maintenance and migration of Iceberg tables are largely done through call procedures and actions that are run through Spark. Spark supports full Iceberg DML.

Additional resources:

Apache Iceberg on AWS

AWS Glue is one of the possible Apache Iceberg catalogs that provides interoperability with many other tools in the AWS ecosystem for Iceberg tables.

Using Apache Iceberg’s programmatic APIs

To work with tables programmatically, Iceberg provides two language bindings: Java and Python. The Java API is the more mature of the two and allows you to manage table metadata, retrieve snapshots, create/drop/alter tables, and so on. The Python API does not yet have write capabilities.

Additional resources:

Apache Iceberg and Apache Flink

Apache Flink is a popular tool for stream processing and ingestion. Apache Iceberg has libraries for working with Flink.

Additional resources:

Apache Iceberg and Presto

Presto is an open source query engine that originated at Facebook. Presto supports working with Apache Iceberg tables.

Additional resources:

Apache Iceberg and Databricks

Databricks is a data lake analytics and machine learning platform. It uses its own version of Spark that can be configured to work with Apache Iceberg.

Additional resources:

Migration

How should I go about migrating to Apache Iceberg?

When it comes to migrating data to Apache Iceberg, you can either do a shadow migration, where you restate the data using an engine like Spark or Dremio with CTAS queries, or an in-place migration, where you register existing data from Hive tables or Parquet files with Iceberg using its migrate and add_files Spark procedures.

Additional resources:

What are Iceberg call procedures?

Call procedures are maintenance and migration routines that are part of Iceberg’s API. They can be used via any engine that implements support for them, or imperatively through Iceberg’s Java API. For example, Iceberg’s Spark SQL extensions use the following syntax:

CALL catalog.system.procedure_name(arguments);
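
For instance, arguments can be passed positionally or by name; a sketch using the rollback_to_snapshot procedure with a hypothetical catalog, table, and snapshot ID:

CALL my_catalog.system.rollback_to_snapshot('db.events', 8109852963748261860);
CALL my_catalog.system.rollback_to_snapshot(table => 'db.events', snapshot_id => 8109852963748261860);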

What is an in-place migration?

An in-place migration is one where you don’t rewrite your data. This can be done by converting a Hive table into an Iceberg table with the migrate procedure (you should use the snapshot procedure to test first), or by adding existing Parquet files to an existing Apache Iceberg table using the add_files procedure.

What is the migrate procedure?

The migrate procedure replaces an existing Hive table with an Iceberg table that uses the data files from the original Hive table. After completion, the Hive table no longer exists, which is why the snapshot procedure should be used to test before running the migrate procedure.
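
A minimal sketch of the call, assuming a catalog named my_catalog and a Hive table db.events:

CALL my_catalog.system.migrate('db.events');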

Additional resources:

What is the snapshot procedure?

The snapshot procedure creates a temporary Iceberg table based on an existing Hive table, without altering the original, so you can test and validate before undergoing a full migration of that Hive table.
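
A sketch of the call, assuming a Hive table db.events and a hypothetical name for the test table:

-- creates an Iceberg table over the files of db.events without modifying the original Hive table
CALL my_catalog.system.snapshot('db.events', 'db.events_iceberg_test');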

Additional resources:

What is the add_files procedure?

The add_files procedure takes the data files of a Hive table and adds them to the data files being used in an existing Iceberg table. A new snapshot of the Iceberg table is created that includes these files.
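
A sketch of the call with hypothetical names; the source can be a Hive/Spark table or a path-based table of raw Parquet files:

CALL my_catalog.system.add_files(
  table => 'db.events',
  source_table => 'hive_db.events'
);
-- or, for raw Parquet files: source_table => '`parquet`.`s3://my-bucket/path/to/files`'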

Additional resources:

Can I use CTAS statements for migration?

While migrate and add_files allow you to migrate tables without needing to restate the data, you may want to restate the data to update partitioning and do other auditing in the process. To create a new Iceberg table from a Hive table where the data is restated, CTAS (CREATE TABLE AS) statements should do the trick.
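
A sketch of such a CTAS in Spark SQL, with hypothetical source and destination names, restating the data under a new partition spec:

CREATE TABLE my_catalog.db.events_iceberg
USING iceberg
PARTITIONED BY (days(event_ts))
AS SELECT * FROM hive_db.events;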

Additional resources:

Maintenance

Apache Iceberg tables should undergo regular maintenance to stay performant and to avoid storing data you don’t want to store. The following sections cover many of the features available to maintain your Iceberg tables and solve common maintenance issues like:

  • How to designate what data should be deleted
  • How to eliminate performance hits from too many small files
  • How to cluster data for better performance
  • How to identify and clean up files from failed writes

Additional resources:

How do I ensure a record is hard deleted for regulatory purposes like GDPR?

Apache Iceberg data files are deleted when they are no longer associated with a valid snapshot.

When using Copy-on-Write, data files containing changed data are rewritten, so expiring all snapshots from before the deletion will result in the deletion of the data files that contained those deleted records.

When using Merge-on-Read, data files aren’t rewritten, so they may still be associated with snapshots created after the deletion occurred. To hard delete the records, you will want to run a compaction job and then expire all snapshots from before the compaction, which will delete all files that include those deleted records.
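
As a sketch of that Merge-on-Read sequence (hypothetical table name; pick a cutoff timestamp after the compaction commit):

-- rewrite data files so surviving files no longer contain the deleted rows
CALL my_catalog.system.rewrite_data_files(table => 'db.events');

-- then expire the snapshots from before the compaction so the old files are physically removed
CALL my_catalog.system.expire_snapshots(table => 'db.events', older_than => TIMESTAMP '2022-12-14 00:00:00');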

Additional resources:

How do I expire snapshots?

Every INSERT/UPDATE/DELETE results in the creation of a new snapshot, so expiring older snapshots regularly can help keep your table performant and your file storage in check. This can be done with the expire_snapshots procedure.
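
A minimal sketch with a hypothetical table name and cutoff timestamp:

CALL my_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2022-12-01 00:00:00',
  retain_last => 5
);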

Additional resources:

What is compaction and how do I do it?

Compaction allows you to take smaller, less efficient data files and rewrite them into fewer, larger data files for improved query performance. This can be done using the rewriteDataFiles action (exposed in Spark SQL as the rewrite_data_files procedure).
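
A minimal sketch of the Spark SQL form, assuming a table named db.events in a catalog named my_catalog:

CALL my_catalog.system.rewrite_data_files(table => 'db.events');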

Additional resources:

What are the rewrite/clustering strategies available for compacting your data files?

Iceberg allows you to perform rewrites with the rewrite_data_files procedure, which supports two strategies: bin-pack (the default) and sort. Using sort allows you to cluster your data based on one or more fields to group like data together for more efficient data scans. Iceberg also supports z-order clustering through the sort strategy.
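
A sketch of both clustering options with hypothetical table and column names:

-- sort strategy: cluster by one or more fields
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'event_ts DESC NULLS LAST'
);

-- z-order clustering across multiple fields
CALL my_catalog.system.rewrite_data_files(
  table => 'db.events',
  strategy => 'sort',
  sort_order => 'zorder(event_ts, region)'
);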

Additional resources:

What are orphan files and how do I clean them?

Failed jobs can leave unused files and artifacts polluting your data storage; these orphan files can be cleaned up using the deleteOrphanFiles action (the remove_orphan_files procedure in Spark SQL). This can be a time-consuming operation, so it should be done periodically rather than after every write.
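
A sketch of the procedure call, with a hypothetical table name and an optional cutoff:

CALL my_catalog.system.remove_orphan_files(table => 'db.events');

-- or only consider files older than a given timestamp
CALL my_catalog.system.remove_orphan_files(table => 'db.events', older_than => TIMESTAMP '2022-12-01 00:00:00');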

Additional resources:

Do I need to retain every metadata file?

A new metadata file is created on every CREATE/INSERT/UPDATE/DELETE, and these files can build up over time. You can set a maximum number of previous metadata files to keep, so the oldest metadata files are deleted whenever a new one is created.

Additional resources:

How do I automatically clean metadata files?

To automatically clean metadata files, set write.metadata.delete-after-commit.enabled=true in the table properties. This will keep some metadata files (up to write.metadata.previous-versions-max) and will delete the oldest metadata file after each new one is created.
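
A sketch of both properties together, assuming a table named my_catalog.db.events:

ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '10'
);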

But I Have Additional Questions You Haven’t Covered!

Feel free to reach out to us at [email protected] and we’ll answer your question and add it to this list to help others! We will also hold occasional Apache Iceberg office hours, which we will list below as they occur.

Get a Free Early Release Copy of "Apache Iceberg: The Definitive Guide".

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.