This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg developments, including V3 features, the REST Catalog specification, and the rise of Apache Polaris.
Apache Iceberg is an open table format for large-scale analytic datasets. It sits between your query engine (Spark, Flink, Dremio, Trino, DuckDB) and your data files (Parquet, ORC, Avro) on object storage. Iceberg adds the transactional guarantees, performance optimizations, and management capabilities that raw file formats lack.
If you're hearing terms like "data lakehouse," "open table format," or "REST Catalog" and want to understand what they mean in practice, this guide covers everything from architecture to hands-on tutorials.
Why Apache Iceberg Exists
Traditional data lakes store files in directories on object storage. To run a query, the engine lists those directories to find the right files, a process that becomes painfully slow as tables grow to thousands or millions of files.
Hive tables improved on this by tracking partitions in a metastore, but they still had major limitations:
No ACID transactions: Concurrent writes could corrupt data
No schema evolution: Adding a column required rewriting the entire table
Partition awareness required: Users had to know the partition scheme and include partition columns in every query
No time travel: Once data was overwritten, the previous version was gone
Single-engine lock-in: Tight coupling to Hive and Spark
Apache Iceberg was created at Netflix to solve all of these problems. It was donated to the Apache Software Foundation in 2018 and has since become the most widely adopted open table format, supported by more engines than any competitor.
How Apache Iceberg Works: The Metadata Tree
Iceberg replaces directory listings with a structured metadata tree. Understanding this tree is the key to understanding how Iceberg delivers performance and reliability.
The Five Layers
Catalog → Metadata File → Manifest List → Manifest Files → Data Files
Catalog: A pointer to the current metadata file. This is the single source of truth for where the table's metadata lives. One HTTP GET or object storage read.
Metadata File (JSON): Contains the table's complete definition: schema history, partition specs, sort orders, snapshot list, and table properties. Each commit writes a new metadata file.
Manifest List (Avro): One per snapshot. Groups manifest files with partition-level summary statistics, enabling the first level of query pruning.
Manifest Files (Avro): Each manifest tracks a set of data files with per-file column statistics (min/max values, null counts, row counts). This enables the second level of pruning.
Data Files (Parquet/ORC): The actual rows of data. With Parquet, row group statistics enable a third level of pruning.
This layered metadata structure gives Iceberg its signature performance advantage:
| Pruning Level | What It Uses | What It Skips |
| --- | --- | --- |
| Manifest list | Partition summaries | Entire manifest files whose partitions don't match the query |
| Manifest file | Column min/max stats | Individual data files that can't contain matching rows |
| Parquet row group | Row group statistics | Row groups within a file that don't match |
A query that would scan 10,000 files in a traditional data lake might touch only 50 files after Iceberg's three-level pruning. For a detailed look at how this works, see How Apache Iceberg Is Built for Performance.
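The pruning logic above can be sketched in plain Python. The data structures below are simplified stand-ins for Iceberg's real Avro metadata, and the names (`can_match`, `plan_scan`) are illustrative, not part of any Iceberg library:

```python
# Illustrative sketch of Iceberg-style scan planning with min/max statistics.
# These dicts stand in for the manifest list and manifest files.

def can_match(stats, lo, hi):
    """A span of data may contain rows in [lo, hi] only if the ranges overlap."""
    return stats["min"] <= hi and stats["max"] >= lo

# Manifest list: each manifest carries a partition-value summary.
manifests = [
    {"partition": {"min": "2024-01", "max": "2024-02"}, "files": [
        {"order_date": {"min": "2024-01-03", "max": "2024-02-20"}},
    ]},
    {"partition": {"min": "2024-03", "max": "2024-04"}, "files": [
        {"order_date": {"min": "2024-03-01", "max": "2024-03-31"}},
        {"order_date": {"min": "2024-04-01", "max": "2024-04-28"}},
    ]},
]

def plan_scan(lo, hi, part_lo, part_hi):
    """Return only the data files that might contain matching rows."""
    selected = []
    for m in manifests:
        if not can_match(m["partition"], part_lo, part_hi):
            continue  # level 1: skip the whole manifest
        for f in m["files"]:
            if can_match(f["order_date"], lo, hi):
                selected.append(f)  # level 2: keep only overlapping files
    return selected

# WHERE order_date = '2024-03-15': the first manifest is pruned entirely,
# and only one of the two remaining data files survives.
files = plan_scan("2024-03-15", "2024-03-15", "2024-03", "2024-03")
```

The third level (Parquet row groups) applies the same overlap test once the engine opens each surviving file.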
Core Features of Apache Iceberg
ACID Transactions
Every write to an Iceberg table creates a new snapshot atomically. Readers always see a consistent view of the table, either the old snapshot or the new one, never a partial write. This is possible because commits are atomic metadata pointer swaps via optimistic concurrency control.
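The commit protocol reduces to a compare-and-swap on the catalog's pointer. This sketch is illustrative (the `Catalog` class and `commit` helper are made up for the example, not an Iceberg API), but it captures why readers never see a partial write:

```python
# Sketch of an atomic commit via optimistic concurrency control.
# The catalog holds one pointer per table; a commit publishes new metadata
# only if the pointer hasn't moved since the writer read it.

class Catalog:
    def __init__(self):
        self.pointer = "metadata-v1.json"

    def swap(self, expected, new):
        """Compare-and-swap: the single atomic step every commit reduces to."""
        if self.pointer != expected:
            return False  # another writer committed first
        self.pointer = new
        return True

def commit(catalog, make_metadata, retries=3):
    for _ in range(retries):
        base = catalog.pointer       # read the current table state
        new = make_metadata(base)    # write a complete new metadata file
        if catalog.swap(base, new):  # try to publish it atomically
            return new
        # conflict: re-read, re-apply the change on top of the winner, retry
    raise RuntimeError("commit failed after retries")

cat = Catalog()
result = commit(cat, lambda base: f"metadata-after-{base}")
```

Until the swap succeeds, readers keep resolving the old pointer, so they see either the old snapshot or the new one, never a mix.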
Schema Evolution
Iceberg tracks columns by unique ID, not by position or name. This means you can:
Add new columns
Drop existing columns
Rename columns
Reorder columns
Widen types (e.g., int → long)
All without rewriting data files. Existing data files remain valid. The engine simply maps the old schema to the new one at read time.
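A small sketch makes the ID-based mapping concrete. The schemas and row below are invented for illustration; real Iceberg schemas are richer, but the resolution rule is the same:

```python
# Illustrative sketch: resolving columns by field ID rather than name.
# A data file written before a rename still maps correctly to the new schema.

old_schema = {1: "customer_name", 2: "amount"}          # schema at write time
new_schema = {1: "client_name", 2: "amount", 3: "tax"}  # after rename + add

file_row = {"customer_name": "Acme", "amount": 100}     # stored under old names

def read_row(row, write_schema, read_schema):
    """Map old column names to new ones via shared field IDs; columns added
    after the file was written read as NULL."""
    by_id = {fid: row.get(name) for fid, name in write_schema.items()}
    return {name: by_id.get(fid) for fid, name in read_schema.items()}

row = read_row(file_row, old_schema, new_schema)
# row["client_name"] holds the value originally written as customer_name;
# row["tax"] is None because field 3 didn't exist when the file was written
```

Because the IDs never change, a rename is purely a metadata edit, and old files never become ambiguous.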
Hidden Partitioning
In traditional Hive-style partitioning, users must include partition columns in their queries. Iceberg eliminates this with hidden partitioning: you define partition transforms (like month(order_date)), and Iceberg automatically maps query predicates to the right partitions. Users write WHERE order_date = '2024-03-15' and Iceberg handles the rest.
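The key idea is that the transform is applied to the query predicate, not typed by the user. A minimal sketch, with a toy partition index standing in for the real metadata:

```python
# Sketch of hidden partitioning: the table stores month(order_date) as its
# partition value, and the planner derives that value from the predicate.
# The partition index below is illustrative, not an Iceberg structure.

def month_transform(date_str):
    """Iceberg's month() transform, simplified: '2024-03-15' -> '2024-03'."""
    return date_str[:7]

partitions = {
    "2024-02": ["f1.parquet"],
    "2024-03": ["f2.parquet", "f3.parquet"],
}

# User writes: WHERE order_date = '2024-03-15' -- no partition column needed.
predicate = "2024-03-15"
files = partitions.get(month_transform(predicate), [])
```

The user never needs to know the table is partitioned by month; if the transform later changes, their query stays the same.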
Partition Evolution
Need to change your partition scheme as data grows? Iceberg supports partition evolution, letting you change from month(ts) to day(ts) without rewriting existing data. Old data stays in monthly partitions, new data goes into daily partitions, and queries spanning both work correctly.
Time Travel
Every snapshot is immutable. You can query any previous version of your table by snapshot ID or timestamp:
SELECT * FROM orders FOR SYSTEM_TIME AS OF '2024-01-15';
This is essential for auditing, debugging, and reproducible analytics.
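Under the hood, the timestamp resolves against the snapshot log kept in the metadata file: time travel picks the latest snapshot committed at or before the requested time. A simplified sketch (the log entries are invented for illustration):

```python
# Sketch of timestamp-based time travel: resolve a timestamp to the most
# recent snapshot committed at or before it, using the snapshot log.

snapshot_log = [  # (commit_time, snapshot_id), ordered by commit time
    ("2024-01-10T00:00:00", 101),
    ("2024-01-14T09:30:00", 102),
    ("2024-01-16T12:00:00", 103),
]

def snapshot_as_of(ts):
    candidates = [sid for t, sid in snapshot_log if t <= ts]
    if not candidates:
        raise ValueError("no snapshot at or before " + ts)
    return candidates[-1]

# FOR SYSTEM_TIME AS OF '2024-01-15' resolves to snapshot 102, committed
# on Jan 14 -- the Jan 16 snapshot is too new.
sid = snapshot_as_of("2024-01-15T00:00:00")
```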
Rollback
When a bad data ingestion corrupts your table, rollback reverts to a previous snapshot instantly. It's a metadata-only operation: no data files are deleted or rewritten.
Row-Level Operations (UPDATE, DELETE, MERGE)
Iceberg supports row-level changes through two strategies: copy-on-write (rewrites entire files) and merge-on-read (writes small delete files). V3 introduces deletion vectors for even faster merge-on-read writes.
V3's deletion vectors are a significant performance improvement. They replace separate delete files with bitmap-based markers stored in Puffin files, giving you fast writes and fast reads. For the full V2-to-V3 comparison, see the detailed look at the table format architecture and the official Iceberg specification.
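The merge-on-read idea can be sketched in a few lines. Here a plain Python set stands in for the roaring bitmap a real deletion vector uses; the point is that a delete never rewrites the data file:

```python
# Sketch of merge-on-read with a deletion vector (illustrative): a DELETE
# marks row positions in a bitmap, and the reader skips those positions
# on scan. The data file itself is never rewritten.

data_file = ["row0", "row1", "row2", "row3", "row4"]
deletion_vector = {1, 3}  # positions marked deleted by an UPDATE/DELETE

def scan(rows, dv):
    """Apply the deletion vector at read time."""
    return [r for pos, r in enumerate(rows) if pos not in dv]

visible = scan(data_file, deletion_vector)
```

Copy-on-write, by contrast, would rewrite the whole data file minus the deleted rows at write time, making writes slower but reads simpler.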
The Apache Iceberg Ecosystem
Catalogs
The catalog is the single source of truth for table metadata location. Iceberg supports multiple catalog implementations, including Hive Metastore, AWS Glue, JDBC, Nessie, Apache Polaris, and any service that implements the REST Catalog specification.
Python developers have three ways to work with Iceberg: PyIceberg (pure Python, no JVM), PySpark (distributed processing), and Arrow Flight (SQL via Dremio).
The Apache Iceberg 101 Video Series
This 12-part video series walks through every core concept. Watch them in order for the best experience:
Introduction to the course: What you'll learn and why it matters
The Problem and the Solution: Why existing formats fell short
Iceberg and the Data Lakehouse: How Iceberg fits the lakehouse pattern
Architecture Overview: Metadata files, manifests, and the catalog pointer
Transactions Step by Step: How writes create new snapshots atomically
Iceberg Catalogs: The role of the catalog and the REST Catalog specification
Copy-on-Write and Merge-on-Read: Two strategies for row-level changes
Table Tuning with Table Properties: Configuring compaction, sort order, and more
Migrating to Iceberg: Moving from Hive tables without rewriting data
Table Maintenance: Keeping Your Iceberg Tables Healthy
Iceberg tables need periodic maintenance to stay performant. Three operations matter:
Compaction
Small files accumulate from streaming ingestion or frequent small writes. Compaction rewrites them into larger, optimally sized files (256-512MB target).
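A compaction job first groups small files into rewrite sets near the target size. This sketch shows one simple greedy grouping strategy; the function name and the exact planning algorithm are illustrative, not what any particular engine does internally:

```python
# Sketch of compaction planning: greedily bin-pack small files into rewrite
# groups near a 512 MB target; each group would then be rewritten as one
# larger, optimally sized file.

TARGET_MB = 512

def plan_compaction(file_sizes_mb):
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb):
        if total + size > TARGET_MB and current:
            groups.append(current)  # close this group, start a new one
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

# 40 small 16 MB files (e.g. from streaming ingestion) -> 2 rewrite groups:
# one of 32 files (exactly 512 MB) and one of the remaining 8.
groups = plan_compaction([16] * 40)
```

The rewrite itself is just another Iceberg commit: a new snapshot references the large files and drops the small ones, so readers never see an inconsistent state.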
Snapshot Expiry
Every commit creates a new snapshot. Over time, hundreds of snapshots accumulate. Expiring old snapshots reclaims metadata space.
Orphan File Cleanup
Failed writes or expired snapshots can leave unreferenced data files on storage. Orphan cleanup removes them.
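Conceptually, orphan detection is a set difference: list what is physically on storage, list every file any live snapshot references, and flag the remainder. A minimal sketch with invented paths:

```python
# Sketch of orphan file detection (illustrative): files on storage that no
# live snapshot references are candidates for deletion, typically after an
# age cutoff to avoid racing with in-flight writes.

referenced = {"data/f1.parquet", "data/f2.parquet"}   # union over snapshots
on_storage = {"data/f1.parquet", "data/f2.parquet",
              "data/tmp-aborted.parquet"}             # actual object listing

orphans = on_storage - referenced
```

Real implementations also apply a retention window (only delete orphans older than some threshold), since a file may be unreferenced simply because its commit hasn't landed yet.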
Dremio automates all three operations for tables managed by its catalog, so you don't need to schedule them manually.
Data Ingestion into Iceberg Tables
There are multiple paths to get data into Iceberg, including batch writes from Spark, streaming ingestion from Flink or Kafka Connect, and SQL statements like CTAS and COPY INTO from engines such as Dremio.
Apache Iceberg is one of three major open table formats, alongside Delta Lake and Apache Hudi. The architecture comparison and feature comparison cover the differences in detail. Key differentiators for the current spec version include hidden partitioning, partition evolution, and broader engine support than either alternative.
The fastest path to hands-on experience is through Dremio Cloud. Sign up for a free trial, connect to your object storage, and create your first Iceberg table with a simple CREATE TABLE statement. Dremio handles all the catalog management, compaction, and query optimization automatically. For local development, use Docker with Spark and a REST catalog like Nessie or Apache Polaris.
Does Apache Iceberg lock me into one query engine?
No. Iceberg is engine-agnostic by design. Any engine that implements the Iceberg library or connects via the REST Catalog API can read and write the same tables. You can write data with Spark, query it with Dremio, and analyze it with DuckDB, all against the same table.
How does Iceberg handle schema changes without breaking queries?
Iceberg tracks every column by a unique integer ID rather than by name or position. When you add, drop, rename, or reorder columns, existing data files remain untouched. The engine maps old schemas to new schemas at read time using those IDs, so queries against historical snapshots still return correct results.
When should I use V3 instead of V2?
Upgrade to V3 when your tables have frequent row-level updates or deletes. V3's deletion vectors provide significantly faster write and read performance for those operations compared to V2's positional delete files. If your tables are append-only, V2 is sufficient.
After watching the series of videos above, you should have a pretty good understanding of Apache Iceberg and its concepts.
Below is a list of additional resources to continue learning more about Apache Iceberg, including hands-on exercises, articles from companies detailing their usage of Apache Iceberg and more.
Apache Iceberg Core Concepts
Below are several resources for understanding what Apache Iceberg is and how it fundamentally works at a conceptual level.
Companies Sharing Their Production Apache Iceberg Usage
Below are articles from companies that have documented their deployment of Apache Iceberg into production. You can read about their experiences and lessons learned.
Once you have Apache Iceberg tables in place, you'll want to optimize and maintain them. Below are articles that walk through different features for engineering tables for best performance.
Object storage has become the standard for storing data in a data lakehouse and the resources below highlight Apache Iceberg in the context of cloud object storage.
Streaming data can require lots of considerations that don’t exist in batch processing. Below are resources that deal with using Apache Iceberg in streaming data.
Take your Apache Iceberg tables to the next level with the Project Nessie/Dremio Arctic catalog, which lets you create catalog-level branches for isolating ETL, roll back the catalog, run multi-table transactions, and more. Here are some talks and blogs on the subject.