Dremio Blog

35 minute read · September 22, 2023

Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi

Alex Merced, Head of DevRel, Dremio

This article has been revised and updated from its original version published in 2022 to reflect the latest developments in all three table formats.

The three major open table formats (Apache Iceberg, Delta Lake, and Apache Hudi) each solve the "open lakehouse" problem differently at the architectural level. While the high-level comparison covers features and ecosystem support, this article dives deeper into the internal architecture of each format, how they organize metadata, handle commits, manage concurrent operations, and implement row-level changes.

Architecture determines everything in a data lakehouse: how fast queries plan (metadata structure), how efficiently data is pruned (statistics and partitioning), how safe concurrent operations are (commit protocols), and how much operational overhead your team absorbs (maintenance complexity). A format with superior architecture delivers faster queries, lower costs, and simpler operations, not just today, but as your data scales from gigabytes to petabytes.

Dremio's decision to build exclusively on Apache Iceberg's architecture reflects a deep analysis of these trade-offs. Iceberg's hierarchical metadata tree aligns naturally with Dremio's Apache Arrow-based vectorized execution engine and Columnar Cloud Cache (C3). The result is a query engine that can plan queries against tables with millions of files in under a second, prune 99%+ of data before any I/O occurs, and serve sub-second dashboard queries through live Reflections, all running on commodity cloud object storage.

Understanding these architectural differences helps you make an informed choice for your data platform and explains why Dremio built its lakehouse engine exclusively on Apache Iceberg.

Metadata Architecture: Three Approaches


Iceberg: Hierarchical Metadata Tree

Iceberg organizes metadata in a four-layer tree structure. For the official definition, refer to the Iceberg format versioning spec.

Catalog → Metadata File → Manifest List → Manifest → Data Files

Each layer provides a pruning opportunity:

  1. Catalog: Resolves the table name to its current metadata file
  2. Metadata file: Identifies the current snapshot, schema, and partition spec
  3. Manifest list: Contains partition summaries that prune entire manifests
  4. Manifest: Contains column statistics (min/max/null counts) that prune individual files
  5. Data files: Only surviving files are scanned

This hierarchical structure is why Iceberg excels at cloud object storage: the engine never lists directories. Every file path is recorded in a manifest, and three levels of pruning eliminate most data from being read.
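To make the multi-level pruning concrete, here is a minimal Python sketch. The dict layout and file names are illustrative stand-ins; real manifest lists and manifests are Avro files with encoded bounds.

```python
# Hypothetical, simplified sketch of Iceberg-style metadata pruning.
# Level 1 uses partition summaries from the manifest list; level 2 uses
# per-file min/max column stats from each surviving manifest.

def prune_manifests(manifest_list, partition_pred):
    """Skip whole manifests whose partition range cannot match."""
    return [m for m in manifest_list
            if partition_pred(m["partition_lower"], m["partition_upper"])]

def prune_files(manifests, column, value):
    """Skip individual data files whose column bounds cannot match."""
    survivors = []
    for m in manifests:
        for f in m["files"]:
            lo, hi = f["stats"][column]
            if lo <= value <= hi:          # file may contain matching rows
                survivors.append(f["path"])
    return survivors

manifest_list = [
    {"partition_lower": "2024-01", "partition_upper": "2024-03",
     "files": [{"path": "a.parquet", "stats": {"amount": (0, 50)}},
               {"path": "b.parquet", "stats": {"amount": (60, 900)}}]},
    {"partition_lower": "2024-04", "partition_upper": "2024-06",
     "files": [{"path": "c.parquet", "stats": {"amount": (10, 20)}}]},
]

# Query: partition in Q1 AND amount = 75
q1 = prune_manifests(manifest_list, lambda lo, hi: lo <= "2024-02" <= hi)
print(prune_files(q1, "amount", 75))   # ['b.parquet']
```

Only one of three data files survives both pruning levels, and no directory listing was ever needed.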

Delta Lake: Transaction Log

Delta Lake uses a flat transaction log stored in the _delta_log/ directory:

_delta_log/
├── 00000000000000000000.json    ← First commit (version 0)
├── 00000000000000000001.json    ← Second commit (version 1)
├── ...
├── 00000000000000000009.json    ← Tenth commit (version 9)
└── 00000000000000000010.checkpoint.parquet  ← Checkpoint (by default, every 10 commits)

Each JSON file is an action list (AddFile, RemoveFile, Metadata, Protocol). To reconstruct the current table state, an engine must read the latest checkpoint plus all subsequent JSON files. Column statistics for data skipping are stored inline in AddFile actions.
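The state reconstruction described above can be sketched in a few lines of Python. The action shapes are abbreviated stand-ins; real Delta log files are newline-delimited JSON with one action per line.

```python
# Simplified sketch of Delta-style log replay: the live file set is the
# checkpoint's snapshot plus every AddFile/RemoveFile action committed since.

def replay(checkpoint_files, commits):
    live = set(checkpoint_files)
    for actions in commits:               # commits applied in version order
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

checkpoint = ["part-0.parquet", "part-1.parquet"]   # from the latest checkpoint
commits = [
    [{"add": {"path": "part-2.parquet"}}],          # next JSON commit
    [{"remove": {"path": "part-0.parquet"}},        # the one after
     {"add": {"path": "part-3.parquet"}}],
]
print(sorted(replay(checkpoint, commits)))
# ['part-1.parquet', 'part-2.parquet', 'part-3.parquet']
```

This is why checkpoints matter: without them, every reader would replay the entire log from version 0.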

Hudi: Timeline Architecture

Hudi organizes metadata around a timeline of actions stored in .hoodie/:

.hoodie/
├── hoodie.properties
├── 20240617120000000.commit         ← Commit metadata
├── 20240617120000000.commit.requested
├── 20240617120000000.commit.inflight
└── metadata/                        ← Table metadata (file index, column stats)

Hudi's timeline tracks every action (commit, compaction, clustering, cleaning) with timestamps. The timeline is append-only and provides a complete audit trail of every operation. Hudi uses a metadata table (stored as its own Hudi table) for file listings and column statistics.

Commit Protocols: How Each Format Handles Concurrency

Iceberg: Optimistic Concurrency Control (OCC)

Iceberg uses OCC with atomic compare-and-swap on the catalog metadata pointer:

  1. Read the current metadata file
  2. Compute changes (new snapshots, manifests)
  3. Write new metadata file
  4. Atomically swap the catalog pointer (compare-and-swap)
  5. If swap fails → conflict detected → retry from step 1

This protocol works correctly on any storage system that offers an atomic pointer swap, implemented via database transactions in JDBC catalogs, a server-side commit endpoint in REST catalogs, or atomic rename in HDFS.
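The retry loop above can be sketched as follows. The `Catalog` class is an illustrative in-memory stand-in; a real catalog provides the compare-and-swap primitive through a database transaction, conditional write, or atomic rename.

```python
# Sketch of Iceberg-style optimistic concurrency control (OCC).
import threading

class Catalog:
    """Toy catalog whose only job is an atomic compare-and-swap of a pointer."""
    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()

    def current(self):
        return self._pointer

    def compare_and_swap(self, expected, new):
        with self._lock:                     # the atomic primitive
            if self._pointer != expected:
                return False                 # another writer committed first
            self._pointer = new
            return True

def commit(catalog, make_metadata, max_retries=3):
    for _ in range(max_retries):
        base = catalog.current()                 # 1. read current metadata
        new = make_metadata(base)                # 2-3. compute and write new metadata
        if catalog.compare_and_swap(base, new):  # 4. atomic pointer swap
            return new
    raise RuntimeError("too many conflicting commits")   # 5. retries exhausted

cat = Catalog("v1.metadata.json")
print(commit(cat, lambda base: "v2.metadata.json"))   # v2.metadata.json
```

The key property: a writer never blocks other writers while computing its changes; conflicts are detected only at the instant of the pointer swap.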

Delta Lake: Write-Ahead Log

Delta Lake uses a write-ahead log protocol:

  1. Read the latest version from _delta_log/
  2. Write changes to _delta_log/{next_version}.json
  3. Commit by creating the next sequential JSON file

Concurrency control relies on the underlying file system's ability to prevent two writers from creating the same file. On S3, this requires a lock service (DynamoDB). On HDFS, atomic rename provides natural conflict resolution.
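The "prevent two writers from creating the same file" requirement can be modeled on a local filesystem with an exclusive-create flag; this sketch is an analogy, not Delta's actual LogStore implementation.

```python
# Sketch of Delta's "create the next version file exactly once" rule,
# modeled with O_CREAT | O_EXCL. Plain S3 PutObject lacks this guarantee,
# which is why a lock service such as DynamoDB is needed there.
import os
import tempfile

def try_commit(log_dir, version, payload):
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes the create fail if the file already exists
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                  # another writer claimed this version
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

log = tempfile.mkdtemp()
print(try_commit(log, 0, '{"add": "..."}'))   # True  — first writer wins
print(try_commit(log, 0, '{"add": "..."}'))   # False — conflicting writer must retry at version 1
```

A losing writer re-reads the log, rebases its changes, and retries with the next version number.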

Hudi: Timeline-Based Concurrency

Hudi's timeline tracks action states (REQUESTED → INFLIGHT → COMPLETED):

  1. Create .inflight action on the timeline
  2. Perform the operation
  3. Transition to .commit on success

This state machine approach supports multiple concurrent writers with table-level or file-group-level locking.

Row-Level Operations: Architectural Differences

How each format implements UPDATE, DELETE, and MERGE operations reveals their architectural priorities:

Iceberg

Iceberg offers three mechanisms, controlled by table properties:

  • Copy-on-Write: Rewrites entire data files with the modifications applied. Best for batch workloads.
  • Merge-on-Read (V2): Writes separate delete files that are merged with data files at read time during query execution.
  • Deletion Vectors (V3): Compact bitmaps marking deleted row positions within data files. The most efficient approach for sparse deletes.

See COW vs MOR: Row-Level Changes on the Lakehouse for a detailed comparison.
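To make the merge-on-read idea concrete, here is an illustrative sketch; it is not Iceberg's actual file layout (real deletion vectors are serialized roaring bitmaps referenced from manifests), but it shows why sparse deletes are cheap.

```python
# A deletion vector marks deleted row positions in ONE data file.
# Deleting a row writes a few bytes of bitmap instead of rewriting the file.

def scan_with_deletes(rows, deletion_vector):
    """Apply the deletion vector during the scan (merge-on-read)."""
    return [row for pos, row in enumerate(rows) if pos not in deletion_vector]

data_file = ["alice", "bob", "carol", "dave"]   # rows in one Parquet file
dv = {1, 3}                                     # positions deleted by earlier commits
print(scan_with_deletes(data_file, dv))         # ['alice', 'carol']
```

Copy-on-write would have rewritten all four rows to delete two of them; the deletion vector touches only the bitmap.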

Delta Lake

Delta Lake supports COW (classic) and Deletion Vectors (recent addition). Deletion Vectors in Delta use a bitmap format similar to Iceberg V3. The original Delta approach was COW-only, with the engine rewriting affected files for every delete/update operation.

Hudi

Hudi was designed for record-level operations from the start:

  • COW tables: Rewrite entire file groups on updates
  • MOR tables: Write delta logs that are merged with base files during compaction or query time

Hudi's record-level index makes point lookups efficient: it can identify which file group contains a specific record key without scanning metadata.
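Conceptually, the record-level index is a key-to-file-group map. In this sketch a plain dict stands in for Hudi's persisted index (which lives in the metadata table); the keys and file-group IDs are made up for illustration.

```python
# Sketch of a record-level index: record key -> file group.
# A point update consults the index and touches only one file group,
# instead of scanning column stats across the whole table.

record_index = {
    "user-1001": "fg-a",
    "user-1002": "fg-a",
    "user-2001": "fg-b",
}

def file_group_for(key):
    return record_index.get(key)     # O(1) lookup, no metadata scan

print(file_group_for("user-2001"))   # fg-b
```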

Performance on Cloud Storage

The architectural differences have significant performance implications on cloud object storage:

| Operation | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Query planning | Fast (manifest-based, no listing) | Fast (log-based, no listing) | Moderate (timeline + metadata table) |
| File pruning | Three-level (partition → manifest → file) | Two-level (partition → file stats) | Two-level (partition → file stats) |
| Partition evolution | Metadata-only | Requires Liquid Clustering | Requires rewrite |
| Small file handling | Explicit compaction | Auto-Optimize | Automatic compaction |
| S3 API costs | Lowest (fewest GET/LIST calls) | Low | Moderate |

Why Dremio Builds on Iceberg Architecture

Dremio's query engine architecture aligns naturally with Iceberg's hierarchical metadata:

  1. C3 cache: Caches Iceberg manifest data and data file column chunks on NVMe SSDs
  2. Apache Arrow execution: Vectorized processing of Parquet column chunks identified by manifest statistics
  3. Reflections: Stored as Iceberg tables, enabling live and incremental refresh
  4. OPTIMIZE TABLE: Single command for data compaction, manifest compaction, and sort optimization
  5. VACUUM TABLE: Combined snapshot expiry and orphan file cleanup

This tight integration between Dremio's engine and Iceberg's metadata architecture enables sub-second query performance on petabyte-scale datasets stored on cloud object storage, a combination that isn't achievable with Hive, Delta Lake, or Hudi at the same cost point.

The Convergence Trend

All three formats are converging toward similar feature sets:

  • Iceberg added Deletion Vectors (like Delta Lake)
  • Hudi added Multi-Table Transactions (like Iceberg's approach)
  • Delta Lake added UniForm (an Iceberg compatibility layer)

Despite convergence, the fundamental architectural differences remain. Iceberg's specification-first, engine-independent approach provides the strongest foundation for multi-vendor, multi-engine data lakehouses.

Metadata Overhead Analysis

Understanding each format's metadata footprint helps predict operational overhead at scale:

| Factor | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Metadata per commit | ~1 KB (snapshot ref) | ~1 KB (JSON action) | ~2 KB (timeline instant) |
| Column stats storage | In manifests (Avro) | In log (JSON/Parquet) | In metadata table |
| Metadata compaction | OPTIMIZE TABLE merges manifests | Checkpoint every 10 commits | Metadata table compaction |
| Table-level stats | Puffin files (NDV, sketches) | Not available | Not available |
| Metadata read cost (1M files) | 1 manifest list + ~100 manifests | 1 checkpoint + recent JSONs | Metadata table scan |

Iceberg's Puffin statistics are a unique architectural advantage: no other format provides table-level statistical aggregates like NDV (number of distinct values) and Theta sketches, which enable cost-based join optimization.
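Why NDV matters for join planning: the classic cost-based estimate for an equi-join's output size is |R| × |S| / max(NDV_R, NDV_S) on the join key. The numbers below are invented for illustration, but the formula is the standard one planners use.

```python
# Sketch of NDV-driven join cardinality estimation. With Puffin-style
# table-level NDV available, the planner can compare candidate join orders
# without touching any data files.

def join_cardinality(rows_r, rows_s, ndv_r, ndv_s):
    """Standard equi-join size estimate: |R| * |S| / max(ndv_r, ndv_s)."""
    return rows_r * rows_s // max(ndv_r, ndv_s)

# orders: 10M rows, 1M distinct customer_ids; customers: 1M rows, 1M ids
est = join_cardinality(10_000_000, 1_000_000, 1_000_000, 1_000_000)
print(est)   # 10000000 — each order matches exactly one customer
```

Without NDV, the planner must guess, and a bad guess on join order can cost orders of magnitude in runtime.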

Catalog Architecture Comparison

The catalog layer differs significantly between formats:

Iceberg Catalogs

Iceberg supports multiple catalog implementations:

  • Apache Polaris: Open-source, REST-based catalog with fine-grained access control
  • Nessie: Git-like branching and tagging for data versioning
  • AWS Glue: Managed catalog on AWS
  • Hive Metastore: Backward-compatible catalog
  • REST Catalog: Standard HTTP API for any catalog backend

Delta Lake Catalogs

  • Unity Catalog: Databricks' catalog (open-sourced as OSS Unity)
  • Hive Metastore: Basic catalog support
  • Glue: AWS integration

Hudi Catalogs

  • Hive Metastore: Primary catalog
  • Glue: AWS integration
  • No standardized REST catalog protocol

Iceberg's catalog diversity is a significant architectural advantage. Organizations can choose catalogs based on their infrastructure (Nessie for Git-like workflows, Polaris for multi-engine governance, Glue for AWS-native) without changing table format.

Frequently Asked Questions

Can I read Delta Lake tables from Dremio?

Yes. Dremio provides native read support for Delta Lake tables written by V2 writers, allowing you to query Delta tables alongside Iceberg tables (for tables written with newer Delta writers, enable UniForm). However, Dremio's optimization features (Reflections, OPTIMIZE TABLE, VACUUM TABLE) are only available for Iceberg tables.

Is there a performance difference between Iceberg V2 and V3?

V3's deletion vectors are more efficient than V2's separate delete files for sparse deletes. For tables with frequent row-level changes, V3 can reduce read-time merge overhead by 30-50%.

Which format has the best GDPR compliance support?

All three formats support DELETE operations for GDPR compliance. However, Iceberg's snapshot expiry + compaction pipeline provides the most straightforward path to verifiable physical deletion, a key requirement for GDPR Article 17 compliance audits.

How do partitioning approaches differ architecturally?

See Table Format Partitioning Comparison for a detailed analysis. The key architectural difference: Iceberg stores partition transforms in manifest metadata, making hidden partitioning and partition evolution possible without data rewrites.
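A small sketch of a hidden partition transform makes the idea tangible. The column name and values are illustrative; Iceberg's actual day transform stores an integer day ordinal, but the effect is the same: queries filter on the source column, and the engine derives the partition predicate itself.

```python
# Sketch of hidden partitioning: the table is partitioned by day(event_ts),
# so the writer derives the partition value from the timestamp and the
# reader never has to mention the partition column in SQL.
from datetime import datetime

def day_transform(ts):
    """Iceberg-style day transform: timestamp -> daily partition value."""
    return ts.date().isoformat()

row = {"event_ts": datetime(2024, 6, 17, 12, 30), "amount": 42}
print(day_transform(row["event_ts"]))   # 2024-06-17
```

Because the transform lives in metadata, switching from daily to hourly partitioning later is a metadata-only change: new data uses the new spec, old files keep the old one, and queries need no rewriting.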

Are the table format differences narrowing over time?

Yes, all three formats are converging on similar capabilities such as row-level deletes, partition evolution, and Z-ordering. However, Iceberg maintains the strongest position in multi-engine adoption and governance. Databricks created UniForm to produce Iceberg-compatible metadata from Delta tables, effectively acknowledging Iceberg as the interoperability standard. The governance differences remain significant: Iceberg's Apache Software Foundation stewardship provides vendor-neutral guarantees that no single-vendor-controlled format can match. For organizations evaluating table formats today, Iceberg offers the broadest ecosystem compatibility with Dremio, Spark, Trino, Flink, Snowflake, DuckDB, and StarRocks all providing native support.




Legacy Content

In the age of data-centric applications, storing, accessing, and managing data can significantly influence an organization's ability to derive value from a data lakehouse. At the heart of this conversation are data lakehouse table formats, which are metadata layers that allow tools to interact with data lake storage like a traditional database. But why do these formats matter? The answer lies in performance, efficiency, and ease of data operations. This blog post will help make the architecture of Apache Iceberg, Delta Lake, and Apache Hudi more accessible to better understand the high-level differences in their respective approaches to providing the lakehouse metadata layer.

The metadata layer these formats provide contains the details that query engines can use to plan efficient data operations. You can think of this metadata as serving a similar role to the Dewey Decimal System in a library. The Dewey Decimal System serves as an abstraction to help readers more quickly find the books they want to read without having to walk up and down the entire library. In the same manner, a table format’s metadata makes it possible for query engines to not have to scan every data file in a dataset.


Apache Iceberg

Originating at Netflix and later becoming a community-run Apache project, Apache Iceberg provides a data lakehouse metadata layer that many vendors support. It is built on a three-tier metadata layer.

Let’s examine the three tiers.

Metadata Files

At the heart of Apache Iceberg’s metadata is the metadata.json file, which defines all the table-level information such as the table’s schema, partitioning scheme, and current snapshot, along with a historical list of schemas, partition specs, and snapshots.

{
    "format-version": 2,
    "table-uuid": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "location": "/data/animals/",
    "last-sequence-number": 1002,
    "last-updated-ms": 1677619248000,
    "last-column-id": 3,
    "schemas": [
        {
            "schema-id": 1,
            "fields": [
                {
                    "id": 1,
                    "name": "id",
                    "type": "int"
                },
                {
                    "id": 2,
                    "name": "category",
                    "type": "string"
                },
                {
                    "id": 3,
                    "name": "name",
                    "type": "string"
                }
            ]
        }
    ],
    "current-schema-id": 1,
    "partition-specs": [
        {
            "spec-id": 1,
            "fields": [
                {
                    "source-id": 2,
                    "transform": "identity",
                    "field-id": 1001
                }
            ]
        }
    ],
    "default-spec-id": 1,
    "last-partition-id": 1001,
    "sort-orders": [
        {
            "order-id": 1,
            "fields": [
                {
                    "source-id": 1,
                    "direction": "asc",
                    "null-order": "nulls-first"
                }
            ]
        }
    ],
    "default-sort-order-id": 1
}

Manifest Lists

Each snapshot listed in the metadata.json points to a “Manifest List” which lists all file manifests that make up that particular snapshot along with manifest-level stats that can be used to filter manifests based on things like partition values.

{
  "manifests": [
    {
      "manifest_path": "s3://bucket/path/to/manifest1.avro",
      "manifest_length": 1024,
      "partition_spec_id": 1,
      "content": 0,
      "sequence_number": 1000,
      "min_sequence_number": 999,
      "added_snapshot_id": 12345,
      "added_files_count": 10,
      "existing_files_count": 8,
      "deleted_files_count": 2,
      "added_rows_count": 100,
      "existing_rows_count": 80,
      "deleted_rows_count": 20,
      "partitions": [
        {
          "contains_null": false,
          "lower_bound": "encoded_value",
          "upper_bound": "encoded_value"
        },
        {
          "contains_null": true,
          "contains_nan": false,
          "lower_bound": "another_encoded_value",
          "upper_bound": "another_encoded_value"
        }
      ]
    },
    {
      "manifest_path": "s3://bucket/path/to/manifest2.avro",
      "manifest_length": 2048,
      "partition_spec_id": 2,
      "content": 1,
      "sequence_number": 1001,
      "min_sequence_number": 1000,
      "added_snapshot_id": 12346,
      "added_files_count": 5,
      "existing_files_count": 7,
      "deleted_files_count": 3,
      "added_rows_count": 50,
      "existing_rows_count": 70,
      "deleted_rows_count": 30,
      "partitions": [
        {
          "contains_null": false,
          "lower_bound": "yet_another_encoded_value",
          "upper_bound": "yet_another_encoded_value"
        }
      ]
    }
  ]
}

Manifests

Each manifest listed in the snapshot’s manifest list enumerates the individual data files it tracks, along with per-file statistics that let engines skip data files based on their column stats.

{
  "metadata": {
    "schema": "{\"type\":\"record\",\"name\":\"table_schema\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}",
    "schema-id": "1",
    "partition-spec": "{\"fields\":[{\"name\":\"age\",\"transform\":\"identity\",\"fieldId\":1,\"sourceId\":1}]}",
    "partition-spec-id": "1",
    "format-version": "2",
    "content": "data"
  },
  "entries": [
    {
      "status": 1,
      "snapshot_id": 101,
      "data_file": {
        "content": 0,
        "file_path": "s3://mybucket/data/datafile1.parquet",
        "file_format": "parquet",
        "partition": {
          "age": 30
        },
        "record_count": 5000,
        "file_size_in_bytes": 1000000,
        "column_sizes": {
          "1": 200000,
          "2": 300000,
          "3": 500000
        },
        "value_counts": {
          "1": 5000,
          "2": 4500,
          "3": 4000
        },
        "null_value_counts": {
          "1": 0,
          "2": 500,
          "3": 1000
        }
      }
    },
    {
      "status": 1,
      "snapshot_id": 102,
      "data_file": {
        "content": 0,
        "file_path": "s3://mybucket/data/datafile2.parquet",
        "file_format": "parquet",
        "partition": {
          "age": 40
        },
        "record_count": 6000,
        "file_size_in_bytes": 1200000,
        "column_sizes": {
          "1": 240000,
          "2": 360000,
          "3": 600000
        },
        "value_counts": {
          "1": 6000,
          "2": 5400,
          "3": 4800
        },
        "null_value_counts": {
          "1": 0,
          "2": 600,
          "3": 1200
        }
      }
    }
  ]
}

Delta Lake

Developed by Databricks, Delta Lake provides a data lakehouse metadata layer that benefits from many features primarily available on the Databricks platform, along with those built into its own specification. Two types of files handle most of the work with Delta Lake: log files and checkpoint files.

Delta Logs

Delta Logs are mechanically very similar to Git commits. In Git, each commit captures the lines of code added and removed since the last commit; Delta Logs capture the files added to and removed from the table since the last commit.

For instance, a log file (00000000000000000001.json) might read:

{
  "protocol": {
    "minReaderVersion": 1,
    "minWriterVersion": 2
  },
  "commitInfo": {
    "timestamp": 1629292910020,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "partitionBy": "['date']"
    },
    "isBlindAppend": false
  },
  "add": {
    "path": "data/partition=date/parquetfile.parquet",
    "partitionValues": {
      "date": "2023-08-18"
    },
    "modificationTime": 1629292910020,
    "size": 84123
  }
}

Checkpoint Files

An engine can re-create the state of a table by going through each Delta Log file and constructing the list of files in the table. However, after many commits, this process can begin to introduce some latency. To deal with this, there are checkpoint files that summarize a group of log files so each individual log file doesn’t have to be read to construct the list of files in the dataset.

| path                                        | partitionValues | modificationTime   | size    |
|---------------------------------------------|-----------------|--------------------|---------|
| data/partition=date1/parquetfile1.parquet   | {date: date1}   | 1629291000010      | 84123   |
| data/partition=date2/parquetfile2.parquet   | {date: date2}   | 1629291500020      | 76234   |

Apache Hudi

Apache Hudi is another table format that originated at Uber. Hudi’s approach revolves around capturing the timestamp and type of different operations and creating a timeline.

Directory Structure

Each Hudi table has several directories it uses to organize the metadata it uses to track the table.

  • /voter_data/: The table’s root folder, which houses the partition folders with data files and the .hoodie folder that houses all the metadata.
    • /.hoodie/: The folder that holds all the table metadata, tracking table properties and file metadata.
      • /hoodie.properties: Properties describing how the table is structured.
      • /metadata/: Where the metadata table is saved, including file listings, bloom filters, and more.

Hudi Metadata

The metadata folder in Hudi contains the metadata table which holds several indexes for improving transaction performance via data skipping and other optimizations.

  • Files Index: Stores file details like name, size, and status.
  • Column Stats Index: Contains statistics of specific columns, aiding in data skipping and speeding up queries.
  • Bloom Filter Index: Houses bloom filters of data files for more efficient data lookups.
  • Record Index: Maps record keys to locations for rapid retrieval, introduced in Hudi 0.14.0.

Hudi uses HFile, a file format originating in HBase, to store this metadata.

Base and Log Files: The Core Content

In Hudi’s world, there are two main types of data files:

  • Base Files: These are the original data files written in Parquet or ORC.
  • Log Files: These are files that track changes to the data in the base file to be reconciled on read.

Naming Conventions

The way Hudi names these files is:

  • Base Files: Named as [File ID]_[Writing Version]_[Creation Time].[Type]
  • Log Files: Named as [Base File ID]_[Base Creation Time].[Log Type].[Log Version]_[Writing Version]

The Timeline Mechanics

Hudi loves order. Every action or change made to a table is recorded in a timeline, allowing you to see the entire history. This timeline also ensures that multiple changes don’t clash.

Actions on this timeline go through stages like:

  • Planning (requested)
  • Doing (inflight)
  • Done (commit)

These steps are captured in the .hoodie folder through files with the naming convention [timestamp].[transaction state (requested/inflight/commit)], and this is how Hudi identifies and reconciles concurrent transactions. If two transactions arrive at the same timestamp, one of the two will see the pending transaction and adjust accordingly.
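The timeline mechanics can be sketched in a few lines of Python. This follows the article's simplified [timestamp].[state] naming (actual Hudi instant file names also embed the action type), and the timestamps are made up for illustration.

```python
# Sketch of reading a Hudi-style timeline from instant file names.
# Any instant not yet at "commit" is a pending writer that a new
# transaction must account for.

def parse_timeline(filenames):
    order = {"requested": 0, "inflight": 1, "commit": 2}
    instants = {}
    for name in filenames:
        ts, _, state = name.partition(".")
        # keep the most advanced state observed for each timestamp
        if order[state] >= order[instants.get(ts, "requested")]:
            instants[ts] = state
    return instants

timeline = parse_timeline([
    "20240617120000000.requested",
    "20240617120000000.inflight",
    "20240617120000000.commit",
    "20240617120000500.requested",
    "20240617120000500.inflight",
])
pending = [ts for ts, state in timeline.items() if state != "commit"]
print(pending)   # ['20240617120000500'] — one writer is still in flight
```

Because the timeline is append-only, this same scan doubles as an audit trail: every action ever taken on the table is recoverable from the instant files.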

Hudi Table Fields

Each table comes with some additional fields that assist in Hudi’s data lookups:

  • User Fields: The fields of the table the user specified when creating the table or updating the schema.
  • Hudi’s Own Fields: Fields created by Hudi to optimize operations, including _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name.

Conclusion

| | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Metadata Approach | 3-Tier Metadata Tree | Logs & Checkpoints | Timeline |
| Data Skipping | Based on Manifest and File Stats | File Stats from Log Files | Based on Column Stats Indexes in Metadata Table |
| File Types | Metadata.json, Manifest Lists, Manifests | Log Files and Checkpoints | Log Files, Metadata Table, Requested/Inflight/Commit Files |

Each format takes a very different approach to maintain metadata for enabling ACID transactions, time travel, and schema evolution in the data lakehouse. Hopefully, this helps you better understand the internal structures of these data lakehouse table formats.
