Dremio Blog

45 minute read · September 1, 2022

The Life of a Write Query for Apache Iceberg Tables

Alex Merced, Head of DevRel, Dremio

This article has been revised and updated from its original version published in 2022 to reflect the latest Apache Iceberg developments, including V3 deletion vectors and optimistic concurrency improvements.

When you execute INSERT INTO orders VALUES (...) or MERGE INTO orders USING staged_orders ON ..., Apache Iceberg orchestrates a careful sequence of operations to maintain ACID guarantees on object storage. This guide walks through every step of the write path, from data file creation to atomic commit, and explains how Iceberg handles concurrent writes safely.

Understanding the write path is critical for designing performant ingestion pipelines. It explains why compaction matters, how Copy-on-Write and Merge-on-Read differ, and what happens when two writers conflict.

Types of Write Operations

Iceberg supports four types of write operations, each with different characteristics (for the authoritative details, refer to the Iceberg table specification):

| Operation | What It Does | Files Created | Complexity |
|---|---|---|---|
| INSERT / INSERT OVERWRITE | Adds new data or replaces partitions | New data files only | Simple |
| DELETE | Removes rows matching a condition | Delete files or rewritten data files | Medium |
| UPDATE | Modifies rows in place | Delete files + new data files (or rewritten files) | Medium |
| MERGE INTO | Upsert: insert/update/delete in one operation | Mix of all file types | Complex |

The first type appends new files only. The last three involve row-level modifications, which trigger one of two strategies.


Step 1: Plan the Write

Before writing any files, the engine plans the operation:

  1. Parse the SQL and determine which partitions will be affected
  2. Read the current metadata to understand the schema, partition spec, sort order, and table properties
  3. Choose the write mode: Copy-on-Write or Merge-on-Read (set via the table properties write.delete.mode, write.update.mode, and write.merge.mode)
  4. For row-level operations: Identify which existing data files contain rows that match the condition
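The mode lookup in step 3 can be sketched in a few lines. This is an illustrative simplification, not engine code; the property names are real Iceberg table properties, but the `resolve_write_mode` helper is hypothetical:

```python
# Sketch: resolving COW vs. MOR for a row-level operation from table
# properties. Iceberg defines separate properties per operation type.
DEFAULT_MODE = "copy-on-write"  # used when no mode property is set

def resolve_write_mode(table_properties: dict, operation: str) -> str:
    """operation is one of 'delete', 'update', or 'merge'."""
    return table_properties.get(f"write.{operation}.mode", DEFAULT_MODE)

props = {"write.delete.mode": "merge-on-read"}
print(resolve_write_mode(props, "delete"))  # merge-on-read
print(resolve_write_mode(props, "update"))  # falls back to copy-on-write
```

Note that each operation type resolves independently, so a table can use MOR for deletes while keeping COW for merges.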

Write Mode Selection

The mode determines how row-level changes are applied:

Copy-on-Write (COW)

  • Reads affected data files entirely
  • Applies changes in memory
  • Writes completely new data files with the changes applied
  • Deletes the old files from the manifest
  • Best for: Read-heavy tables with infrequent updates

Merge-on-Read (MOR)

  • Does NOT rewrite existing data files
  • Writes small delete files (positional or equality) that mark rows as deleted
  • Writes new data files for inserted/updated rows
  • At read time, the engine merges data files with delete files
  • Best for: Write-heavy tables, streaming ingestion, frequent updates

V3 Deletion Vectors

  • Evolution of MOR: stores deleted row positions as compressed bitmaps (deletion vectors) in Puffin files, with at most one vector per data file, instead of many small positional delete files
  • Faster writes (one compact bitmap instead of many delete files) AND faster reads (a bitmap lookup instead of joining delete files)
  • Best of both worlds for most workloads
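The read-time merge that MOR (and deletion vectors) rely on reduces to a position filter. A minimal sketch, assuming the deleted positions for one data file have already been loaded into a set (standing in for a delete file or a bitmap); the `merge_on_read` function is hypothetical:

```python
def merge_on_read(data_rows, deleted_positions):
    """Return the live rows of one data file, skipping every position
    that a positional delete file (or deletion-vector bitmap) marks
    as deleted. The data file itself is never rewritten."""
    return [row for pos, row in enumerate(data_rows) if pos not in deleted_positions]

rows = ["order-a", "order-b", "order-c", "order-d"]
print(merge_on_read(rows, {1, 3}))  # ['order-a', 'order-c']
```

The cost of this filter at read time is exactly what compaction (or deletion vectors) exists to keep small.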

For an in-depth comparison, see Copy-on-Write vs. Merge-on-Read in Apache Iceberg.

Step 2: Write Data Files to Object Storage

The engine serializes rows into data files (typically Parquet) and uploads them to object storage.

Parquet File Construction

Each data file is written with:

  • Row groups of 64-128MB each
  • Column-level statistics (min/max, null count, distinct count) embedded in the Parquet footer
  • Encoding and compression (typically Snappy or ZSTD)
  • Page-level bloom filters (optional, for equality predicates)
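The footer statistics above are what later enable file pruning. A toy sketch of how per-column stats might be computed for a batch of rows before writing (the `column_stats` helper is hypothetical; real Parquet writers do this per row group):

```python
def column_stats(rows, column):
    """Footer-style stats for one column: min/max over non-null values
    plus a null count, as embedded in the Parquet footer."""
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

batch = [{"region": "US"}, {"region": "EU"}, {"region": None}]
print(column_stats(batch, "region"))
# {'null_count': 1, 'min': 'EU', 'max': 'US'}
```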

File Size Targeting

The engine targets a configurable file size (default 512MB for most engines, configurable via write.target-file-size-bytes). This is important because:

  • Too-small files create metadata overhead and reduce read-path pruning effectiveness
  • Too-large files increase the blast radius of rewrites and slow down individual file reads

When small files accumulate from streaming ingestion, compaction consolidates them.

Sort Order

If the table has a defined sort order, the engine sorts data within each file by the specified columns. This creates tight min/max bounds that dramatically improve partition and file pruning. For multi-column filter patterns, Z-ordering interleaves sort dimensions.

Partitioned Writes

If the table is partitioned (e.g., month(order_date)), the engine evaluates the partition transform for each row and routes it to the appropriate partition. Each partition gets its own data file(s).
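Partition routing can be sketched as evaluating the transform per row and grouping. This is a simplified illustration of the month() transform (real Iceberg transforms operate on internal representations; `route_rows` and the string partition value are assumptions for this sketch):

```python
from collections import defaultdict
from datetime import date

def month_transform(d: date) -> str:
    """Hidden-partition transform roughly equivalent to month(order_date)."""
    return f"{d.year:04d}-{d.month:02d}"

def route_rows(rows):
    """Group rows by computed partition value; each group becomes one
    or more data files under that partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[month_transform(row["order_date"])].append(row)
    return dict(partitions)

orders = [
    {"id": 1, "order_date": date(2024, 3, 5)},
    {"id": 2, "order_date": date(2024, 3, 20)},
    {"id": 3, "order_date": date(2024, 4, 1)},
]
print(sorted(route_rows(orders)))  # ['2024-03', '2024-04']
```

Because the transform is evaluated by the engine, users never write the partition column explicitly; this is Iceberg's hidden partitioning.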

Step 3: Create New Manifest Files

After data files are written, the engine creates new manifest files (Avro format) that describe them:

Manifest Entry:
  status: ADDED
  file_path: s3://warehouse/db/orders/data/00042.parquet
  partition_data: {order_month: "2024-03"}
  record_count: 125000
  file_size_in_bytes: 268435456
  column_sizes: {1: 45000000, 2: 12000000...}
  value_counts: {1: 125000, 2: 125000...}
  null_value_counts: {1: 0, 2: 12...}
  lower_bounds: {1: "US", 2: "2024-03-01"...}
  upper_bounds: {1: "US", 2: "2024-03-31"...}

Each manifest file also records:

  • Snapshot ID: Which snapshot this manifest was created for
  • Sequence number: Ordering for conflict resolution
  • Partition summaries: Aggregated partition bounds across all entries

Manifest Reuse

Iceberg reuses manifests from previous snapshots for files that haven't changed. If a write adds 5 new files to a table with 10,000 existing files, only the new files get a new manifest. The other 10,000 files are referenced through existing manifests. This is why Iceberg commits are O(number of changed files), not O(total files).
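The reuse behavior can be shown with a toy sketch: old manifests are carried forward by reference, and only the new files get a freshly written manifest (the `build_manifest_list` helper and dict shapes are hypothetical simplifications):

```python
def build_manifest_list(previous_manifests, new_data_files):
    """Commit-cost sketch: untouched manifests are reused as-is; only
    newly added files are described in one new manifest. The work done
    is proportional to the number of changed files, not table size."""
    new_manifest = {
        "entries": [{"status": "ADDED", "file_path": p} for p in new_data_files]
    }
    return previous_manifests + [new_manifest]

old = [{"entries": [{"status": "EXISTING", "file_path": f"f{i}.parquet"}]}
       for i in range(3)]
result = build_manifest_list(old, ["new1.parquet", "new2.parquet"])
print(len(result))  # 4: three reused manifests plus one new one
```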

Step 4: Create a New Manifest List (Snapshot)

The engine writes a new manifest list (also Avro) that references all manifests, both new ones and reused ones from the previous snapshot:

Manifest List:
  snapshot_id: 987654322
  parent_snapshot_id: 987654321
  manifests:
    - manifest-001.avro (EXISTING, from previous snapshot)
    - manifest-002.avro (EXISTING, from previous snapshot)
    - manifest-new.avro (ADDED, for this commit)

This manifest list IS the new snapshot. It's a complete description of the table at this point in time: every data file and delete file. This immutability is what makes time travel and rollback possible.

Step 5: Atomic Commit via Optimistic Concurrency

This is the critical step that provides ACID guarantees. The engine must atomically update the metadata pointer from the old snapshot to the new one.

The Commit Protocol

  1. Read the current metadata file version
  2. Build the new metadata file referencing the new snapshot
  3. Attempt an atomic swap:
    • File-system tables: Atomic rename of the metadata file (v42.metadata.json → v43.metadata.json)
    • Catalog: Compare-and-swap API call (REST Catalog, Glue, Nessie)
  4. If successful: Commit is done. Readers immediately see the new snapshot.
  5. If failed (someone else committed first): Retry from step 1
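The protocol above can be sketched as a compare-and-swap loop. This is a simplified illustration, not Iceberg client code; the `Catalog` class and `commit` helper are hypothetical, with the metadata version standing in for the metadata file pointer:

```python
class Catalog:
    """Toy catalog holding the current metadata version, exposing a
    compare-and-swap primitive like a REST/Glue/Nessie catalog does."""
    def __init__(self):
        self.version = 0

    def compare_and_swap(self, expected: int, new: int) -> bool:
        if self.version != expected:
            return False  # another writer committed first
        self.version = new
        return True

def commit(catalog: Catalog, max_retries: int = 3) -> int:
    for _ in range(max_retries):
        current = catalog.version               # step 1: read current metadata
        new = current + 1                       # step 2: build new metadata
        if catalog.compare_and_swap(current, new):
            return new                          # step 4: commit done
        # step 5: lost the race; loop back and rebuild on the new base
    raise RuntimeError("commit failed after retries")

cat = Catalog()
print(commit(cat))  # 1
```

A lost race is not an error here: the writer simply rereads the catalog and rebuilds its metadata on top of the winner's snapshot.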

Conflict Detection and Retry

When two writers commit concurrently, the second writer's atomic swap fails. Iceberg then:

  1. Reads the other writer's new metadata
  2. Checks for conflicts:
    • No conflict if the changes don't overlap (e.g., writing to different partitions)
    • Conflict if both writers modified the same files
  3. If no conflict, rebases the changes and retries the commit
  4. If conflict, fails the operation

This is optimistic concurrency: it assumes conflicts are rare and checks only at commit time. For independent partitions (e.g., two ETL jobs writing to different dates), both succeed without coordination.
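The conflict check itself reduces to a set intersection over the files each commit touched. A hedged sketch (the `validate_concurrent_commit` function and its return values are assumptions, not Iceberg's actual validation API):

```python
def validate_concurrent_commit(my_touched_files, their_touched_files):
    """Sketch of commit-time validation: disjoint file sets (e.g.,
    writes to different partitions) can be rebased and retried;
    overlapping file sets must fail."""
    if set(my_touched_files) & set(their_touched_files):
        return "fail"
    return "rebase-and-retry"

print(validate_concurrent_commit(
    ["2024-03/a.parquet"], ["2024-04/b.parquet"]))  # rebase-and-retry
```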

Multi-Table Transactions with Nessie

Project Nessie extends this model to support atomic commits across multiple Iceberg tables. A single Git-like commit can create a new snapshot for several tables simultaneously, essential for maintaining referential integrity in star schemas.

Write Path by Operation Type

INSERT

The simplest path: write new data files, a new manifest, a new manifest list, then atomically commit. No reading of existing files is required.

DELETE (Merge-on-Read)

  1. Identify data files that might contain matching rows (using manifest statistics)
  2. Read those files and find the matching row positions
  3. Write delete files recording the positions
  4. Commit with both existing data files and new delete files
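Steps 1-3 above can be sketched as a scan that records (file, position) pairs without touching the data files. A toy illustration (the `build_positional_deletes` helper and dict shapes are hypothetical; real delete files are Parquet/Avro with a defined schema):

```python
def build_positional_deletes(data_files, predicate):
    """MOR DELETE sketch: scan candidate files and record the
    (file_path, position) of every matching row. Existing data
    files are left untouched; only these entries are written out."""
    entries = []
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if predicate(row):
                entries.append({"file_path": path, "pos": pos})
    return entries

files = {
    "f1.parquet": [{"id": 1}, {"id": 2}],
    "f2.parquet": [{"id": 3}],
}
print(build_positional_deletes(files, lambda r: r["id"] in (2, 3)))
```

In practice, manifest statistics prune most files before this scan ever runs, so only files whose min/max bounds overlap the predicate are read.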

DELETE (Copy-on-Write)

  1. Identify and read affected data files
  2. Filter out deleted rows
  3. Write new data files WITHOUT the deleted rows
  4. Commit: new manifest replaces old file entries with new file entries

MERGE INTO

The most complex operation: it combines DELETE, UPDATE, and INSERT in a single transaction:

  1. Join the source data (staged_orders) with the target (orders) on the match condition
  2. For matched rows: apply UPDATE or DELETE
  3. For unmatched rows: apply INSERT
  4. Write all resulting files (data + delete files)
  5. Single atomic commit for the entire operation
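The matched/unmatched branching above can be sketched as a keyed join in memory. This toy `merge_into` is a hypothetical illustration of the semantics, not engine code (real engines do this as a distributed join producing data and delete files):

```python
def merge_into(target_rows, source_rows, key="id"):
    """Toy MERGE INTO: matched source rows update the target row
    (WHEN MATCHED ... UPDATE); unmatched source rows are inserted
    (WHEN NOT MATCHED ... INSERT)."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
staged = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
print(merge_into(target, staged))
# [{'id': 1, 'qty': 5}, {'id': 2, 'qty': 9}, {'id': 3, 'qty': 1}]
```

However the rows are produced, the key property is step 5: one atomic commit makes all resulting files visible together or not at all.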

Performance Tuning the Write Path

Target File Size

Set write.target-file-size-bytes to 256MB-512MB for batch workloads. For streaming, smaller initial files are acceptable if you run compaction regularly.

Choose the Right Write Mode

| Workload | Recommended Mode |
|---|---|
| Nightly batch with dashboards | Copy-on-Write |
| Streaming ingestion with frequent queries | Merge-on-Read + scheduled compaction |
| GDPR deletion requests | MOR for immediate compliance, COW compaction for permanent erasure |
| CDC from databases | MOR with Deletion Vectors (V3) |

Minimize Small Files

Each commit creates at least one new file. High-frequency micro-batch commits (every few seconds) can create thousands of small files per hour. Mitigate by:

  • Batching writes into larger intervals
  • Using Flink with checkpoint intervals of 1-5 minutes
  • Running automated compaction

Dremio Write Optimizations

Dremio's write path includes:

  • Automatic sort by partition + sort order
  • Automatic compaction for managed tables
  • Concurrent write support with cross-engine coordination through the catalog

To understand the complementary read path, see The Life of a Read Query for Apache Iceberg Tables.

How Iceberg Writes Compare to Traditional Data Lake Writes

Traditional data lake writes lack the guarantees that Iceberg provides:

| Aspect | Traditional Data Lake | Apache Iceberg |
|---|---|---|
| Atomicity | Partial writes visible during execution | Atomic snapshot swap, all or nothing |
| Concurrent writers | Can corrupt data or produce duplicates | Optimistic concurrency with automatic retry |
| Schema enforcement | None, mismatched schemas written silently | Schema validated at write time |
| Rollback | Manual file deletion (error-prone) | Metadata-only rollback to any snapshot |
| Audit trail | No write history | Full snapshot log with timestamps and metadata |
| Consistency | Dirty reads of in-progress writes possible | Snapshot isolation guaranteed |

The difference is most visible during failures. In a traditional data lake, a failed write mid-execution leaves partial files on storage, corrupting the table until someone manually cleans up. In Iceberg, the atomic commit never fires, so the table remains in its previous consistent state. No cleanup required.

Common Write Path Anti-Patterns

Anti-Pattern 1: Frequent Micro-Commits

Committing every few seconds creates thousands of small files and snapshots. Each commit has fixed overhead (manifest creation, atomic swap), so amortizing that cost over larger batches is always more efficient.

Fix: Batch writes into larger intervals. For Flink, set checkpoint intervals to 1-5 minutes. For Spark Structured Streaming, use trigger intervals of at least 30 seconds.

Anti-Pattern 2: Using COW for High-Update Tables

Copy-on-Write rewrites entire data files for every UPDATE or DELETE. On a table with frequent row-level changes (CDC ingestion, GDPR deletions), this means rewriting gigabytes of data for a few changed rows.

Fix: Switch to Merge-on-Read or V3 deletion vectors. Run periodic compaction to resolve accumulated delete files.

Anti-Pattern 3: Ignoring Sort Order

Writing unsorted data produces files with wide min/max column statistics. Downstream read queries can't prune files effectively, negating Iceberg's performance advantages.

Fix: Define a sort order on the table and ensure it's applied during writes or compaction.

Anti-Pattern 4: Skipping Partition Strategy

Writing to an unpartitioned table forces the engine to manage all data in a single partition group. This creates massive manifest files, prevents partition pruning, and makes every query a full table scan at the manifest level.

Fix: Add hidden partitions based on your most common query filter columns.

Failure Recovery and Idempotency

What Happens When a Write Fails?

  1. Engine crashes during data file writes: Partially written files become orphans on storage. The table is unaffected because no commit happened. Clean up with remove_orphan_files.
  2. Engine crashes after data files written, before commit: Same as above. Data files exist but no manifest references them. Table remains consistent.
  3. Commit fails due to conflict: The engine retries the commit with a rebased metadata file. If the conflict is unresolvable (same files modified by both writers), the operation fails and the user must retry.
  4. Object storage outage during commit: The atomic swap fails, and the table stays at the previous snapshot. Data files become orphans.

In all cases, the table never enters an inconsistent state. This is fundamentally different from traditional data lakes where any of these failures can corrupt the table.

Frequently Asked Questions

What happens when two writers try to commit at the same time?

Iceberg uses optimistic concurrency control. Both writers prepare their changes independently, then attempt to commit by updating the metadata pointer. The first writer succeeds. The second writer detects a conflict, re-reads the updated metadata, validates whether its changes are still compatible (no overlapping files modified), and retries if possible. If the changes conflict, the second writer raises an error.

Does every write create a new snapshot?

Yes. Every commit (insert, delete, update, compaction) produces a new snapshot with a unique snapshot ID. This is what enables time travel and rollback. Over time, snapshots accumulate and should be periodically expired to reclaim metadata space.

How does Iceberg ensure data consistency during writes?

Consistency comes from atomic metadata pointer swaps. The catalog maintains a single pointer to the current metadata file. A writer creates all new data files and metadata first, then atomically updates this pointer to reference the new metadata. If the write fails partway through, the pointer is never updated, so readers never see partial data. Orphaned files from failed writes are cleaned up later.



Legacy Content

Apache Iceberg is an open data lakehouse table format that provides your data lake with critical features like time travel, ACID transactions, partition evolution, schema evolution, and more. You can read this article or watch this video to learn more about the architecture of Apache Iceberg. In this article, we’d like to explain how write queries work for Apache Iceberg.

In order to understand how write queries work we first need to understand the metadata structure of Apache Iceberg, and then examine step by step how that structure is used by query engines to plan and execute the query.

Apache Iceberg 101

Apache Iceberg has a tiered metadata structure and it’s key to how Iceberg provides high-speed queries for both reads and writes. Let’s summarize the structure of Apache Iceberg to see how this works. If you are already familiar with Iceberg’s architecture then feel free to skip ahead to the section titled “How a Query Engine Processes the Query.”

Data Layer

Starting from the bottom of the diagram, the data layer holds the actual data in the table. It’s made up of two types of files: 

Data files – store the data in file formats such as Parquet or ORC. 

Delete files – track records that still exist in the data files but that should be considered deleted.

Metadata Layer

Apache Iceberg uses three tiers of metadata files which cover three different scopes.

Manifest files – track a subset of the data layer within a snapshot. Each manifest lists the individual files in that subset along with metadata used for further pruning.

Manifest lists – define a snapshot of the table. Each lists all the manifest files that make up that snapshot, with metadata on those manifest files for pruning.

Metadata files – define the table, and track manifest lists, current and previous snapshots, schemas, and partition schemes.

The Catalog

Tracks a reference/pointer to the current metadata file. This is usually a store that can provide transactional guarantees, such as a relational database (Postgres, etc.) or a metastore (Hive, Project Nessie, Glue).

How a Query Engine Processes the Query

So let’s run an INSERT, DELETE, and MERGE INTO against a table called orders to see the behavior.

Running an Insert Query

INSERT INTO orders VALUES (...);

1. Sending the query to the engine

The query is submitted to the query engine that parses the query. The engine now needs the table data to begin planning the query.

2. Checking the catalog and planning

The engine first reaches out to the catalog to find the current metadata file’s path and reads that metadata file. Even though it’s just appending data to the table, it does this to determine two things:

  1. The table's schema to make sure the data it’s going to write fits that schema, as well as which fields can versus can’t have null values.
  2. The partitioning of the table to determine how to organize the data to be written. 

3. Writing the data files

Since no existing data files will be affected because it’s an INSERT, we can begin writing new data files. At this point, all new records will be written into files based on the partitioning scheme of the table. If the table has a configured sort order and the engine writing the data supports it, then the records will be sorted prior to writing the records to data files.

4. Writing the metadata files

Now that data files have been written, we will group these files into manifest files. For each manifest file, we’ll include the file paths to the data files it tracks and statistics about each data file, such as minimum and maximum values for each column in the data file. We’ll then write these manifest files out to the data lake.

Then, we will include all new or existing manifests associated with this snapshot in a new manifest list. In the manifest list, we’ll include the file paths to the manifest files and statistics about the write operation, such as how many data files it added, and information about the data referenced by the manifest files, such as the lower and upper bounds of their values for partition columns. We’ll then write this manifest list out to the data lake.

Finally, we will create a new metadata file to summarize the table as it stands based on the metadata file that was current before our INSERT, including the file path and details of the manifest list we just wrote as well as a reference to note that this manifest list is now the current manifest list/snapshot. 

Now we just need to commit this to the catalog.

5. Committing the changes

The engine goes to the catalog again to ensure that no new snapshots appeared while the files were being written. This check prevents concurrent writers from overwriting each other’s changes. Concurrent writes remain safe: the writer that completes first gets committed, and any other writer whose changes conflict with the first writer’s goes back to step 3 or 4 and reattempts until it succeeds or gives up and fails.

Readers will always plan based on the current metadata file, so they won’t be affected by concurrent writes that have not yet committed.

Running a Delete Query

DELETE FROM orders 
WHERE order_ts BETWEEN '2022-05-01 10:00:00' AND '2022-05-31 10:00:00';

1. Sending the query to the engine

The query is submitted to the query engine that parses the query, and the engine now needs the table data to begin planning the query.

2. Checking the catalog and planning

The engine first reaches out to the catalog to find the current metadata file. It does this to determine a few things:

  1. The partitioning of the table to determine how to organize the data to be written. 
  2. The current sequence number of the table and what this transaction’s sequence number will be. It is assumed that no other writes that conflict with this DELETE will complete before this job is done, since Iceberg leverages optimistic concurrency control (assume everything is good till it isn’t). 

3. Writing the files

How the files are written is determined by whether the table has the delete strategy set to either “copy-on-write” (COW) or “merge-on-read” (MOR).

If the table is set to copy-on-write, it will go through the following process: 

  1. Use the metadata to identify which files contain the data to be deleted.
  2. Read these files in their entirety to determine what is deleted.
  3. Write a new file to replace this file in the new snapshot, where this new file doesn’t have the data requested to be deleted.

If the table is set to merge-on-read, the process avoids rewriting data files to speed up the write, but the specific steps depend on the type of delete file the table is configured to use to track deleted records.

Position deletes

  1. It will scan the metadata to determine which data files have records that will be deleted.
  2. It will scan the data files to determine the positions of the deleted records.
  3. It will write a delete file for the partition that lists which records by position are deleted from which files.

Equality deletes

  1. It will not need to read any files; it will just write a delete file listing the values that specify which rows are to be deleted.
  2. During reads, the reader has to reconcile the delete file against all records in the partition, which may be expensive.
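That read-time reconciliation can be sketched as a value-based filter. A hedged illustration (the `apply_equality_deletes` helper and dict shapes are hypothetical; real equality delete files carry a schema of equality field IDs):

```python
def apply_equality_deletes(rows, delete_entries, columns):
    """Read-time reconciliation for equality deletes: drop any row
    whose values in `columns` match an entry in the delete file.
    Every row in the partition must be checked, hence the cost."""
    deleted = {tuple(entry[c] for c in columns) for entry in delete_entries}
    return [r for r in rows if tuple(r[c] for c in columns) not in deleted]

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
print(apply_equality_deletes(rows, [{"id": 2}], ["id"]))
# [{'id': 1, 'v': 'a'}, {'id': 3, 'v': 'c'}]
```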

Which type of delete file is written is determined by the table settings. The original data file will still be used in the snapshot, and the engine will reconcile the delete file in future reads of that snapshot. Read this article on COW vs. MOR to understand what situations fit each strategy best.

4. Writing the metadata files

Now that data files/delete files have been written, we will group these files into manifest files. For each manifest file, we’ll include the file paths to the data files it tracks and statistics about each data file, such as minimum and maximum values for each column in the data file. We’ll then write these manifest files out to the data lake.

Then, we will include all new or existing manifests associated with this snapshot into a new manifest list. In the manifest list, we’ll include the file paths to the manifest files and statistics about the write operation, such as how many data files it added, and about the data referenced by the manifest files, such as the lower and upper bounds of their values for partition columns. We’ll then write this manifest list out to the data lake.

Finally, we will create a new metadata file to summarize the table as it stands based on the metadata file that was current before our DELETE, including the file path and details of the manifest list we just wrote as well as a reference to note that this manifest list is now the current manifest list/snapshot. 

Now we just need to commit this to the catalog.

5. Committing the changes

The engine goes to the catalog again to ensure that no new snapshots appeared while the files were being written. This check prevents concurrent writers from overwriting each other’s changes. Concurrent writes remain safe: the writer that completes first gets committed, and any other writer whose changes conflict with the first writer’s reattempts until it succeeds or gives up and fails.

Readers will always plan based on the current metadata file, so they won’t be affected by concurrent writes that have not yet committed.

Running an Upsert/Merge Query

MERGE INTO orders o
USING (SELECT * FROM orders_staging) s
ON o.id = s.id
WHEN MATCHED …
WHEN NOT MATCHED

1. Sending the query to the engine

The query is sent to the query engine, which parses it; the engine then needs the table data for both the target and staging tables to begin planning the query.

2. Checking the catalog and planning

The engine first reaches out to the catalog to find the current metadata file. It does this to determine a few things:

  1. The table's schema to make sure the data it’s going to write fits that schema, as well as which fields can versus can’t have null values.
  2. The partitioning of the table to determine how to organize the data to be written. 
  3. The current sequence number of the table and what this transaction’s sequence number will be. It is assumed that no other writes that conflict with this MERGE will complete before this job is done, since Iceberg leverages optimistic concurrency control (assume everything is good till it isn’t). 

3. Writing the files

At this point, the engine will scan and load the relevant data from both the target and staging tables into memory and begin the matching and updating process. It will go through each row in the staging table to see if there is a match in the target table (via the id field, since that’s the match condition we specified). What is tracked in memory as matches are identified depends on the configured write mode for table merges: copy-on-write or merge-on-read.

If copy-on-write, any data files where at least one record was updated will be rewritten to include updates and new records in the same partition. This means as updated files are read, the entire file must be saved to memory for the eventual writing of a new file.

If merge-on-read, delete files will be generated so the older versions of records will be ignored on future reads, and new data files will be written with only the newly updated or inserted records. Only the records to be updated are tracked in memory, which can be far less memory-intensive over the entire transaction.

If there is a match, it will carry out any instructions inside the WHEN MATCHED clause typically to update certain fields.

If no match is found, it will carry out any instructions inside the WHEN NOT MATCHED clause, typically to insert the record from the stage table into the target table.

4. Writing the metadata files

Now that data files/delete files have been written, we will group these files into manifest files. For each manifest file, we’ll include the file paths to the data files it tracks and statistics about each data file, such as minimum and maximum values for each column in the data file. We’ll then write these manifest files out to the data lake.

Then, we will include all new or existing manifests associated with this snapshot into a new manifest list. In the manifest list, we’ll include the file paths to the manifest files and statistics about the write operation, such as how many data files it added, and information about the data referenced by the manifest files, such as the lower and upper bounds of their values for partition columns. We’ll then write this manifest list out to the data lake.

Finally, we will create a new metadata file to summarize the table as it stands based on the metadata file that was current before our MERGE, including the file path and details of the manifest list we just wrote as well as a reference to note that this manifest list is now the current manifest list/snapshot. 

Now we just need to commit this to the catalog.

5. Committing the changes

The engine goes to the catalog again to ensure that no new snapshots appeared while the files were being written. This check prevents concurrent writers from overwriting each other’s changes. Concurrent writes remain safe: the writer that completes first gets committed, and any other writer whose changes conflict with the first writer’s reattempts until it succeeds or gives up and fails.

Readers will always plan based on the current metadata file, so they won’t be affected by concurrent writes that have not yet committed.

Conclusion

By using optimistic concurrency and snapshot isolation, Apache Iceberg ensures ACID guarantees, so you can write data to a table without affecting concurrent readers or writers. If you are considering adopting Apache Iceberg in your data lakehouse, check out this tutorial on how to create Apache Iceberg tables using AWS Glue and Dremio.
