For all the many improvements data lakehouses bring to analytics, there's one uncomfortable trade-off: deleting rows is expensive. In a system built around immutable Parquet files, a delete is actually a rewrite. You read the file, filter out the rows you don't want, and write a new file. At scale those I/O costs mount up fast.
Apache Iceberg v2 offered an improvement with merge-on-read semantics: instead of rewriting a data file immediately, you write a separate "position delete file" that records which rows to treat as deleted. Reads then apply those deletes on the fly. The problem is that as position delete files accumulate, reads get progressively slower: each query has to join the delete files against the data files to figure out which rows are actually live. On a busy, frequently queried table, that overhead adds up.
Iceberg v3 replaces position delete files with deletion vectors, and it's a meaningful step forward. Dremio supports deletion vectors fully on v3 tables. Here's what has changed and why it matters.
How Deletion Vectors Work
A deletion vector is a bitmap stored in a Puffin file alongside a data file, with a direct 1:1 mapping between them. Each row position in the data file has a corresponding bit in the bitmap. When a row is deleted, its bit is set. It's that simple.
During a read, Dremio applies the deletion vector as a bitmask over the data file. Rows flagged in the bitmap are excluded from the result. There's no join with a separate delete file, no path matching, and no cross-file lookups. The bitmap is more compact, the mapping is direct, and the read overhead is minimal compared to what v2 position delete files require. Our testing shows a read performance improvement of 50-80% with deletion vectors when compared to positional deletes.
When you run a DELETE, UPDATE, or MERGE on a v3 table, Dremio writes or updates the deletion vector for the affected data file rather than producing a new delete file or immediately rewriting the data. The operation completes quickly and the data file isn't touched.
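As a concrete sketch, here is what those operations look like on a v3 table (the table and column names below are illustrative, not from any real schema):

```sql
-- Hypothetical v3 table: the DELETE marks the matching rows in a
-- deletion vector rather than rewriting the underlying Parquet files.
DELETE FROM sales.orders
WHERE order_status = 'CANCELLED';

-- An UPDATE behaves the same way under merge-on-read: old row
-- versions are marked deleted and new versions are appended.
UPDATE sales.orders
SET priority = 'LOW'
WHERE order_date < DATE '2024-01-01';
```

Either statement returns quickly because the existing data files are left untouched; only the deletion vectors and new appends are written.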
However, just like position deletes, deletion vectors accumulate over time as more rows are marked deleted. Dremio's OPTIMIZE TABLE command handles this by rewriting data files to produce clean Parquet files that incorporate the deletions, removing the vectors in the process. After a compaction run, affected data files have no deletion overhead at all. Running OPTIMIZE on a regular schedule is best practice to keep read performance steady as your table evolves.
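A periodic compaction is a single statement (table name illustrative):

```sql
-- Rewrites data files, folding the deletion vectors into clean
-- Parquet files; the vectors are dropped in the process.
OPTIMIZE TABLE sales.orders;
```

Scheduling this on a cadence that matches your delete rate keeps read overhead flat rather than letting it build between ad hoc cleanups.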

The Upgrade Path From v2 Tables
If you're moving existing Iceberg v2 tables to v3, deletion vectors don't require any migration of your delete history. Dremio can still read v2 position delete files on upgraded tables. The first merge-on-read operation after the upgrade, whether that's a DELETE, UPDATE, or MERGE, converts existing position deletes into deletion vectors automatically. From that point, all new deletes use the v3 format.
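The upgrade itself is a table-property change. A sketch in generic Iceberg SQL follows; the property name comes from the Iceberg spec, but the exact DDL syntax varies by engine, so check your engine's documentation:

```sql
-- Bump the table's Iceberg spec version to v3.
ALTER TABLE sales.orders
SET TBLPROPERTIES ('format-version' = '3');

-- The next merge-on-read write converts any existing v2 position
-- delete files into v3 deletion vectors automatically.
DELETE FROM sales.orders
WHERE order_id = 1001;
```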
Where Deletion Vectors Make a Real Difference
The most direct beneficiaries are workloads that delete or update rows frequently without wanting to pay full rewrite costs every time.
GDPR and right-to-erasure workflows are the obvious compliance case. When a data subject requests deletion, the operation needs to be fast and auditable. With deletion vectors, marking rows deleted is a bitmap write rather than a file rewrite. You can process erasure requests at high frequency without the I/O cost of rewriting data files on each request. Compaction can then run later, at a time that suits your workflows, to physically remove the deleted data.
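An erasure request reduces to an ordinary DELETE (table, column, and ID below are hypothetical):

```sql
-- Right-to-erasure request for one data subject: a deletion-vector
-- write, not a data file rewrite, so it completes quickly.
DELETE FROM crm.customers
WHERE customer_id = 12345;
```

The snapshot history in Iceberg metadata gives you the audit trail of when the delete was applied, and the scheduled compaction pass later removes the rows physically.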
Change data capture pipelines that land CDC events as upserts are another strong fit. A CDC stream typically produces a mix of inserts, updates, and deletes as upstream records change. On a v2 table, frequent deletes and updates drive accumulation of position delete files that degrade read performance over time. On a v3 table with deletion vectors, the overhead is lower and more predictable, and compaction is the single lever that keeps it in check.
High-frequency MERGE operations, common in slowly changing dimension tables and deduplication pipelines, also see a meaningful improvement. Merge-on-read with v3 deletion vectors is faster than the equivalent on v2 position delete files, which means you can run merges more aggressively without degrading downstream query performance.
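A typical deduplicating upsert looks like this (all names illustrative); matched rows are marked via the deletion vector and their replacements appended:

```sql
MERGE INTO dw.dim_customer t
USING staging.customer_updates s
  ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);
```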
Getting Started
Deletion vectors are available on Iceberg v3 tables in Dremio Cloud today. Create a v3 table, run your normal DELETE, UPDATE, or MERGE operations, and the deletion vectors are handled automatically. Then schedule OPTIMIZE TABLE to run periodically so read overhead doesn't accumulate.
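Creating a v3 table is the only v3-specific step. A sketch in generic Iceberg SQL (the format-version property is from the Iceberg spec; DDL details vary by engine):

```sql
-- Hypothetical v3 table; everything after creation is standard DML.
CREATE TABLE sales.orders (
  order_id      BIGINT,
  order_status  VARCHAR,
  order_date    DATE
)
TBLPROPERTIES ('format-version' = '3');
```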
If you want to test this against your own workload, a free Dremio Cloud environment at dremio.com/get-started has full Iceberg v3 support from the start. I'd recommend running a before-and-after comparison on a delete-heavy table to see for yourself how the read overhead compares.