Dremio Blog

6 minute read · April 20, 2026

Iceberg Row Lineage: Giving Every Row a Paper Trail

Will Martin, Technical Evangelist

Most data teams think about lineage at the table or column level. Which pipeline wrote to this table? Which upstream source feeds this column? Those are useful questions, but they stop short of what actually matters in an audit or incident investigation: which specific rows were affected, by which operation, and when.

Apache Iceberg v3 answers those questions natively through a feature called row lineage, and Dremio supports it fully. Row lineage is built into every Iceberg v3 table automatically.

How Row Lineage Works: Two Metadata Columns That Do a Lot

Row lineage adds two metadata columns to every table. You aren't required to define, configure, or even think about them during normal use. Dremio automatically writes them on every insert, update, and merge for v3 Iceberg tables.

The first column is _row_id. When a row is inserted into an Iceberg v3 table, it receives a unique identifier that stays with it for the rest of its life. If the row is updated ten times, its _row_id never changes. This is the consistent identifier that lets you track a row across its entire history of modifications.

The second column is _last_updated_sequence_number. This records the Iceberg snapshot sequence number of the operation that last touched the row. Combined with Iceberg's snapshot metadata, you can map this number back to a specific operation: which job ran, at what timestamp, and what the table state looked like before and after. Insert a row and the sequence number is set. Update or merge it and the number advances to reflect the new operation.

Together, the two columns tell you both the row's identity and what last happened to it. Neither piece of information is redundant. You need both to answer the questions that actually come up in practice.
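To make the behavior of the two columns concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Iceberg or Dremio API: the ToyTable class, its counters, and the "value" field are all hypothetical names invented for illustration. It only mimics the rule described above, where _row_id is fixed at insert time and _last_updated_sequence_number moves to the sequence number of the latest operation that touched the row.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a v3 table. Hypothetical, not a real API."""
    next_row_id: int = 0       # assumed counter used to hand out new row IDs
    sequence_number: int = 0   # advances once per committed operation
    rows: dict = field(default_factory=dict)  # maps _row_id -> row

    def insert(self, values):
        """Commit one insert operation that adds one row per value."""
        self.sequence_number += 1
        new_ids = []
        for value in values:
            row_id = self.next_row_id
            self.next_row_id += 1
            self.rows[row_id] = {
                "_row_id": row_id,
                "_last_updated_sequence_number": self.sequence_number,
                "value": value,
            }
            new_ids.append(row_id)
        return new_ids

    def update(self, row_id, value):
        """Commit one update operation against an existing row."""
        self.sequence_number += 1
        row = self.rows[row_id]
        row["value"] = value
        # The identity stays the same; only the "last touched" marker moves.
        row["_last_updated_sequence_number"] = self.sequence_number


table = ToyTable()
first, second = table.insert(["a", "b"])
table.update(first, "a2")
table.update(first, "a3")

for row in table.rows.values():
    print(row)
```

After the two updates, the first row still carries the _row_id it was given at insert, while its sequence number reflects the most recent update. The second row, which was never touched again, keeps the sequence number of the original insert.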

  • NOTE: OPTIMIZE TABLE preserves both values when rewriting data files during compaction. Without that guarantee, a maintenance job could silently overwrite lineage metadata, making it unreliable for audit purposes.

Row lineage metadata is stored in the Parquet files themselves, so it's readable by any engine that supports the Iceberg v3 spec, including Apache Spark 4.1 and later. A multi-engine environment where Spark handles ingestion and Dremio handles analytics can query row lineage from either side without any coordination.

Why This Matters More Than Table Lineage

Table-level lineage tells you that pipeline A feeds table B. That's useful for impact analysis and documentation, but it breaks down the moment someone asks a harder question.

Consider a financial services firm running daily reconciliation. A discrepancy shows up in the settlement table. The investigation needs to know which specific rows changed, in which batch, and whether those rows had been modified before that batch ran. Table-level lineage can't answer that, but row lineage can. Query _last_updated_sequence_number for the affected rows, map the sequence numbers back to snapshots, and you have an exact record of which operation wrote each value.
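The mapping step in that investigation can be sketched in a few lines of Python. Everything here is illustrative: the snapshot records, the settlement rows, and the attach_snapshot helper are hypothetical, and the data is made up. In practice the rows would come from a query that selects the two lineage columns, and the snapshot list from the table's snapshot metadata.

```python
# Hypothetical snapshot log, shaped loosely like Iceberg snapshot metadata.
snapshots = [
    {"sequence_number": 7, "snapshot_id": 1001, "operation": "append",
     "committed_at": "2026-04-18T02:00:00Z"},
    {"sequence_number": 8, "snapshot_id": 1002, "operation": "overwrite",
     "committed_at": "2026-04-19T02:00:00Z"},
]

# Hypothetical rows as they might come back from a query that selects
# the two lineage columns alongside a business column.
settlement_rows = [
    {"_row_id": 40, "_last_updated_sequence_number": 7, "amount": 125.00},
    {"_row_id": 41, "_last_updated_sequence_number": 8, "amount": 310.50},
]


def attach_snapshot(rows, snapshots):
    """Pair each row ID with the snapshot whose sequence number last touched it."""
    by_sequence = {s["sequence_number"]: s for s in snapshots}
    paired = []
    for row in rows:
        snapshot = by_sequence.get(row["_last_updated_sequence_number"])
        paired.append((row["_row_id"], snapshot))
    return paired


for row_id, snapshot in attach_snapshot(settlement_rows, snapshots):
    if snapshot is None:
        # No snapshot record was found for this sequence number.
        print(row_id, "no matching snapshot record")
    else:
        print(row_id, snapshot["operation"], snapshot["committed_at"])
```

The lookup is keyed on the sequence number, so each affected row resolves to exactly one operation and commit time, which is the record the reconciliation team needs.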

Or consider a GDPR right-to-erasure request. The compliance team needs to confirm that all rows associated with a specific user have been deleted and haven't been reintroduced by a downstream merge or reprocessing job. With _row_id tracking, you can verify that the identifiers in question no longer exist in any active snapshot.
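A rough sketch of that verification, assuming the _row_id values of the erased rows were recorded when the request was processed. The surviving_row_ids helper and both ID lists are hypothetical; in a real check, the second list would come from querying _row_id on the current table state.

```python
def surviving_row_ids(erased_ids, current_row_ids):
    """Return erased _row_id values that still appear in the current table state."""
    return sorted(set(erased_ids) & set(current_row_ids))


# Hypothetical identifiers recorded when the erasure request was processed.
erased_ids = [12, 57, 903]
# Hypothetical _row_id values read from the table after the delete ran.
current_row_ids = [1, 2, 3, 57, 1200]

leftovers = surviving_row_ids(erased_ids, current_row_ids)
if leftovers:
    print("still present:", leftovers)
else:
    print("all erased rows are gone")
```

This check covers only the identifiers that were recorded. Because a newly inserted row receives a new _row_id, data that a reprocessing job re-inserts would not match the old identifiers, so a check on the user key is likely still needed alongside it.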

Data quality debugging is another common case. When a merge job produces unexpected duplicates or incorrect values in a large table, the usual investigation relies on sampling and guessing. With row lineage, you can directly query which rows were last touched by the suspect operation and scope the investigation to exactly those rows.
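As an illustration of that scoping step, the Python sketch below filters rows down to those last touched by one sequence number and then looks for duplicate business keys within that subset only. The orders data, the order_id column, and both helper functions are hypothetical, and the suspect sequence number is assumed to be known already from the snapshot metadata.

```python
from collections import Counter


def rows_touched_by(rows, sequence_number):
    """Keep only rows whose last modification came from the given operation."""
    return [
        r for r in rows
        if r["_last_updated_sequence_number"] == sequence_number
    ]


def duplicate_keys(rows, key):
    """Return the values of `key` that occur more than once in `rows`."""
    counts = Counter(r[key] for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical query result: lineage columns plus a business key.
orders = [
    {"_row_id": 1, "_last_updated_sequence_number": 4, "order_id": "A-1"},
    {"_row_id": 2, "_last_updated_sequence_number": 9, "order_id": "A-2"},
    {"_row_id": 3, "_last_updated_sequence_number": 9, "order_id": "A-2"},
    {"_row_id": 4, "_last_updated_sequence_number": 9, "order_id": "A-3"},
]

suspect = rows_touched_by(orders, 9)
print(len(suspect), "rows in scope; duplicated keys:",
      duplicate_keys(suspect, "order_id"))
```

Filtering first keeps the duplicate check confined to the rows the suspect operation wrote, rather than scanning or sampling the whole table.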

Getting Started With Row Lineage on Dremio

Row lineage is on by default for every Apache Iceberg v3 table in Dremio Cloud. Creating a v3 table is the only prerequisite. From that point, _row_id and _last_updated_sequence_number are available as queryable columns on every row in the table.

If you want to explore this against your own data, a free Dremio Cloud environment at dremio.com/get-started includes full Iceberg v3 support from day one. Create a v3 table, run a few inserts and updates, and query the lineage columns directly to see how the sequence numbers advance with each operation.
