Most data teams think about lineage at the table or column level. Which pipeline wrote to this table? Which upstream source feeds this column? Those are useful questions, but they stop short of what actually matters in an audit or incident investigation: which specific rows were affected, by which operation, and when.
Apache Iceberg v3 answers that question natively, and Dremio supports it fully. It's called Row Lineage, and it's built into every Iceberg v3 table automatically.
How Row Lineage Works: Two Metadata Columns That Do a Lot
Row lineage adds two metadata columns to every table. You aren't required to define, configure, or even think about them during normal use. Dremio automatically writes them on every insert, update, and merge for v3 Iceberg tables.
The first column is _row_id. When a row is inserted into an Iceberg v3 table, it receives a unique identifier that stays with it for the rest of its life. If the row is updated ten times, its _row_id never changes. This is the consistent identifier that lets you track a row across its entire history of modifications.
The second column is _last_updated_sequence_number. This records the Iceberg snapshot sequence number of the operation that last touched the row. Combined with Iceberg's snapshot metadata, you can map this number back to a specific operation: which job ran, at what timestamp, and what the table state looked like before and after. Insert a row and the sequence number is set. Update or merge it and the number advances to reflect the new operation.
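As a minimal sketch (the table name and data columns are illustrative), both metadata columns can be read with ordinary SQL alongside the business data:

```sql
-- Inspect row lineage metadata on a hypothetical Iceberg v3 table "orders".
SELECT
  order_id,
  status,
  _row_id,                        -- stable identity, assigned at insert
  _last_updated_sequence_number   -- advances on each update or merge
FROM orders
WHERE order_id = 1001;
```

After an UPDATE or MERGE touches the row, rerunning the same query shows the same _row_id with a higher sequence number.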
Together, the two columns tell you both the row's identity and what last happened to it. Neither piece of information is redundant. You need both to answer the questions that actually come up in practice.
NOTE: OPTIMIZE TABLE preserves both values when rewriting data files during compaction. Without that guarantee, a maintenance job could silently overwrite lineage metadata, making it unreliable for audit purposes.
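One way to see this guarantee in action is a before-and-after check around compaction (a sketch; the table name is illustrative):

```sql
-- Capture lineage values before compaction.
SELECT _row_id, _last_updated_sequence_number
FROM orders WHERE order_id = 1001;

-- Rewrite small data files into larger ones.
OPTIMIZE TABLE orders;

-- The same query returns identical values: compaction rewrote the
-- physical files but did not change what "last happened" to any row.
SELECT _row_id, _last_updated_sequence_number
FROM orders WHERE order_id = 1001;
```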
Row lineage metadata is stored in the Parquet files themselves, so it's readable by any engine that supports the Iceberg v3 spec, including Apache Spark 4.1 and later. A multi-engine environment where Spark handles ingestion and Dremio handles analytics can query row lineage from either side without any coordination.
Why This Matters More Than Table Lineage
Table-level lineage tells you that pipeline A feeds table B. That's useful for impact analysis and documentation, but it breaks down the moment someone asks a harder question.
Consider a financial services firm running daily reconciliation. A discrepancy shows up in the settlement table. The investigation needs to know which specific rows changed, in which batch, and whether those rows had been modified before that batch ran. Table-level lineage can't answer that, but row lineage can. Query _last_updated_sequence_number for the affected rows, map the sequence numbers back to snapshots, and you have an exact record of which operation wrote each value.
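A sketch of that investigation (table name and discrepancy predicate are illustrative; how snapshot metadata is surfaced can vary by engine and version):

```sql
-- Step 1: which operations last wrote the discrepant settlement rows?
SELECT _row_id, _last_updated_sequence_number
FROM settlements
WHERE settled_amount <> expected_amount;

-- Step 2: inspect the table's snapshot history to map those sequence
-- numbers back to specific operations, timestamps, and table states.
SELECT * FROM TABLE(table_snapshot('settlements'));
```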
Or consider a GDPR right-to-erasure request. The compliance team needs to confirm that all rows associated with a specific user have been deleted and haven't been reintroduced by a downstream merge or reprocessing job. With _row_id tracking, you can verify that the identifiers in question no longer exist in any active snapshot.
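A verification query might look like the following sketch, assuming the compliance team captured the affected _row_id values at deletion time into a hypothetical tracking table:

```sql
-- Confirm no erased identifier survives in the current table state.
-- "events" and "erased_row_ids" are illustrative names.
SELECT COUNT(*) AS surviving_rows
FROM events e
JOIN erased_row_ids r
  ON e._row_id = r.row_id;
-- A non-zero count means some rows were never deleted or were
-- restored by a snapshot-level operation.
```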
Data quality debugging is another common case. When a merge job produces unexpected duplicates or incorrect values in a large table, the normal investigation approach is sampling and guessing. With row lineage, you can query which rows were last touched by the suspect operation directly and scope the investigation to exactly those rows.
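Scoping that investigation is a single filter on the lineage column (the sequence number here is illustrative; look it up from the table's snapshot history first):

```sql
-- Return only the rows last written by the suspect merge job.
SELECT *
FROM orders
WHERE _last_updated_sequence_number = 12345;  -- suspect operation's sequence number
```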
Getting Started With Row Lineage on Dremio
Row lineage is on by default for every Apache Iceberg v3 table in Dremio Cloud. Creating a v3 table is the only prerequisite. From that point, _row_id and _last_updated_sequence_number are available as queryable columns on every row in the table.
If you want to explore this against your own data, a free Dremio Cloud environment at dremio.com/get-started includes full Iceberg v3 support from day one. Create a v3 table, run a few inserts and updates, and query the lineage columns directly to see how the sequence numbers advance with each operation.
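An end-to-end walkthrough might look like this sketch (the table name is illustrative, and the exact DDL for pinning the format version may vary by Dremio version):

```sql
-- Create an Iceberg v3 table; 'format-version' is the Iceberg
-- table property, assumed to be settable via TBLPROPERTIES here.
CREATE TABLE demo.lineage_test (id INT, val VARCHAR)
  TBLPROPERTIES ('format-version' = '3');

INSERT INTO demo.lineage_test VALUES (1, 'a'), (2, 'b');

-- Both rows now carry a _row_id and an initial sequence number.
SELECT id, val, _row_id, _last_updated_sequence_number
FROM demo.lineage_test;

UPDATE demo.lineage_test SET val = 'a2' WHERE id = 1;

-- Row 1 keeps its _row_id but its sequence number has advanced;
-- row 2 is untouched by the update.
SELECT id, val, _row_id, _last_updated_sequence_number
FROM demo.lineage_test;
```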