h2h2h2h2h2h2h2h2h2h2

19 minute read · October 15, 2022

Apache Iceberg Achieves Milestone 1.0 Release

Alex Merced

Alex Merced · Developer Advocate, Dremio

Apache Iceberg has been a production-ready open source project used to drive analytics at companies like Netflix, Adobe, Apple and more for a long time now. This 1.0 release ensures API stability guarantees and reinforces its status as a production-ready technology servicing data warehousing and data science use cases in the growing adoption of open lakehouses. 

While many Iceberg APIs have been stable for quite some time, this 1.0 release signals that the community has agreed to freeze these APIs so those who weren’t yet ready to build on them can feel comfortable doing so. If you haven’t adopted Apache Iceberg in your data lakehouse yet, now is a great time to do so.

While this milestone release brings API stability guarantees, it also includes new features to improve:

  • Performance and support for additional use cases
    • Spark support for merge-on-read updates and deletes
    • Z-order sort optimization
    • The Puffin format for stats and indexes
    • New interfaces for consuming data incrementally
    • Parquet row group Bloom filter support
    • Support for mergeSchema on writes
    • Vectorized reads for Parquet by default
  • Ease of use
    • Time-travel in Spark SQL
    • The registerTable procedure for catalog migration
    • New all_data_files and all_manifest_files metadata tables
  • Broader ecosystem integration A standard REST API catalog
    • Support for Apache Spark 3.3
    • Support for Apache Flink 1.15

Since the 1.0 release quickly followed the 0.14 release, this blog covers features added in both the recent 0.14 and 1.0 releases.

Performance and support for additional use cases

Merge-on-read for UPDATE and MERGE in Spark [#4047, #3984]

Merge-on-read (MOR) became available in Iceberg 0.13, and was supported for deletes only in Spark. 

MOR is a write mode that enables engines to write data more frequently by writing delete files to track deletions instead of rewriting full data files. This is beneficial when doing high-velocity streaming updates and deletes.

Before this update, the ability to use merge-on-read for updates and merges in Spark wasn’t quite there yet and trying to use it would result in this error:

Delta merges not currently supported

With this release, you can now enjoy merge-on-read on your deletes/updates/merge queries when using Spark. You can set your table settings to merge-on-read for faster delete/updates like so:

ALTER TABLE catalog.db.students SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);
Z-order rewrite strategy [#5229]

Z-order is a method of sorting data by multiple fields where they are weighted equally instead of sorting by one field first and then another. Z-order sorting helps accelerate query planning for tables typically filtered by multiple dimensions and serves as the backbone for advanced indexing strategies. You can now use z-order sorting when running compaction jobs on Iceberg tables.

To run compaction in Iceberg you can use the rewrite_data_files procedure which will re-write your data files depending on the rewrite strategy and sort_order properties you pass. For example, if all you wanted to do was write all smaller data files into larger ones you could use a binpack strategy (the fastest strategy) with an SQL call procedure like so:

CALL catalog.system.rewrite_data_files(
   table => 'db.sample',
   strategy => 'binpack'
)

If you also wanted to sort based on a particular column/s that the table is often filtered by you can use the sort strategy, and it will sort the data while it rewrites the files to improve read times (note:this will take longer than a binpack rewrite):

CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'id DESC NULLS LAST, name ASC NULLS FIRST'
)

The above provides improved performance for queries that filter by id or id and name, but not queries that filter by just name. That’s where z-order comes in.

With this Iceberg release, you can now also sort using a z-order sort for even better read performance when queries filter by multiple fields independently:

CALL catalog.system.rewrite_data_files(
    table => 'db.sample',
    strategy => 'sort',
    sort_order => 'zorder(id, last_name)'
)

The above will optimize for queries that filter by id, id and name, and queries that filter by just name.

Puffin format for stats and indexes [#5129, #5127]

The Puffin format is a format for storing table stats and index information for Iceberg tables that can be leveraged by any engine to improve query performance. You can read the specification details here.

Since the format specification was just defined, it isn’t being used yet, but with this specification, the door is open to adding different types of indexing and tracking more table stats. This will have a profound effect on the ability to read less data and plan queries better against Iceberg tables in the future.

Range and tail reads for IO [#4608]

The updates to the file readers in the Iceberg library make reading Parquet files more efficient. Tail reads involve reading the final bytes in the file where the footer exists and range reads assist in reading portions of column data throughout the file. 

Certain engines such as Dremio have used their own Parquet readers to take advantage of these and other techniques to read Parquet files fast and efficiently. This feature will help other engines leverage some of these techniques without having to implement it themselves. 

Added new interfaces for consuming data incrementally (#4870, #4580)

Change data capture (CDC) is very important when ingesting data at high speeds. The new interfaces will allow for the implementation of methods that scan for changes between snapshots. This data can be used to update materialized views and indexes efficiently along with updating downstream tables and systems at minimum latency.

Use vectorized reads for Parquet by default (#4196)

Vectorized Parquet reads optimize the use of memory and speed up the reading of Parquet files by reading the data in batches which has been a feature in Iceberg for quite some time. With this update, this feature is on by default so you do not have to manually enable this feature for the performance boost if you use the Iceberg Parquet readers.

Ease of use

Time-travel in Spark SQL [#5156]

Time-travel is a key feature in Iceberg that allows you to run queries on any prior state of the table. With time-travel you can query data at previous points to test machine learning algorithm updates on consistent data, run quality checks, audits and more.

Before this release, if you wanted to run time-travel queries in Spark, you’d have to use Python or Scala to run code that looks like this:

// time travel to June 18th, 2022 at 13:34:03
spark.read
    .option("as-of-timestamp", "1658165643")
    .format("iceberg")
    .load("path/to/table")

With the latest releases you can now do this in Spark SQL:

-- time travel to June 18th, 2022 at 01:21:00
SELECT * FROM catalog.db.table
     TIMESTAMP AS OF '2022-06-18 13:34:03';

-- time travel to snapshot by id
SELECT * FROM catalog.db.table
     VERSION AS OF 10573874102873;
Register tables for all catalogs [#4946, #5037]

Catalog.registerTable is a method that exists in Iceberg Java API for Hive catalogs that allows you to register an existing Iceberg table that exists in another Iceberg catalog. You specify the metadata.json you want to register and it’s added to the Hive catalog. This allows you to migrate an Iceberg table to a Hive catalog without losing snapshot history, as creating a new table using the migrate procedure or a CTAS statement would. 

With the updates in Iceberg 0.14/1.0 there are now implementations for this method for migrating Iceberg tables to other catalogs such as Nessie, DynamoDB, AWS Glue and more. 

So if you have Iceberg tables registered with one catalog and want to migrate them with their snapshot history to another catalog, such as Nessie or AWS Glue, it’s as easy as calling:

// create Nessie Catalog object
var NessieCatalog = ... ;
// register table with catalog
NessieCatalog.registerTable(
    "db.my_table",
    "s3://.../v5.metadata.json"
);
New all_data_files and all_manifest_files metadata tables (#4243, #4693, #4336)

This release includes new metadata tables to query more information on your Iceberg table quickly. The all_data_files and all_manifests metadata tables provide data for the current snapshot as well as all snapshots.

Metadata tables can be used for a variety of use cases:

  • Tabulating the total size of all the table files across snapshots to determine whether it’s worth expiring snapshots to reduce storage use
  • Assess the need for compaction
  • Assess partitioning of tables

All data files: Metadata and info for all data files and delete files.

SELECT * FROM catalog1.db1.table1.all_data_files

All manifests: Metadata on all manifests

SELECT * FROM catalog1.db1.table1.all_manifests

Broader ecosystem integration

REST catalog [#5226, #5235, #5255]

Catalogs in Iceberg are used to atomically track the latest metadata file in a particular table. Many catalogs already support Iceberg, such as Hive Metastore, AWS Glue and Project Nessie. 

In the past, every time a new catalog was supported it required re-implementing many interfaces and supporting libraries. The new REST catalog creates a universal interface to simplify existing catalog solutions.

The REST catalog is an open API specification that you can see here. This allows anyone to implement their own catalog as a REST API. So as new catalog implementations are created, if you follow the API specification, you won’t need to create a new connector; any engine that supports the REST API catalog implementation can use the new catalog immediately.

Additional updates

You can find the full release notes here along with a list of pull requests related to release 0.14 and 1.0.

Conclusion

Apache Iceberg is a production-ready technology used in production at companies like Netflix, Apple, Adobe, LinkedIn, Stripe and many others. Apache Iceberg is a key component in building open lakehouses that allow you to run more of your workloads on the data lake at any scale and without vendor lock-in. 

With this milestone 1.0 release, Iceberg adds API stability guarantees as well as features that improve performance, ease of use and the broader ecosystem. 

There’s never been a better time to adopt Apache Iceberg to build out your data lakehouse.

Other Iceberg resources

Other resources to help you learn more about Apache Iceberg.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.