Let’s take a deeper look at many of the new features that come with Apache Iceberg 0.13.0. (Release Notes)
Catalog caching now supports cache expiration
Catalog caching speeds up table reads by letting engines avoid re-reading a table’s metadata on every read.
When multiple readers and writers use Apache Iceberg tables from different engines, a cached table could go stale: the cache would not refresh, so a manual refresh was required to see new data. With the new cache expiration feature, you can set a time interval after which cached entries expire, forcing an automatic refresh. Configure this with the setting cache.expiration-interval-ms, which is ignored if cache-enabled is set to false. Read more on this feature here.
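To make the idea concrete, here is a minimal Python sketch of a cache whose entries expire after a fixed interval. It is a toy model, not Iceberg's implementation: the ExpiringTableCache class, its loader callback, and the rule that a non-positive interval means "never expire" are all assumptions made for illustration. Only the two property names come from the text above.

```python
import time


class ExpiringTableCache:
    """Toy stand-in for a catalog cache (hypothetical, not Iceberg code)."""

    def __init__(self, loader, properties, clock=time.monotonic):
        self._loader = loader  # function that reads table metadata from storage
        self._clock = clock
        self._enabled = properties.get("cache-enabled", "true") == "true"
        # Assumption for this sketch only: a non-positive interval never expires.
        interval_ms = int(properties.get("cache.expiration-interval-ms", "-1"))
        self._ttl_s = interval_ms / 1000.0
        self._entries = {}  # table name -> (metadata, time it was loaded)

    def load_table(self, name):
        if not self._enabled:
            # With caching off, the expiration interval is simply ignored.
            return self._loader(name)
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None:
            metadata, loaded_at = hit
            if self._ttl_s <= 0 or now - loaded_at < self._ttl_s:
                return metadata  # still fresh, no metadata read needed
        # Missing or expired entry: re-read the metadata and cache it again.
        metadata = self._loader(name)
        self._entries[name] = (metadata, now)
        return metadata
```

A reader that calls load_table repeatedly gets the cached metadata until the interval elapses, then transparently picks up whatever another engine has committed in the meantime.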
Hadoop catalog can be used with S3 and other file systems safely by using a lock manager
When using Iceberg on object stores like S3 and committing to tables from multiple engines concurrently, you previously couldn’t use the Hadoop catalog safely. The check Iceberg relied on to detect a conflicting write from a separate job or engine isn’t atomic on these stores, so two concurrent commits could result in data loss.
Iceberg 0.13.0 fixes this issue by leveraging a lock table in services like DynamoDB so all catalogs can use locks for safe concurrency. Read more on this feature here.
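The sketch below illustrates why a lock helps. It is a simplified Python model under stated assumptions, not Iceberg's code: ToyLockManager is an in-memory stand-in for an external lock service such as a DynamoDB table, and ToyTable, its commit method, and the version-numbered metadata dictionary are hypothetical names invented for the example. The point is that the "does the next version already exist?" check and the write happen while the lock is held, so two writers cannot both pass the check.

```python
import threading
import time


class ToyLockManager:
    """In-memory stand-in for an external lock service (hypothetical)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # entity id -> owner id currently holding the lock

    def acquire(self, entity_id, owner_id):
        with self._guard:
            if entity_id in self._owners:
                return False
            self._owners[entity_id] = owner_id
            return True

    def release(self, entity_id, owner_id):
        with self._guard:
            if self._owners.get(entity_id) == owner_id:
                del self._owners[entity_id]
                return True
            return False


class ToyTable:
    """Toy table whose commits are guarded by a lock manager."""

    def __init__(self, lock_manager, name="db.events"):
        self.lock_manager = lock_manager
        self.name = name
        self.metadata_files = {1: "initial"}  # version -> metadata payload

    def current_version(self):
        return max(self.metadata_files)

    def commit(self, base_version, payload, owner_id):
        # Wait for the table lock so check-then-write runs as one unit.
        while not self.lock_manager.acquire(self.name, owner_id):
            time.sleep(0.001)
        try:
            next_version = base_version + 1
            if next_version in self.metadata_files:
                # Another writer committed first; the caller must refresh and retry.
                return False
            self.metadata_files[next_version] = payload
            return True
        finally:
            self.lock_manager.release(self.name, owner_id)
```

If several writers all start from version 1 and commit at once, exactly one of them succeeds and the rest are told to retry, instead of silently overwriting each other.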
Catalog now supports registration of Iceberg table from a given metadata file location
Previously, you could drop a table from a HiveCatalog, but there was no way to add an existing table to the catalog. With this update, you can add existing tables to your catalog by passing the location of the newest metadata file of the external table. This is especially helpful for interacting with Hive external tables in Spark. Read more on this feature here.
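As a rough illustration of that workflow, the Python sketch below picks the newest metadata file from a listing and registers it in a toy catalog. Everything here is hypothetical scaffolding rather than the Iceberg API: ToyCatalog and its register_table method only mirror the idea, and newest_metadata_file assumes metadata files are named with a numeric version prefix followed by a dash (for example 00002-<uuid>.metadata.json), which may not hold for every catalog.

```python
import re

# Assumed naming convention: "<version>-<anything>.metadata.json".
_METADATA_NAME = re.compile(r"(?:^|/)(\d+)-[^/]*\.metadata\.json$")


def newest_metadata_file(locations):
    """Return the location whose file name has the highest version prefix."""
    versioned = []
    for location in locations:
        match = _METADATA_NAME.search(location)
        if match:
            versioned.append((int(match.group(1)), location))
    if not versioned:
        raise ValueError("no metadata files found")
    return max(versioned)[1]


class ToyCatalog:
    """Dict-backed stand-in for a catalog (hypothetical, not the Iceberg API)."""

    def __init__(self):
        self._tables = {}  # identifier -> current metadata file location

    def register_table(self, identifier, metadata_file_location):
        if identifier in self._tables:
            raise ValueError(f"table already exists: {identifier}")
        self._tables[identifier] = metadata_file_location

    def drop_table(self, identifier):
        return self._tables.pop(identifier, None) is not None

    def metadata_location(self, identifier):
        return self._tables[identifier]
```

Registering only records a pointer to the metadata file; the data files themselves are not copied or rewritten in this model.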
Deletes now supported for ORC Files
In Iceberg’s v2 format, delete files are used to track records that have been deleted. Previously, this wasn’t supported when the table’s underlying file format was ORC. Now, both position and equality deletes are supported for tables backed by ORC files. Read more here.
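The difference between the two delete types can be sketched in a few lines of Python. This is a conceptual model only, with hypothetical names and in-memory data: a position delete identifies a row by data file path and row position, while an equality delete identifies rows by column values. The real format has additional rules (for example, which delete files apply to which data files) that this sketch leaves out.

```python
def apply_deletes(file_path, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of deletes.

    position_deletes: set of (file_path, row_position) pairs.
    equality_deletes: list of dicts; a row is deleted when it matches every
    column/value pair of any one dict.
    """
    live = []
    for pos, row in enumerate(rows):
        # Position delete: this exact row of this exact file was removed.
        if (file_path, pos) in position_deletes:
            continue
        # Equality delete: any row with these column values was removed.
        if any(
            all(row.get(col) == val for col, val in cond.items())
            for cond in equality_deletes
        ):
            continue
        live.append(row)
    return live


rows = [
    {"id": 1, "region": "us"},
    {"id": 2, "region": "eu"},
    {"id": 3, "region": "us"},
]
surviving = apply_deletes(
    "data/file-a.orc",
    rows,
    position_deletes={("data/file-a.orc", 0)},
    equality_deletes=[{"id": 3}],
)
```

Here the first row is dropped by position and the row with id 3 is dropped by value, so only the row with id 2 remains in surviving.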
Along with the core updates detailed above, several vendor integrations were added in version 0.13.0, including:
The table listing API call in the Hive catalog can now return non-Iceberg tables. [#3908]
Apache Iceberg is adding the features data engineers want and need in 2022, as the attention and momentum it has gained just a little over a month into the year make clear.
This momentum is only getting started, with multiple announcements already in 2022. Iceberg will play a large role in the Subsurface 2022 conference, held live online March 2-3 and featuring several talks on Apache Iceberg. Register for the free conference here so you don’t miss any of the Iceberg sessions.
Alex Merced is a Developer Advocate for Dremio with a history of creating content to enable developers of all types through his personal projects like DevNursery.com, The Web Dev 101 Podcast, and the DataNation podcast. Alex Merced has been a developer with companies like Crossfield Digital, CampusGuard, GenEd Systems and others along with being an Instructor for General Assembly Bootcamps.