On Wednesday, February 9th, the Apache Iceberg community released version 0.13.0, which brings many new features and improvements. A redesigned website and documentation launched alongside it.

2022 is already off to a strong start for Apache Iceberg: platforms like Dremio, AWS and Snowflake have announced new levels of support, and Software Development Times chose it as the “Open Source Project of the Week” for the first week of February 2022.

Let’s take a deeper look at many of the new features that come with Apache Iceberg 0.13.0. (Release Notes)

Catalog caching now supports cache expiration

Catalog caching speeds up table reads by sparing engines from re-reading a table’s metadata on every read.

When multiple readers and writers used the same Apache Iceberg tables from different engines, the cache would sometimes not refresh, so a manual refresh was required to see new data. With the new cache expiration feature, you can set a time interval after which the cache expires, forcing an automatic refresh. Configure it with the setting cache.expiration-interval-ms, which is ignored if cache-enabled is set to false. Read more on this feature here.
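As a rough illustration, the two settings could be supplied to Spark as catalog properties. The sketch below only builds the property map and does not start a Spark session. The catalog name `demo` and the 30-second interval are assumptions made for the example, and the helper names are hypothetical.

```python
def catalog_cache_conf(catalog_name, cache_enabled=True, expiration_ms=30_000):
    """Build Spark properties that control Iceberg catalog caching.

    catalog_name and the default values are illustrative assumptions.
    """
    prefix = f"spark.sql.catalog.{catalog_name}."
    conf = {prefix + "cache-enabled": str(cache_enabled).lower()}
    if cache_enabled:
        # The expiration interval is ignored when caching is disabled,
        # so it is only emitted when the cache is on.
        conf[prefix + "cache.expiration-interval-ms"] = str(expiration_ms)
    return conf


def as_spark_submit_args(conf):
    """Render the property map as --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

A shorter interval makes other engines’ commits visible sooner at the cost of more metadata reads, so the right value depends on how often the tables change.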

Hadoop catalog can be used with S3 and other file systems safely by using a lock manager

When using Iceberg on object stores like S3 and committing to tables from multiple engines concurrently, you previously couldn’t use the Hadoop catalog safely. The check Iceberg relied on to detect conflicting writes from two separate jobs or engines isn’t atomic on these stores, so two concurrent commits could result in data loss.

Iceberg 0.13.0 fixes this issue by leveraging a lock table in services like DynamoDB so all catalogs can use locks for safe concurrency. Read more on this feature here.
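To see why a lock helps, here is a minimal sketch of a check-then-write commit that is serialized by a lock. It is not Iceberg’s implementation: `InMemoryLockManager` is a hypothetical in-process stand-in for an external lock table, the dictionary plays the role of the object store, and the table name is made up.

```python
import threading
import time
import uuid


class InMemoryLockManager:
    """Hypothetical stand-in for an external lock table (e.g. one kept in DynamoDB)."""

    def __init__(self):
        self._owners = {}
        self._guard = threading.Lock()

    def acquire(self, entity_id, owner_id):
        # Succeeds only if nobody currently holds the lock for this entity.
        with self._guard:
            if entity_id in self._owners:
                return False
            self._owners[entity_id] = owner_id
            return True

    def release(self, entity_id, owner_id):
        # Only the current owner may release the lock.
        with self._guard:
            if self._owners.get(entity_id) != owner_id:
                return False
            del self._owners[entity_id]
            return True


def commit(store, locks, table, payload):
    """Write the next metadata version for `table` while holding its lock."""
    owner = str(uuid.uuid4())
    while not locks.acquire(table, owner):
        time.sleep(0.001)  # back off briefly, then retry
    try:
        # Check-then-write: safe only because the lock serializes committers.
        next_version = len(store) + 1
        if next_version in store:
            raise RuntimeError("version already exists")
        time.sleep(0)  # yield to other threads between the check and the write
        store[next_version] = payload
        return next_version
    finally:
        locks.release(table, owner)


def run_concurrent_commits(writers=4, commits_each=10):
    """Run several writer threads against one table and return the store."""
    store, locks = {}, InMemoryLockManager()

    def work(writer_id):
        for i in range(commits_each):
            commit(store, locks, "db.events", f"writer-{writer_id}-commit-{i}")

    threads = [threading.Thread(target=work, args=(w,)) for w in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

Because every committer holds the lock across both the check and the write, no two commits can claim the same version, which is the property the non-atomic check alone could not guarantee.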

Catalog now supports registering an Iceberg table from a given metadata file location

With a HiveCatalog, you could drop a table, but there was no way to add an existing table to the catalog. With this update, you can add existing tables to your catalog by passing the location of the newest metadata file of the external table. This is especially helpful for interacting with Hive external tables in Spark. Read more on this feature here.
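Finding the newest metadata file is the part you have to do yourself before registering. The helper below is a sketch under an assumption: that metadata files are named with a numeric version prefix followed by a dash, such as 00002-<uuid>.metadata.json. The function name and example paths are hypothetical, and the registration call itself is not shown.

```python
import re

# Assumed naming scheme: a version counter, a dash, an arbitrary suffix,
# then ".metadata.json" (for example 00002-<uuid>.metadata.json).
_METADATA_NAME = re.compile(r"^(\d+)-.*\.metadata\.json$")


def newest_metadata_file(paths):
    """Return the path whose file name carries the highest version counter."""
    best_path, best_version = None, -1
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        match = _METADATA_NAME.match(name)
        if match and int(match.group(1)) > best_version:
            best_path, best_version = path, int(match.group(1))
    if best_path is None:
        raise ValueError("no metadata files found")
    return best_path
```

The returned path is what you would pass as the metadata file location when registering the table. If your tables use a different naming scheme, the regular expression needs to change accordingly.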

Deletes now supported for ORC Files

In Iceberg’s v2 format, delete files are used to track records that have been deleted. Previously, this wasn’t supported when the table’s underlying file format was ORC. Now, both position and equality deletes are supported for tables backed by ORC files. Read more here.
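The difference between the two delete types is easiest to see on a toy example. The sketch below is conceptual only: it works on in-memory rows, ignores the on-disk format and any ordering between commits, and uses hypothetical names throughout. A position delete names a data file and a row position in it; an equality delete names column values that deleted rows match.

```python
def apply_deletes(data_file, rows, position_deletes, equality_deletes):
    """Return the rows of one data file that survive both kinds of delete.

    position_deletes: set of (file_path, row_position) pairs.
    equality_deletes: list of dicts; a row is deleted if it matches every
        column/value pair in any one of them.
    """
    live = []
    for pos, row in enumerate(rows):
        # Position delete: this exact row of this exact file was removed.
        if (data_file, pos) in position_deletes:
            continue
        # Equality delete: any row with these column values was removed.
        # An empty predicate is skipped so it cannot delete every row.
        if any(pred and all(row.get(col) == val for col, val in pred.items())
               for pred in equality_deletes):
            continue
        live.append(row)
    return live
```

For example, with a position delete on row 1 of `data/file-a.orc` and an equality delete on `id = 4`, a four-row file with ids 1 through 4 would be read back as ids 1 and 3.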

Vendor Integrations

Along with the core updates above, version 0.13.0 adds several vendor integrations, including:

  • Native GCS FileIO Support [#3711]
  • Support for Aliyun Object Storage Service [#3553]
  • Removed restrictions on the S3 endpoint to enable support for any S3-compatible storage [#3656] [#3658]
  • AWS S3FileIO now supports server-side checksum validation [#3813]
  • AWS GlueCatalog now displays more table information including table location, description [#3467], and columns [#3888]
  • Using multiple FileIOs based on file path scheme is supported by configuring a ResolvingFileIO [#3593]
  • Dremio now supports Iceberg tables that use an AWS Glue catalog [20.0]
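The idea behind choosing a FileIO by path scheme, mentioned in the list above, can be sketched as a simple lookup. This is an illustration of the concept, not the behavior of ResolvingFileIO itself: the mapping, the fallback, and the function name are assumptions for the example.

```python
from urllib.parse import urlparse

# Illustrative mapping only; it is not the library's own lookup table.
SCHEME_TO_IO = {
    "s3": "S3FileIO",
    "gs": "GCSFileIO",
}
DEFAULT_IO = "HadoopFileIO"  # assumed fallback for any other scheme


def resolve_io(location):
    """Pick a FileIO label from the scheme of a file location."""
    scheme = urlparse(location).scheme
    return SCHEME_TO_IO.get(scheme, DEFAULT_IO)
```

With this mapping, `resolve_io("s3://bucket/data/file.parquet")` selects the S3 implementation, while an `hdfs://` or local path falls through to the default.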

Tooling Support

This release also updates the support available for many of the most popular data processing tools.

Apache Spark

  • Spark 3.2 support. [#3970]
  • Merge-on-read delete support in Spark 3.2. [#3970]
  • Compaction in Spark with RewriteDataFiles now supports table-based optimization and merge-on-read deletes. [#2829]
  • Time travel queries in Spark use the schema for the snapshot used in the query instead of the latest schema in the metadata. [#3722]
  • Spark vectorized reads now support row-level deletes. [#3557]
  • The add_files procedure no longer writes duplicate metadata files when called multiple times on the same table. [#2779]
  • Stored procedure support for RewriteDataFiles. [#3375]
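As an example of the stored procedure form of RewriteDataFiles, the sketch below builds the SQL text of a CALL statement that could be handed to a Spark session. The catalog name `demo`, the table `db.sample`, and the helper name are assumptions for illustration; the helper does no escaping, so it should only be used with trusted values.

```python
def rewrite_data_files_call(catalog, table, options=None):
    """Build a CALL statement for the rewrite_data_files procedure.

    options is an optional dict of string keys and values passed as a SQL map.
    """
    args = [f"table => '{table}'"]
    if options:
        pairs = ", ".join(f"'{k}', '{v}'" for k, v in sorted(options.items()))
        args.append(f"options => map({pairs})")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"
```

The resulting string could then be run with something like `spark.sql(statement)` in a session where the Iceberg SQL extensions are enabled.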

Apache Flink

  • Flink 1.13 and 1.14 support. [#3116]
  • Easier creation of Iceberg tables from Flink. [#3666]
  • Streaming upsert support. [#2863]

Apache Hive

  • The table listing API call in the Hive catalog can now return non-Iceberg tables. [#3908]

Conclusion

In 2022, Apache Iceberg is adding the features data engineers want and need, as the attention and momentum it has gathered just over a month into the year make clear.

That momentum is only beginning, with multiple announcements already made in 2022. Iceberg will play a large role in Subsurface 2022, a conference held live online March 2-3 that features several talks on Apache Iceberg. Register for the free conference here so you don’t miss any of the Iceberg sessions.

Subsurface 2022 Iceberg Sessions

Recordings of Iceberg Sessions from Subsurface 2020-2021

If you aren’t currently using Apache Iceberg for your data lake, give it a test run by creating some Iceberg tables using AWS Glue.