While many Iceberg APIs have been stable for quite some time, this 1.0 release signals that the community has agreed to freeze these APIs so those who weren’t yet ready to build on them can feel comfortable doing so. If you haven’t adopted Apache Iceberg in your data lakehouse yet, now is a great time to do so.
While this milestone release brings API stability guarantees, it also includes new features to improve:
Performance and support for additional use cases

- Spark support for merge-on-read updates and deletes
- Z-order sort optimization
- The Puffin format for stats and indexes
- New interfaces for consuming data incrementally
- Parquet row group Bloom filter support
- Support for mergeSchema on writes
- Vectorized reads for Parquet by default

Ease of use

- Time-travel in Spark SQL
- The registerTable procedure for catalog migration
- New all_data_files and all_manifests metadata tables

Broader ecosystem integration

- A standard REST API catalog
- Support for Apache Spark 3.3
- Support for Apache Flink 1.15
Since the 1.0 release quickly followed the 0.14 release, this blog covers features added in both the recent 0.14 and 1.0 releases.
Performance and support for additional use cases
Merge-on-read for UPDATE and MERGE in Spark [#4047, #3984]
Merge-on-read (MOR) became available in Iceberg 0.13, and was supported for deletes only in Spark.
MOR is a write mode that enables engines to write data more frequently by writing delete files to track deletions instead of rewriting full data files. This is beneficial when doing high-velocity streaming updates and deletes.
Before this update, merge-on-read for updates and merges wasn’t yet supported in Spark, and attempting to use it resulted in an error.
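With this release you can switch those operations to merge-on-read as well. A minimal sketch, assuming an illustrative table `catalog.db.sample` (the write mode for each operation is controlled through table properties):

```sql
-- switch deletes, updates, and merges to merge-on-read
ALTER TABLE catalog.db.sample SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```

Copy-on-write remains the default, so enable merge-on-read only for the operations where frequent, small writes matter.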
Z-order sort optimization

Z-order is a method of sorting data by multiple fields where each field is weighted equally, instead of sorting by one field first and then another. Z-order sorting helps accelerate queries on tables that are typically filtered by multiple dimensions and serves as a foundation for advanced indexing strategies. You can now use z-order sorting when running compaction jobs on Iceberg tables.
To run compaction in Iceberg, you can use the rewrite_data_files procedure, which rewrites your data files according to the strategy and sort_order options you pass. For example, if all you want to do is combine smaller data files into larger ones, you can use the binpack strategy (the fastest strategy) with a SQL CALL procedure like so:
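A sketch of such a call, using an illustrative table `db.sample`:

```sql
-- combine small data files into larger ones (no sorting)
CALL catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'binpack'
)
```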
If you also want to sort by a particular column (or columns) that the table is often filtered by, you can use the sort strategy, which sorts the data while rewriting the files to improve read times (note: this takes longer than a binpack rewrite):
CALL catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'id DESC NULLS LAST, name ASC NULLS FIRST'
)
The above provides improved performance for queries that filter by id or id and name, but not queries that filter by just name. That’s where z-order comes in.
With this Iceberg release, you can now also sort using a z-order sort for even better read performance when queries filter by multiple fields independently:
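A sketch, reusing the illustrative `db.sample` table:

```sql
-- rewrite files with id and name interleaved via z-order
CALL catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'sort',
  sort_order => 'zorder(id, name)'
)
```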
The above will optimize for queries that filter by id, id and name, and queries that filter by just name.
Puffin format for stats and indexes [#5129, #5127]
Puffin is a format for storing table statistics and index information for Iceberg tables that any engine can leverage to improve query performance. You can read the specification details here.
Since the format specification was just defined, it isn’t being used yet, but with this specification, the door is open to adding different types of indexing and tracking more table stats. This will have a profound effect on the ability to read less data and plan queries better against Iceberg tables in the future.
Updates to the file readers in the Iceberg library make reading Parquet files more efficient. Tail reads read the final bytes of the file, where the footer lives, and range reads fetch portions of column data throughout the file.
Certain engines, such as Dremio, have used their own Parquet readers to take advantage of these and other techniques to read Parquet files quickly and efficiently. This feature helps other engines leverage some of these techniques without having to implement them themselves.
Added new interfaces for consuming data incrementally [#4870, #4580]

Change data capture (CDC) is very important when ingesting data at high speed. The new interfaces allow engines to implement methods that scan for changes between snapshots. This data can be used to update materialized views and indexes efficiently, as well as to update downstream tables and systems with minimal latency.
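In Spark this surfaces as read options that bound a scan between two snapshots. A sketch in the same style as Iceberg’s other Spark read options (the snapshot IDs are illustrative):

```scala
// read only the records appended between two snapshots
spark.read
  .format("iceberg")
  .option("start-snapshot-id", "10963874102873")
  .option("end-snapshot-id", "63874143573109")
  .load("path/to/table")
```

Note that incremental reads cover append snapshots; snapshots produced by replace or overwrite operations aren’t supported.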
Use vectorized reads for Parquet by default [#4196]

Vectorized Parquet reads optimize memory use and speed up reading by processing data in batches, a feature Iceberg has had for quite some time. With this update it is on by default, so you no longer have to enable it manually to get the performance boost when using the Iceberg Parquet readers.
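If you ever need to opt out, for example to isolate a reader issue, the behavior is controlled by a table property. A sketch, assuming an illustrative table `catalog.db.sample`:

```sql
-- turn vectorized Parquet reads back off for one table
ALTER TABLE catalog.db.sample
SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false');
```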
Time-travel in Spark SQL

Time-travel is a key Iceberg feature that lets you run queries against any prior state of the table. With time-travel you can query data at previous points in time to test machine learning algorithm updates on consistent data, run quality checks, perform audits, and more.
Before this release, if you wanted to run time-travel queries in Spark, you’d have to use Python or Scala to run code that looks like this:
// time travel to June 18th, 2022 at 13:34:03 UTC
// (as-of-timestamp takes milliseconds since the Unix epoch)
spark.read
    .option("as-of-timestamp", "1655559243000")
    .format("iceberg")
    .load("path/to/table")
With the latest releases you can now do this in Spark SQL:
-- time travel to June 18th, 2022 at 13:34:03
SELECT * FROM catalog.db.table
TIMESTAMP AS OF '2022-06-18 13:34:03';
-- time travel to snapshot by id
SELECT * FROM catalog.db.table
VERSION AS OF 10573874102873;
The registerTable procedure for catalog migration

Catalog.registerTable is a method in the Iceberg Java API, previously available for Hive catalogs, that lets you register an Iceberg table that already exists in another Iceberg catalog. You specify the metadata.json you want to register, and the table is added to the Hive catalog. This lets you migrate an Iceberg table to a Hive catalog without losing snapshot history, as creating a new table with the migrate procedure or a CTAS statement would.
With the updates in Iceberg 0.14/1.0 there are now implementations for this method for migrating Iceberg tables to other catalogs such as Nessie, DynamoDB, AWS Glue and more.
So if you have Iceberg tables registered with one catalog and want to migrate them with their snapshot history to another catalog, such as Nessie or AWS Glue, it’s as easy as calling:
// create a Nessie catalog instance
NessieCatalog nessieCatalog = ... ;

// register the table by pointing at its latest metadata file
nessieCatalog.registerTable(
    TableIdentifier.of("db", "my_table"),
    "s3://.../v5.metadata.json"
);
New all_data_files and all_manifests metadata tables [#4243, #4693, #4336]

This release includes new metadata tables for quickly querying more information about your Iceberg table. Unlike the existing data_files and manifests tables, which cover only the current snapshot, the new all_data_files and all_manifests tables provide data across all snapshots.
Metadata tables can be used for a variety of use cases:
- Tabulating the total size of all table files across snapshots to determine whether it’s worth expiring snapshots to reduce storage use
- Assessing the need for compaction
- Assessing how tables are partitioned

all_data_files: metadata and statistics for all data files and delete files across snapshots.
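For example, the first use case above can be a single query against all_data_files. A sketch, assuming an illustrative table `catalog.db.sample`:

```sql
-- total bytes held by data files across every snapshot
SELECT sum(file_size_in_bytes) AS total_bytes
FROM catalog.db.sample.all_data_files;
```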
A standard REST API catalog

Catalogs in Iceberg are used to atomically track the latest metadata file for each table. Many catalogs already support Iceberg, such as Hive Metastore, AWS Glue, and Project Nessie.
In the past, every time a new catalog was supported it required re-implementing many interfaces and supporting libraries. The new REST catalog creates a universal interface to simplify existing catalog solutions.
The REST catalog is an open API specification, which you can see here. It allows anyone to implement their own catalog as a REST API. As new catalog implementations are created, any that follow the API specification won’t need a new connector; any engine that supports the REST catalog can use the new catalog immediately.
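For example, pointing Spark at a REST catalog is purely a matter of configuration. A sketch with an illustrative catalog name and URI:

```properties
# register a catalog named my_rest_cat backed by the REST catalog client
spark.sql.catalog.my_rest_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_rest_cat.catalog-impl=org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.my_rest_cat.uri=http://localhost:8181
```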
Apache Iceberg is a production-ready technology used at companies like Netflix, Apple, Adobe, LinkedIn, Stripe, and many others. It is a key component in building open lakehouses that let you run more of your workloads on the data lake at any scale, without vendor lock-in.
With this milestone 1.0 release, Iceberg adds API stability guarantees as well as features that improve performance, ease of use and the broader ecosystem.
With its capabilities in on-prem to cloud migration, data warehouse offload, data virtualization, upgrading data lakes and lakehouses, and building customer-facing analytics applications, Dremio provides the tools and functionalities to streamline operations and unlock the full potential of data assets.