What’s New in Apache Iceberg 3.0

Version 3 of Apache Iceberg has been released. It adds a number of features that expand the flexibility of the table format, including a few much-requested data types, faster deletes, row lineage and default values for NULL types.

This new version, as well as v4, which the core development team is about to start on, will better equip Iceberg for new types of use cases, explained Russell Spitzer, a program manager for Apache Iceberg.

As an open data format, Apache Iceberg has been instrumental in the creation of data lakehouses, which combine multiple sources of structured and unstructured data for large-scale analysis. It uses a sophisticated set of metadata to keep track of the changes in the different files it indexes.

Iceberg, along with a good metadata store, keeps track of a schema as it evolves, which gives users more flexibility in updating the schema while maintaining the ability to query older data. It can do time travel and rollbacks. It can also scale without users worrying about partitions.

Apache Iceberg is both a set of specifications and a number of reference implementations. There is a view specification and a REST specification for how to communicate with the server. It also includes a specification for the Puffin file format, which stores indexes, statistics and other data that can’t be stored within an Iceberg manifest. There is also a range of implementations written in different languages (Java, Python, Rust, Go, C++) and based on different platforms, such as Apache Spark and Apache Flink.

...

Apache Polaris Nears a Big Release Too

Although originally developed at Netflix and subsequently maintained by Dremio, Iceberg has received quite a bit of open source help from Snowflake — in terms of engineering time and even certain features that Snowflake originally developed in-house for its own data formats.

Last year, Snowflake open-sourced Polaris, a REST catalog it had developed for Iceberg. Iceberg requires a metadata catalog to centralize metadata management, governance and access control for Iceberg tables.

The idea was to “abstract away commit logic from the client and have them in a central server location,” Spitzer said. The catalog often relies on a database for the actual persistence layer. Snowflake’s own commercial Polaris implementation, Open Catalog, uses FoundationDB.
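
To make the idea of central commit logic concrete, here is a minimal sketch in Python. It is not the Polaris or Iceberg REST API: the ToyCatalog class, its load and commit methods, the table name and the metadata paths are all hypothetical, and a real catalog would keep its state in a database rather than in memory. The sketch assumes only that a commit succeeds when the table still points at the metadata the client last read.

```python
import threading


class CommitConflict(Exception):
    """Raised when a table changed since the client last read it."""


class ToyCatalog:
    """Hypothetical in-memory catalog; not the Polaris or Iceberg REST API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}  # table name -> current metadata location

    def load(self, table):
        # Return the current metadata location, or None if the table is new.
        with self._lock:
            return self._pointers.get(table)

    def commit(self, table, expected, new):
        # Compare-and-swap: succeed only if nobody else committed first.
        with self._lock:
            current = self._pointers.get(table)
            if current != expected:
                raise CommitConflict(
                    f"{table}: expected {expected!r}, found {current!r}"
                )
            self._pointers[table] = new
            return new


catalog = ToyCatalog()
catalog.commit("db.events", None, "metadata/v1.json")

# Two clients read the same starting point.
seen_by_a = catalog.load("db.events")
seen_by_b = catalog.load("db.events")

# The first writer wins.
catalog.commit("db.events", seen_by_a, "metadata/v2-a.json")

try:
    catalog.commit("db.events", seen_by_b, "metadata/v2-b.json")
    b_committed = True
except CommitConflict:
    b_committed = False  # client B would have to reload and retry
```

Because the check and the swap happen in one place, clients do not need to coordinate with each other; a client that loses the race simply reloads the table and tries again.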

The version one release of Polaris will happen “soon,” Spitzer said. Last-minute adjustments are being made for production and security assurances. The software had a lot of Snowflake-specific parts that needed to be changed out.

And, of course, the software must be scalable.

“There’s folks who want to use it for their own internal organizations, where 20 transactions-a-second on the catalog is more than enough. But we have some people who want to run it as a service, or run it for a huge organization, where you need to handle thousands of transactions a second. It’s probably very rare, but we want to make sure that it scales up to that,” he said.

Read the full story, via The New Stack.
