March 1, 2023

9:35 am - 10:05 am PST

Technical Evolution of Apache Iceberg

Apache Iceberg is a cloud-native table format for huge analytic data sets. This talk will focus on Iceberg’s technical evolution. In particular, the presentation will cover the initial set of features available in 2018, what’s available today, and the project roadmap for the future. Attendees will get a clear picture of the current capabilities of Apache Iceberg as well as learn about what use cases the table format will enable in the near future.

Topics Covered

Open Source



Note: This transcript was created using speech recognition software. It may contain errors.

Anton Okolnychyi:

Welcome everyone. I'm here to walk you through the technical evolution of Apache Iceberg. By the end of this presentation, you'll know the initial project capabilities, what features we added over the years, and what's ahead. I'm an Iceberg PMC member at Apple. For the last several years I've been working on Apple cloud services, where my responsibility is to make analytics reliable and efficient at scale. I'm a technical lead for the team that introduced Iceberg to Apple, and I think we were the first team outside of Netflix to heavily invest in the project. We onboarded our first production use case in 2019, and it still operates on all of its data today. We'll essentially talk about how the community progressed over the years. In my view, we can split the project timeline into three different phases, each with its own priorities. The initial focus was on the core format spec, where a lot of the work was driven by limitations in Hive tables.

We wanted to create a format that would be free of those issues by design, and we also wanted to lay the groundwork for future developments. Then we transitioned into the phase of making Iceberg production ready. That means providing all sorts of connectors for engines, making sure that table metadata scales well, and providing tooling for admins to manage tables. I think we did a pretty good job in the first two phases, and right now we're in the phase of proving that an Iceberg lakehouse can be as capable and as efficient as any proprietary data warehouse on the market. We've made substantial progress here too, but there are also gaps that we have to fix as a community, and I think this is what we are going to work on in the coming years.

I think it would be fair to say that Iceberg was created out of frustration with Hive tables, which were the de facto standard table format at the time. Here I refer to the general idea of keeping a list of partitions in the metastore and then doing file listings to plan your jobs or commit new versions. There have been various attempts to fix Hive tables, for example by adding ACID capabilities, but those approaches never actually solved the underlying problem, and some made things even worse. People were moving to the cloud, and they were moving with the same technology they ran on-prem, which added even more complexity and issues. So it was clear that a new table format designed for the cloud would give us a tremendous boost in performance and scalability.

The first issue that Iceberg is designed to solve is reliability. In Hive tables, we didn't have a concept of a transaction, so we could see partial results as a job progressed, and we could also end up with a corrupted table state if there was a partial failure in a job. In Iceberg, on the other hand, every operation is an ACID-compliant transaction with a configurable isolation level. Under the hood, Iceberg uses multi-version concurrency control, which means every time we modify a table, a new version of the metadata is produced. Engines pick the version they need and then stick to that version for as long as the operation requires it.

As a consequence, you can also do time travel, so in your SQL queries you can pick which version of the table you would like to select from. Iceberg relies on optimistic concurrency with a file-level conflict resolution mechanism. If multiple operations try to commit at the same time, Iceberg will implicitly check whether those operations are in conflict, and the check is done at the file level, which allows us to modify the same partitions concurrently as long as those operations don't conflict. Another critical design decision was to avoid any list or rename operations to plan a job, to commit a new table version, or to do table management. Doing lists and renames in an object store is never a good idea, so we wanted to design the format so that those operations are not needed at all.
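
The time travel just described can be sketched in Spark SQL like this (the table name db.events is a hypothetical placeholder):

```sql
-- Read the table as of a specific snapshot ID:
SELECT * FROM db.events VERSION AS OF 8781900558006985213;

-- Or as of a wall-clock instant:
SELECT * FROM db.events TIMESTAMP AS OF '2023-02-01 00:00:00';
```

Because of multi-version concurrency control, both queries read an immutable snapshot, so concurrent writers never affect the result.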

Another feature is the ability to keep metadata per file. In particular, we keep min/max values for different columns, and this enables us to filter files at planning time without opening the footers of those files. Even if you have a petabyte-scale table and a cluster with just a few nodes, you can still execute highly selective queries as long as you have proper partitioning and ordering of data. Iceberg also changed the way we work with partitioning, because Hive partitioning is broken in a way: it is explicit. Let's consider a very common example where we have a timestamp column and we would like to partition by day, derived from that timestamp. In Hive tables we would probably need to create an extra day column and populate that column every time we write to this table. What is even worse,

We would have to add extra predicates on that partition column to benefit from partition pruning, which puts a lot of complexity on the user. The underlying problem here is that the format itself doesn't really know about the relationship between these two columns, and that one is derived from the other. As a consequence, we need to know how the table is partitioned in order to query it efficiently, which, as you can imagine, is very easy to miss, and you may end up scanning the entire table. In Iceberg, you have a set of transforms which you can use to define partitioning. So in this example, I'm using the days transform on top of the event time column, and under the hood, every time I write to this table, Iceberg will derive the partition tuple for new records and store that mapping in the metadata. So whenever I query this table, I don't have to reason about how it is partitioned; I just focus on my business logic, and Iceberg takes care of the rest.
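
The hidden-partitioning example above looks roughly like this (table and column names are illustrative):

```sql
CREATE TABLE db.events (
    id         BIGINT,
    event_time TIMESTAMP,
    payload    STRING)
USING iceberg
PARTITIONED BY (days(event_time));

-- No derived day column and no extra partition predicate is needed;
-- Iceberg maps this filter onto the daily partitions automatically:
SELECT count(*) FROM db.events
WHERE event_time >= TIMESTAMP '2023-02-01 00:00:00';
```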

What is even more, Iceberg tables can support multiple partition specs at the same time. For example, if you create a table and two years later the existing partitioning doesn't really work anymore for the incoming data, you can choose to evolve the partition spec, and you can do that by adding, dropping, or replacing partition fields. Whenever we do that, the existing data stays in place, so we are not actually rewriting it; it is still persisted using the old partition spec. We simply create a new spec that we are going to use for the incoming data. Iceberg internally keeps the spec mapping for every file, so partition pruning still works with both specs.
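
Evolving the spec can be sketched with DDL like the following (this assumes the Iceberg Spark SQL extensions are enabled; identifiers are illustrative):

```sql
-- Switch from daily to hourly granularity for newly written data:
ALTER TABLE db.events
  REPLACE PARTITION FIELD days(event_time) WITH hours(event_time);

-- Or add an extra partition dimension for incoming data:
ALTER TABLE db.events ADD PARTITION FIELD bucket(32, id);
```

Existing files keep their old spec; only data written after the change uses the new one.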

And last but not least, Iceberg provides reliable schema evolution using column IDs. Column IDs allow us to avoid the edge cases of tracking columns by name or by position, so you can add, rename, reorder, or drop columns without any side effects, like you would expect from a normal database. All of the features that we just talked about were available in the project from early on. They not only allowed us to fix the problems of Hive tables, but also added a lot of advanced capabilities to our data lakes. So it was time to make the project production ready.
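
The ID-based schema evolution can be exercised with ordinary DDL, for example (column names here are hypothetical):

```sql
ALTER TABLE db.events ADD COLUMN region STRING;
ALTER TABLE db.events RENAME COLUMN payload TO body;
ALTER TABLE db.events ALTER COLUMN region AFTER id;   -- reorder
ALTER TABLE db.events DROP COLUMN region;
```

Because columns are tracked by ID, a rename or reorder never silently rebinds a column to data that was written under a different layout.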

The first area that we heavily invested in is the set of connectors. When Iceberg was released, it didn't have, for example, full support for Spark, because we couldn't integrate with the data source API we wanted. As a consequence, we partnered with the Spark community on DataSource V2, and right now we support batch reads and writes. You can also point your micro-batch stream at an Iceberg table, and even more, you can use an Iceberg table as a streaming source to consume changes out of it. Around the same time we added support for Flink, including the core functionality, as well as Hive, Dremio, Trino, Snowflake, EMR, Dataproc, Athena, and a lot of other vendors. So right now you have a rich set of connectors to choose from, which gives you a lot of flexibility: you can use Spark for heavy lifting and then choose Dremio or Trino for ad hoc analytics, all on the same table.

We also added support for metadata tables, which are a distributed way to query your table metadata. It is a really powerful feature that allows you to build a lot of services related to table management. Overall, there are about 16 metadata tables at the moment, and they provide insights into different aspects of your table state. You can see statistics about data files, partitions, available snapshots, history, and so on. In this example I'm selecting from the snapshots metadata table, and it gives me information about when a particular snapshot was committed, its ID, the operation, and a quick summary of how many files were added, how many files were removed, and so on.
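
The snapshots query mentioned here looks roughly like this (db.events is a placeholder table):

```sql
SELECT committed_at, snapshot_id, operation, summary
FROM db.events.snapshots
ORDER BY committed_at;
```

The summary column is a map with entries such as added-data-files and removed-data-files.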

There will be another talk by Zhan on metadata tables, and I suggest that you watch that one too. It'll be a deep dive: it'll explain all the types of metadata tables you have and when to use which, so it should be really helpful if you are interested. We also worked on high-level SQL procedures to manage tables. At the moment you can optimize data files, where you can pick between bin-pack, sort, and z-order. You can also rewrite metadata so that it's properly aligned with your partitions and your job planning works as efficiently as possible. You can also expire old snapshots, delete orphan files, and roll back table state. There's actually a lot more, which I won't be able to cover. In this example I'm expiring snapshots that are older than a particular timestamp, and I do that with plain SQL. Under the hood, that is a fairly complicated process of figuring out which files can be deleted from storage. There is another talk by Russell that will focus on how you can use these procedures to manage data files and do data compaction, and you can also watch another talk from 2020 that deep dives into these SQL extensions and their use.
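
The expiration example can be sketched as a CALL statement (catalog_name and the table identifier are placeholders):

```sql
CALL catalog_name.system.expire_snapshots(
  table      => 'db.events',
  older_than => TIMESTAMP '2023-01-01 00:00:00'
);
```

Under the hood, this works out which snapshots, manifests, and data files are no longer reachable and removes them from storage.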

All right, so we talked about how we made Iceberg production ready. Now let's talk about the ongoing phase of supporting more and more data warehouse use cases on top of data lakes. The first area is row-level operations. There are a few common use cases here: you may need to delete a set of records for regulatory compliance, you may need to correct records because of an issue in the ingestion pipeline, or you may need to propagate changes to your Iceberg table from other sources. Those changes may include not only inserts but also updates and deletes. In that case, you can use the MERGE command, which is a very powerful tool that allows you to perform all three operations in one transaction.
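
A minimal sketch of the MERGE command for that change-propagation use case (all identifiers are illustrative):

```sql
MERGE INTO db.events t
USING staged_changes s
  ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.payload = s.payload
WHEN NOT MATCHED THEN
  INSERT (id, event_time, payload) VALUES (s.id, s.event_time, s.payload);
```

The inserts, updates, and deletes are applied atomically against one table snapshot.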

Iceberg provides a lot of flexibility when it comes to encoding row-level operations. We support copy-on-write, which means we rewrite data files if there are records that have to be changed; that works really well for bulk operations. But we also support merge-on-read, and actually two types of merge-on-read: the first one uses position-based delete files, and the second is based on equality delete files. Each of these three approaches has its own trade-offs, but you can pick between them in different engines, and you can use all three on the same table, so it's not predefined. It's a fairly complicated topic, so if you're interested, you can watch another talk from 2022 that dives into row-level operations in Iceberg. And if you're into Spark, there is also a Spark presentation about the ongoing effort in DataSource V2 to support those operations.
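
Picking between these encodings is done per operation through table properties, for example:

```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',   -- write delete files instead of rewriting data
  'write.update.mode' = 'copy-on-write',   -- rewrite affected data files
  'write.merge.mode'  = 'merge-on-read'
);
```

merge-on-read keeps writes cheap at the cost of extra work at read time; copy-on-write is the opposite trade-off.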

I'm also excited to announce a few features that are coming in 1.2, which is about to be released. In particular, we added support for storage-partitioned joins. If you have two tables that are partitioned in a compatible way and you need to join them, you can now do so without any shuffles. That also applies to your aggregations, and it actually works for DELETE, UPDATE, and MERGE, so this is the key to enabling shuffle-less merge operations in Iceberg. There is one Spark limitation at the moment in 3.3, and it should be fixed in the upcoming 3.4. So once Spark 3.4 is out, you should be able to use shuffle-less merge operations in Iceberg.
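
As a sketch of how a storage-partitioned join is enabled (the config key comes from SPARK-37375; treat it and the table names as assumptions to verify against your Spark version):

```sql
SET spark.sql.sources.v2.bucketing.enabled = true;

-- If db.events and db.profiles share compatible partitioning,
-- e.g. bucket(32, id) on both sides, this join can avoid a shuffle:
SELECT e.id, p.name
FROM db.events e JOIN db.profiles p ON e.id = p.id;
```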

We also added support for aggregate push-down. If you're running an aggregate on top of an Iceberg table, there is a very high chance that we can now push some of that work into Iceberg metadata, because we store a lot of min/max statistics, record counts, and so on. If this happens, you can skip a lot of execution and save a lot of time. We also added support for change data capture in Spark. We added the changelog table, which you can query either through SQL or through the DataFrame API, and you can configure your start and end snapshots either through timestamps or through snapshot IDs.
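
A query that can benefit from this push-down might be as simple as:

```sql
-- These values can often be answered from manifest-level statistics
-- (record counts, per-column min/max) without scanning data files:
SELECT count(*), min(event_time), max(event_time)
FROM db.events;
```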

If you query this, you essentially get a DataFrame with your data columns, in addition to a few metadata columns that indicate the change type, the change ordinal, and the commit snapshot ID. At the moment, if you query it, you get inserts and deletes, because there is no concept of an update in Iceberg: an update is a delete followed by an insert. That's what you get from the changelog table, but we will also provide a utility to compute pre- and post-images on the fly, because you can reconstruct them if you know how the table format works. So you'll be able to access the full changelog. The important part about this design is that we don't store anything extra in your table; we don't store that changelog next to your table. It is computed for you on the fly.
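
One way to query the changelog from SQL, as a sketch (the create_changelog_view procedure name, option keys, and view name reflect my understanding of the 1.2 release; treat them as assumptions):

```sql
CALL catalog_name.system.create_changelog_view(
  table   => 'db.events',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2')
);

-- The resulting view exposes the data columns plus _change_type,
-- _change_ordinal, and _commit_snapshot_id metadata columns:
SELECT * FROM events_changes ORDER BY _change_ordinal;
```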

We also added support for branching and tagging. You can now create branches in your Iceberg tables, commit to those branches, and perform operations on them. And if you're confident that a change is correct, you can promote it into your main branch. Essentially, this will make an Iceberg table act like a Git repository. There are also a lot of other features that are still in progress. A lot of them are related to ease of use: we are changing some of the default values that didn't really work that well. There was also the problem that you needed to tune and configure Iceberg tables to get reasonable performance; we know about that problem and are trying to fix it as well.
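
The Git-like workflow can be sketched with branch DDL and branch write identifiers (the exact SQL surface has varied across releases, so treat every statement here as illustrative):

```sql
-- Create a branch and write to it without touching main:
ALTER TABLE db.events CREATE BRANCH audit;
INSERT INTO db.events.branch_audit SELECT * FROM staged_changes;

-- Once validated, tag the state and fast-forward main to the branch:
ALTER TABLE db.events CREATE TAG validated_state;
CALL catalog_name.system.fast_forward('db.events', 'main', 'audit');
```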

We are also providing more utilities for advanced table analysis so that you can stay on top of what's going on in your table. We are also finishing V2 table maintenance, so we will provide compaction of delete files and expiration of delete files. Right now it's somewhat limited, and we're working to solve those problems. We will also support end-to-end encryption, from the catalog to the metadata to the data files, so that you can check the integrity of your table and make sure that nobody corrupted it. We also have ongoing efforts around secondary indexing: right now we support zone maps, but we also plan to support bloom filters, inverted indexes, and dictionaries per file in the metadata; that work is still ongoing. We also want to support a view catalog. Right now, if you create a view in Spark, that view is not accessible in other engines like Trino, because each engine has its own way to define the view and its own representation of it.

So they are not really compatible, and we will provide the view catalog, which would make those views compatible across engines. On top of that, we can build support for materialized views as well. And last but not least, we're also working on dynamic partitioning. Because Iceberg tables can have multiple specs active at the same time, you can actually build dynamic partitioning, in the sense that if you detect that some partitions are too large, you can decide to split them into smaller sub-partitions on the fly.

That's all I wanted to cover. There is another talk by Ryan Blue later today about what's new in Iceberg. It'll dive more into these newer efforts and explain some of the use cases related to these features, so I would suggest you attend that one too; it would be a continuation of what I just talked about. I wanted to finish with a few key takeaways. Even today, Iceberg solves a lot of data warehouse-like use cases on top of data lakes, but there are also a few gaps, and I predict that in the next couple of years, a lot of the features you'll see in Iceberg will be performance- and ease-of-use-oriented, because those are probably the areas where we have the most work to do to catch up with data warehouses. If you want to be part of that journey, we always welcome new perspectives, and we would love you to join the community as well. Thanks for your attention; if you have any questions, I'll be happy to take them.