March 1, 2023
11:45 am - 12:15 pm PST
What’s New in Apache Iceberg
This session offers an update on what’s happening in the Apache Iceberg community. Join us to find out what the community has released and what new use cases those features enable.
This talk will include details on changelog scans, the Puffin stats/index format, and table references, as well as important new areas of development for 2023.
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Alright, so I’m going to be talking today about what’s new, or at least what’s landed over the last year or so, in Apache Iceberg. And I also want to point out quickly that we’ve put together an Iceberg cheat sheet for Spark. If you want to download it, scan the QR code here. I also put it on the next slide so that I don’t have to keep talking at the title slide and we can actually get rolling. So let’s talk about what we’re going to cover today. We’ve only got half an hour and there’s been a lot of great development in the Iceberg community. So today I’m going to cover the addition of PyIceberg, which is basically a brand new Iceberg implementation, entirely in Python, from the last year. We’re going to talk about the addition of branches and tags to the format.
We’re going to cover the REST catalog spec, as well as the Puffin format, which is a file format for sketch data. Another thing that I want to highlight is adoption, and then, looking forward a little, we’re going to take a sneak peek at what’s coming up in 2023. So with that, I hope everyone has had an opportunity to scan the QR code, and we’ll get moving.
All right, so the first thing I want to talk about is PyIceberg. I’m really excited about PyIceberg. The old Python implementation that we’ve been calling the legacy version was really based on Java. And you could tell that we basically took Java and translated it directly into Python, and there were a lot of things that didn’t make a ton of sense like empty classes that just defined methods that didn’t do anything.
And then we extended those. So we were basically taking Java interfaces and translating them directly into the most direct thing in Python. We’ve now replaced the entire legacy version with a new Pythonic implementation of Iceberg that works really well, and we’ve gotten to the point where we’re at feature parity with the old legacy version. So we’ve got interaction with catalogs, including Hive, Glue, and the new REST catalog. You can read and interact with all the table metadata. There’s even a CLI that lets you explore table metadata, which is really cool. And most importantly, the new implementation can plan scans. So you can actually load a table from a catalog, say, “Here’s my filter, here are the columns I want,” and it’ll hand you back scan tasks.
That is pretty much where we need to be in order to facilitate interaction with the rest of the Python and C++ ecosystem that builds on Python as this Glue layer. So being able to plan scans, I think, is really huge. Now we’re looking towards integration with other libraries like Dask and Ray and Bodo. And so we’re really excited about what we’re going to see coming up, especially in the next year now that this exists. Another aspect of PyIceberg that’s really cool is that we’ve translated some of the interfaces and approaches that we had on the Java side. So like Java, you can actually plug in your own file IO implementation, and we have two default ones. There’s PyArrow that supports basically everything under the sun. But we also have some very specific ones for when you don’t want a C++ backend or can’t get it installed, like s3fs.
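To make the pluggable FileIO idea concrete, here’s a minimal sketch of that kind of interface; the class and method names are illustrative stand-ins, not PyIceberg’s actual API.

```python
from abc import ABC, abstractmethod

class FileIO(ABC):
    """Illustrative stand-in for a pluggable file IO layer."""

    @abstractmethod
    def open_input(self, location: str) -> bytes:
        """Read the file at `location` and return its contents."""

    @abstractmethod
    def create_output(self, location: str, data: bytes) -> None:
        """Write `data` to `location`."""

class LocalFileIO(FileIO):
    """A trivial local-filesystem backend; real backends might wrap PyArrow or s3fs."""

    def open_input(self, location: str) -> bytes:
        with open(location, "rb") as f:
            return f.read()

    def create_output(self, location: str, data: bytes) -> None:
        with open(location, "wb") as f:
            f.write(data)
```

Swapping in a different backend is then just a matter of providing another subclass; the code that reads and writes table files never has to change.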
So, really cool development over on the PyIceberg side. Now, it’s not just that you can plan scans; there are actually some really great ways that you can interact with Iceberg tables in PyIceberg. I briefly mentioned that there’s a CLI now, and PyIceberg is actually one of the easiest ways to interact with a catalog and explore it. You basically have all the same functionality as SHOW TABLES, SHOW DATABASES, those sorts of commands, from the CLI, and it’s super quick. You just run pyiceberg list and you’re up and running. So the CLI is a really cool feature. You can even drill down into tables and see the metadata tree and all sorts of things in your Iceberg tables.
You can also, if you don’t want to use just the CLI to interact with metadata, read directly into PyArrow tables and Pandas. Well, I guess you convert PyArrow to Pandas, so it’s not direct, but you can read into Pandas data frames. So this gives you the ability to connect to your existing Iceberg warehouse, read any table, push down a filter, pull all of that data back, and then crunch it locally. It’s a really cool place that we’re at right now: basically full read support on the Python side, looking toward integrations with those distributed Python engines. And of course you can’t have a talk these days without talking about DuckDB. Of course we have DuckDB integration, which I’m going to show off on the next slide. It’s actually really cool to be able to connect to a catalog, pull data down, put it in DuckDB, and then run whatever analytic queries you want on it. And I also really love this meme from mullensms, or mullen’s ms, I’m not really sure which. I found this on TechCrunch. Putting some DuckDB on it is exactly what we did.
PyIceberg Examples CLI and API
So I just want to make this a little more concrete and show a couple of examples of what you can do in PyIceberg, both the CLI and the API. On the CLI side, what we’re doing is basically exploring the catalog that we’ve configured. You go edit .pyiceberg.yaml in your home folder, point it at your Hive Metastore or REST catalog, and all of a sudden you’re set up for operations like list, which shows all the top-level namespaces. Then you can drill in from there and list the tables in examples. And then you can also say, “Hey, give me the schema of examples.nyc_taxi_yellow.” So it’s really easy to interact with those catalogs, explore, find out what’s there, and then it’s equally easy to hop over into a Python notebook or a script that you want to run.
You don’t even need all that many import statements. Two import statements get you all the way to load a catalog, load a table, scan that table, push it into a DuckDB connection, and then execute DuckDB queries on it. So it’s really great where we’re at with Python. And I want to thank the entire Python side of the community. We’ve seen a ton of contributions here from Fokko at my company, Tabular, but also from June at Netflix, from Ted Gooch, who was formerly at Netflix and is now at Stripe, and contributions from Apple, Nvidia, and CCCS. A really broad range of contributors have made this happen, and it’s great to see a huge part of the community growing to support the Python implementation. So huge thanks to all of them. I get to talk about all this, but the contributors are the ones doing all the really important work.
Tag and Branch Use Cases
So, moving on, the next feature that I’m super excited about that came in the last year is tagging and branching. Iceberg tables, and I guess tables in general, have never really supported keeping multiple versions or old versions around, that sort of thing. But Iceberg has always had this Git-like structure where you keep all the snapshots around and you keep adding new versions of the table. Unlike Git, we actually clean out the old versions so that we can eventually delete data, because data files are a lot more expensive to keep around than source files. But we’ve never really taken advantage of that structure beyond having time travel in our tables. So what tagging and branching does is adapt some of those Git-like concepts to the Iceberg metadata. Because we have the same sort of reused tree, or persistent tree, structure, we can really easily tag older versions and say, “Hey, let’s keep this around for a year or three years or something like that.”
And that is really important for a number of use cases that people really want. The first one that we encountered at Netflix was audited financials. If you have a table and you’re using it for your financials, you want to keep the audited version around to make sure that everything went right and you can reproduce what was done at audit time. So it’s really important to be able to keep those versions around, but you also don’t want to keep them around forever, because they do cost money. And once you’re past your regulatory requirements for use cases like this, you generally want to get rid of them. Tagging has features for managing both of those concerns: basically being able to say, “Hey, keep this version around, but only for two years,” for example.
That’s one use of tags: basically the idea is just, “Hey, keep this around for a while.” The second use case that we’ve heard a lot of people want is model training. When you train a model, you want to be able to reproduce that model by knowing exactly what data and what code produced it. That way you can go back and figure out, “Oh, hey, what went wrong with this? Did we do something wrong? Was it the data or the code that made this model underperform?” And what you don’t want is for that version of your data to expire after some number of days. So you can really easily tag some training dataset and say, “This is the exact version that we used for our model training,” making all of this AI model building a lot more reproducible.
So I’m really excited about tagging. The cousin of tagging is of course branching, and branching probably has more impactful use cases overall. The first one that makes me really excited is this idea of inline or integrated auditing. This is something that we did quite a bit at Netflix, where we would write to a table but not actually update the current state of the table. We would basically build a snapshot that was not in the table’s history yet. Then we would audit that snapshot, run null checks and null counts, make sure that the data looked good and was consistent with the previous datasets, and then we would publish after that point. And that’s a really important pattern. A lot of people ask us about lineage as data travels through Trino and Spark and Dremio and all these different engines.
And one of the main things I always come back to is that you don’t need to rely on lineage if your primary use of lineage is to figure out where bad data went. What we want to do instead is prevent bad data from going downstream. So branching is a generalization of that pattern, where you stack up not just one commit but potentially multiple commits, run your audit or quality checks, and then publish by fast-forwarding your main branch to that branched state. I’m really excited about making that pattern much easier, because you can do things like create a branch in DDL these days. You can stack up multiple writes and fast-forward to that point in time. You can also do things like stage data across multiple tables and publish those at the same time, so that if you have a downstream job that accesses multiple tables, you’re not accessing different states at different times depending on when jobs complete.
So I’m really excited about the branching use cases and the ability to really improve data quality by building better processes around our tables. The last one is another request that goes way back: being able to test new code. You want to be able to run the job that produces your data against the exact same table so that you can use all the old data. If you’re running a merge that takes a set of new rows and merges them into the old table data, you don’t want to have to copy the entire table into a new place just to test your ETL with newer code before you deploy it. Instead, what you can do is set up your code so that it writes to a branch, compare the branch with the current version, or the data that was produced by the old code, verify that they match, and then deploy your new code safely. Then drop the branch and it’ll be cleaned up automatically.
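The write-audit-publish flow described above can be sketched with a toy in-memory model; this illustrates the workflow, not Iceberg’s actual API.

```python
# Toy model of write-audit-publish: snapshots are immutable row lists,
# and branches are just named pointers to snapshot ids.
snapshots = {1: [("a", 10), ("b", 20)]}          # snapshot id -> rows
refs = {"main": 1}                                # branch name -> snapshot id

def write_to_branch(branch, rows):
    """Commit new rows as a snapshot on `branch` without touching main."""
    new_id = max(snapshots) + 1
    snapshots[new_id] = snapshots[refs.get(branch, refs["main"])] + rows
    refs[branch] = new_id

def audit(branch):
    """A stand-in quality check: no null keys allowed."""
    return all(key is not None for key, _ in snapshots[refs[branch]])

def publish(branch):
    """Fast-forward main to the audited branch state."""
    refs["main"] = refs[branch]

write_to_branch("audit", [("c", 30)])
assert refs["main"] == 1          # main is unchanged until the audit passes
if audit("audit"):
    publish("audit")
```

The key property is that readers of main never see the new snapshot until the audit passes and the branch is fast-forwarded.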
One thing I do want to mention here is that branching is definitely not for transactions. A lot of people really want to stack up changes, merge those changes into another branch, and have the basic commit semantics of a transaction. But unfortunately this is a lot like Git, where the changes that you push into other branches are at the file level. So it’s “add these delete files” or “add these data files and remove these other files,” and you lose the semantic layer of what that commit was actually doing, whether it was a merge or a delete or something like that. So it’s not a mechanism that is safe for building transactional semantics. We’re going to cover transactions later; it’s just not a good idea to use branches for them.
Tags and Branches in Metadata
So that was a long explanation of the use cases around branches. I want to go into the Git-like model I mentioned a little bit. Here you can see a table with a couple of different snapshots, and the idea behind branching and tagging is actually really simple, and it’s super cheap. All we’re doing is saying that a table knows about some number of older snapshots, and we’re just labeling those snapshots with names like main or Q3 2022. Then we’re updating the expiration logic so that it keeps those referenced snapshots around longer than it would snapshots that are just in the history of some reference. So it’s actually a very simple model, and that means refs are cheap and you can use them very easily. Create them, blow them away; it’s all just metadata operations.
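Here’s a rough sketch of that labeling-plus-retention idea in plain Python; real Iceberg refs carry more fields, so treat the structure as illustrative.

```python
from datetime import datetime, timedelta, timezone

now = datetime(2023, 3, 1, tzinfo=timezone.utc)

# Snapshot id -> commit timestamp.
snapshots = {
    101: now - timedelta(days=400),
    102: now - timedelta(days=30),
    103: now - timedelta(days=1),
}

# Named references just label snapshots; a tag can carry its own retention.
refs = {
    "main": {"snapshot": 103, "type": "branch"},
    "q3-2022": {"snapshot": 101, "type": "tag", "retain_days": 3 * 365},
}

def expire_snapshots(max_age_days=5):
    """Drop old snapshots unless a live reference keeps them around."""
    keep = set()
    for ref in refs.values():
        age = (now - snapshots[ref["snapshot"]]).days
        if ref["type"] == "branch" or age <= ref.get("retain_days", max_age_days):
            keep.add(ref["snapshot"])
    return {sid: ts for sid, ts in snapshots.items()
            if sid in keep or (now - ts).days <= max_age_days}

remaining = expire_snapshots()
```

The tagged Q3 snapshot survives well past the table-level expiration window, while the unreferenced middle snapshot gets cleaned up; creating or dropping a ref is just an entry in that dictionary, which is why refs are so cheap.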
The last thing I want to cover on branches is just a little bit on how to create them, both branches and tags. Very simple syntax. Huge thanks to the contributors, including a number in China, who have really made all of this syntax happen. This is going out in the upcoming release, and it’s really great to see. So we have CREATE OR REPLACE TAG: basically ALTER TABLE, and then CREATE OR REPLACE TAG with some tag name, and the same thing for a branch. You can see how those would look here. I mentioned specific retention for tags, and you can set that up by saying RETAIN some number of days. So that’s the keyword: you say “CREATE TAG v5_training_set RETAIN 365 DAYS” to keep that around for a year.
You can also tag older versions. By default it’ll use the current main, but if you want to create a tag from an older version, you can just say AS OF VERSION, just like you would for time travel. Branches add another layer of complexity, which is how many snapshots, or how much history, behind that branch you actually keep. So you can add WITH SNAPSHOT RETENTION to set either the number of snapshots or the amount of time that you want to keep snapshots behind that branch reference. Now, that will keep data around, so be careful with this. By default it will use the table-level settings, which are reasonable defaults, I think five days and at least one snapshot. So I wouldn’t necessarily mess with this, but it’s good to know that it’s there.
REST Catalog Spec
All right, the next thing I want to cover is the REST catalog. This has been a huge development in the last year. It is basically a standard protocol for interacting with any catalog in Iceberg. There are a few motivations for introducing the REST catalog. Think of the Hive Thrift interface: everything knows how to talk to it, whether the thing on the other side is actually the Hive Metastore or whether it’s Glue. We have this sort of standard protocol that you can use in Athena and Snowflake and Trino and all these different systems. The issue is that we didn’t have anything like that for Iceberg catalogs. In Iceberg, we basically said, “Yes, it should be totally easy for you to use your own custom catalog,” and we’ve actually had a number of catalogs contributed to the project.
But the problem is that what you had to do was add an implementation into Iceberg as a new module and produce a JAR. And the problem there is, of course, that commercial databases don’t just accept random JARs from customers who want to get code into their database. The joke here is that Snowflake isn’t going to add your custom catalog JAR to their runtime, right? So what we needed to do, much like the Thrift interface, was standardize the protocol for talking to a catalog. That way we all have a common client that we can include everywhere, and you still have that same flexibility to be able to say, “I want all of my catalog data stored in DynamoDB or JDBC or something like that.” Standardizing that protocol, I think, was really important.
The other question is, why didn’t we just use Thrift? Why didn’t we just fall back to the old Thrift interface for a Hive Metastore? Well, the challenge there is that the Thrift interface is really specific to Hive and Hive tables and how they work, and there’s just a lot there that we don’t need. Plus it’s very difficult to work with. So we decided REST was a much better way of building a protocol. REST and JSON are very portable and easy to use across languages, and we could also extend catalogs with a number of new features, which we’re really excited about as well. Two that have already landed are change-based commits and lazy snapshot loading. One of the bottlenecks in Iceberg tables is how big your metadata JSON file gets.
And by moving to change-based commits, you don’t need to push all of the commits for a table, or write out all of the known snapshots for a table into metadata, every single time. You can just say, “Hey, I’m adding this new snapshot, please make it the current one.” So those requests are actually super small and much more efficient. On the flip side, we can lazily load the snapshots in a table, so you don’t need to pull thousands of snapshots down just to see the current state of the table and run your count star query. Lazy snapshot loading is also something that just landed in master, thanks to Dan Weeks, and it is really great to see these new catalog features coming.
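To give a feel for what a change-based commit looks like on the wire, here’s a sketch of a commit body in the spirit of the REST spec’s requirements-plus-updates shape; the field values are made up for illustration, and the OpenAPI spec is the authoritative schema.

```python
import json

# A change-based commit ships only the delta, not the whole metadata file:
# optimistic-concurrency requirements plus the updates to apply.
commit_request = {
    "requirements": [
        # Fail the commit if main has moved since we planned our changes.
        {"type": "assert-ref-snapshot-id", "ref": "main", "snapshot-id": 5},
    ],
    "updates": [
        {"action": "add-snapshot",
         "snapshot": {"snapshot-id": 6, "parent-snapshot-id": 5}},
        {"action": "set-snapshot-ref", "ref-name": "main",
         "type": "branch", "snapshot-id": 6},
    ],
}

body = json.dumps(commit_request)
```

Because the request only names the new snapshot and the ref to move, it stays tiny no matter how many snapshots the table has accumulated.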
We can also do a whole lot better, and we’re looking to add things like undropping tables. “Oops, I was in the wrong database and dropped a table” is a pretty common problem, and this is something that the catalog can, and probably should, support, depending on your catalog implementation. We also use this as a way to clean up a few problems, like metadata evolution. If your metadata JSON file is written by any client that happens to be in your infrastructure, then the versions of those clients start to matter: in order to take advantage of newer structures in your metadata file, you actually have to get every writer to upgrade to the new version. We didn’t really like that situation. So another thing that happens with the REST catalog is that you centralize the process that writes those metadata JSON files. That takes care of a couple of circular dependencies and makes sure that you only have to update one place in order to evolve your metadata and start using newer table features, even if you have older client versions out there.
So that’s super cool. The one I skipped that I really want to highlight here is multi-table transactions. This is something where the old catalog implementations probably are not going to be able to support what we want in the future, and multi-table transactions are just a really important piece. That’s something that is easy to specify in REST and build into a new catalog implementation, but it’s not something that we’re ever going to be able to shoehorn into a Hive Thrift interface. So we’re really excited about upcoming multi-table transactions. Very quickly, I want to say, “Hey, it’s an OpenAPI spec,” so it’s very easy to extend and use. There are also examples to help you wrap an existing catalog for use in a REST service: there’s a Docker image that helps you do this, as well as glue code called catalog handlers. And both the Java and Python implementations already support REST.
Last, because I’m running out of time, I want to very quickly cover Puffin, which is a file format for sketch data. Puffin is a new file format, and we get asked the question all the time, “Hey, why didn’t you just use Parquet or Arrow or some other existing project?” The answer is that we need a file format that can handle large sketch data. A sketch is something like an HLL buffer or a Bloom filter. And if you think of a Bloom filter, say you have a file with a million distinct values or distinct IDs. In order to have an effective Bloom filter with a reasonable false positive rate, you need about one byte per unique value in that Bloom filter. So already, if we’re talking about one file with 1 million unique values, we’re talking more than a megabyte in storage.
One-megabyte blobs of data just don’t fit well into the Parquet format or Avro or anything else that uses very small chunks of data and is basically geared toward values that are maybe four or eight bytes, like a long or a float. So the truth here is that sketches just don’t fit in Parquet. We built a new format specifically for sketch data so that we could store things like the buffers used to estimate NDVs, like HLL buffers, and things that actually support incremental updates. Because what we want to do is go do some expensive work, calculate the NDVs across a dataset, then save the intermediate state and just incrementally update it from there. And that’s what Puffin does.
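The “about one byte per value” figure falls out of the standard Bloom filter sizing formula; here’s a quick sanity check at a 2% false-positive rate (the rate is my assumption for illustration).

```python
import math

def bloom_filter_bytes(n_values: int, false_positive_rate: float) -> float:
    # Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits.
    bits_per_value = -math.log(false_positive_rate) / (math.log(2) ** 2)
    return n_values * bits_per_value / 8

# One file with a million distinct IDs needs roughly a megabyte of filter.
size = bloom_filter_bytes(1_000_000, 0.02)
```

At 2% that works out to just over eight bits, so roughly one byte per unique value, which matches the speaker’s rule of thumb.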
So currently we have theta sketches that are used for union and intersection and basically NDV estimation. We’re adding Bloom filter sketches and we’re going to use those for indexing in the future. So this is all geared around better cost-based planning and better data feeding into query engines. So we’re really excited about Puffin.
The last thing I want to highlight from the last year is not actually a feature of Iceberg. It is the amazing increase in commercial adoption that we’ve seen. I’ve got Dremio up here first because I want to say thank you for hosting Subsurface this year. Well, and every year. But there is a ton of great Iceberg content, and a huge thank you to Dremio. They’ve been Iceberg fans from the very beginning. But this year we’ve also seen a number of announcements from a lot of other big database vendors.
So commercial adoption is really coming along. We’ve got Snowflake, BigQuery, AWS Athena and EMR, and Starburst, and Cloudera is moving all of their customers over to Iceberg. This is a huge vote of confidence from database vendors and database engineers and builders that Iceberg is the right format moving forward and the one that they’re all betting on. And I think that is really a reflection of the amazing Iceberg community. Iceberg tries to be vendor neutral and really wants to include everyone, so that companies like these can invest in the format and we can all share this data layer, essentially. I think that has been a really crucial strategy, as well as just the amazing contributions, like the ones I just highlighted, from the entire community.
And today I also want to add one more vendor to that list. I want to announce that this morning we actually opened up signups for Tabular, which is my company. Tabular is a storage layer; it’s basically everything you need except for the compute engine, and you can use it with basically any of those vendors or anything that supports Iceberg. So we are really excited about the announcement and getting people’s hands on Tabular. Please go sign up for it and tell us what you think. And if you sign up in the first 24 hours, we will also send you a T-shirt. So thank you all for bearing with me while I talk about my company.
Coming in 2023
All right. Lastly, I want to highlight a few things that are coming up: multi-table transactions, of course, table encryption, and portable views. Those are going to be priorities in the next year. Also, the 1.2 release is coming out very soon, and it has some really amazing features. It’s got all of that DDL for tags and branches, lazy snapshot loading, and Spark and Flink reads and writes for branches and tags. And it also has storage-partitioned joins in Spark, which is a huge win for joins. If you have two datasets that are bucketed, or even partitioned by the same date field, then you can use a storage-partitioned join and basically avoid a lot of shuffle. So it’s a really cool feature and I’m glad to see that coming out.