March 1, 2023

12:20 pm - 12:50 pm PST

Apache Iceberg’s Best Secret: A Guide to Metadata Tables

Apache Iceberg’s rich metadata is its secret sauce, powering core features like time travel, query optimizations, and optimistic concurrency handling. But did you know that this metadata is accessible to all, via easy-to-use system tables? This talk will walk through real-life examples of using metadata tables to get even more out of Iceberg and address questions such as: What is the last partition updated and when? Why are there too many small files? What Iceberg maintenance procedures can give us better query performance? Can we start building more advanced systems like data audit and data quality? How many null values are being added per hour? What is the latency of data ingest over time? We will also cover metadata table performance tips and tricks, and ongoing improvements in the community. Whether you are already using Iceberg metadata tables or interested in getting started, attend this talk to learn how this under-utilized feature can help manage data tables more effectively than ever before.

Topics Covered

Enterprises

Iceberg

Open Source

Sign up to watch all Subsurface 2023 sessions

Speaker

Szehon Ho

Software Engineer

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Szehon Ho:

Thanks everyone who joined and thanks to Subsurface for the invite to give this talk. As Tony mentioned, my name is Szehon. I’m a software engineer at Apple. I work on mostly open source Apache Iceberg on the same team as Anton and Russell, who’s also in this talk. Before Apple I was working in other companies like Informatica Cloudera and Crypto. In the Apache world, I’m a committer in Iceberg. But before this I worked many years on Apache Hive. I’m a committer in PMC there, so I have more experience in the Hive, but lately I’ve been really interested in Apache Iceberg.

Background of Apache Iceberg

So yeah, I think let’s just get started. So first quick brief overview about Apache Iceberg. I think there’s already a lot of talks in Subsurface about Iceberg. I heard some really good ones from Anton and Ryan. So the core concept, Iceberg, is a table format. It’s a way to lay out files in a table. It’s a specification, really. I would say the very core building block of Iceberg is the metadata file. This essentially is just metadata of other files. There could be other metadata files but eventually want to be metadata about data files. And what we track is just basic stuff. For example, membership, like what data files belong to the table, what data files belong to different partitions within a table. But aside from that, Iceberg has built many more features on the metadata file. I think some previous talks mentioned before, we keep column statistics in our metadata files. So for each data file we track a lot of column stats.

So help us prune away entire files during query planning time, which leads to really nice query performance at the table is structured correctly. And then I think there are some talks as well about snapshots. That idea is just basically, again, based on metadata files. We keep around the old ones every time we update the table. Those are now called snapshots. And with that, we can implement features like time travel, which maybe, you know where we can use those old snapshots to explore the table at those points in time. And I think from the previous talk tagging and branching is built on top of that as well. So a ton of features in Iceberg are built around metadata files. I mean, it’s all built on metadata files.

Metadata Tables

That’s the only physical manifestation of Iceberg. Many more that I didn’t have time to list, but today’s talk is about metadata tables. So metadata tables are just a way for me and you as users to get exposed to the metadata file and we can leverage all the raw metadata that Iceberg has. It’s exposed to us in the form of SQL, like system tables. We can query using SQL just as easily as we query the data tables. And I would even go so far to say, if you can answer a question using metadata tables you should do so rather than data tables. Just because querying the metadata files is going to be much faster than querying the data files. Just given the size.

So to me, this is why I think it’s really Apache Iceberg’s best secret. It’s an underutilized feature, but it really makes Iceberg just totally open and transparent as a table format. Users and increasingly tools using metadata tables to see under the hood how the system works and they can have a chance to improve it. So my talk is trying to be more solving user commonly asked questions. So I’ll try to go over how to use metadata tables to answer most problems you may have about Iceberg. Secondly, how systems can use the metadata tables to preemptively optimize the table and prevent performance problems down the line. And finally, if I have time, I even want to show how we can use metadata tables to go beyond what Iceberg has, like using Iceberg’s own metadata to build systems beyond what Iceberg itself provides. For example, data monitoring, auditing, and quality. So I think with that, I’m just going to go straight to the first example.

My First Metadata Table

Yeah, so my first metadata table and this is my own first one, but I think also a lot of first time users of Iceberg may hit this as well. So coming from High World, the first I remember the very first query I tried on Iceberg Spark was a show partitions just out of habit. And if you ever try that in Iceberg, you will know that this is not supported even today. So I remember being super confused. As Iceberg obviously supports the concept of partitions. So I asked on our internal Slack channel and then I remember getting the answer saying, oh, we support it, but it’s actually via the metadata table form. So instead of querying the data table where you query DB dot table, you do db dot table dot the keyword partitions and the partitions are going to be the type of metadata table.

And I remember trying that and immediately I was really impressed by how user-friendly this whole experience was compared to Hive. In Hive, the show partitions was always a one-off strange Hive specific statement. Whereas now with SQL it’s much more natural. Like, you can do cool things like what I’m doing here, you can order by sorting it. You can join this with maybe other data tables. So SQL brings a lot of power. And then secondly, I really like the fact that this table also had additional columns. So in Hive it returned only the partition values, but here we’re returning other things like record count and file count, which I think was super useful to have when you’re interested in partitions. So this is actually how I got very interested in Iceberg and started digging myself deeper into it.

As I dug into how partition’s table was implemented, I realized that it was actually just an aggregate view of an even richer table called the files table. And as I dug deeper, I think I realized Iceberg just has like so many metadata tables with very rich metadata that I think a lot of users may not even really know about. So hence my talk. I think Anton’s list this morning said there was 16. So this list is ever increasing. The community has added many metadata tables and is adding more as we speak. So I would encourage everyone to come back from time to time to see what new cool things metadata tables will offer. So for today’s talk, obviously with 30 minutes, I don’t have time to go over every single metadata table.

Tour of Metadata Files

So I think I’m just going to, in a broad stroke, try to tell you like in the broad categories which ones refer to which. So to me it was always very helpful to have this metadata diagram of Iceberg in the back. To me actually, metadata tables were a great tool to even learn Iceberg as I can visualize the metadata files in my own table. But essentially Iceberg’s metadata files are kind of hierarchical. That’s the main takeaway of this slide on the top. So skipping the catalog for a moment, which isn’t metadata file. At the top we have this root metadata file that has children, which are the snapshot files that has the children, which are the manifest files, and then the has the children, which are the data files. So an easy hierarchical structure.

Each metadata table corresponds to one type of this metadata file in this hierarchy. So like the latest table, I think is metadata logs or one of the later tables. And that it will give information about the root metadata file. If you want information about the snapshot files, you will go to snapshot’s history. That’ll give you how these snapshot corners I was talking about, where you can go back in time and do time travel. Underneath you have the manifest table that talk about manifest files, and then you have a whole bunch of other tables talking about the data files.

Metadata Files about Data Files

I guess the question is why we have so many tables, if there’s only this many layers. So it’s actually for convenience just to talk about the data files. A lot of users when they use Iceberg, they’re more interested initially in how Iceberg lays out their data, right? They may not necessarily know enough in the beginning to know or care about Iceberg’s own metadata. And even in there we have views that are more for simple use cases and then views that are more for advanced use cases. So the best example again, is the partition’s table that I mentioned, like my first metadata table. Even though partitions actually kept on each data file, right? Each data file has an annotation which partition it belongs to. We don’t necessarily want the user to have to query the files table to get the list of partitions, right? The very first time users. So hence we have a partition’s table, which is like a pre-aggregated view for the users.

And then if they want the more complete view, they’ll have the files table, which has partition information for each file plus other attributes, physical attributes, for example, like the file size, file length, file path, file content, things like that. And if you want, even like every single metadata we have on every data file, that’s going to be the entries table, which is kind of hidden. It’s more for advanced users, but that has even things like the snapshot ID of which made the file and things like that, kind of context information. I’ll try to go over some examples later. And the other dimension I’ll talk about is that we have this all underscore tables like you may see in the list, all files, all manifest and then just files, just manifest.

So the point here is like if you just do without all that’s going to be the metadata files of the current snapshot but if you do all, then you’ll get like as the name implies, the metadata files across all snapshots of that one type of that one layer. And then there’s also a third dimension, which I won’t have time to talk about data and delete files. These tables, I think I added at the end of last year, essentially you may have heard Iceberg V2 merge on Reed has this concept with delete files, which are kind of like negative data files in a way. And these are just views for that. But I don’t have time in today’s talk to talk about it.

FAQ: Partition Information

So I apologize, that was like a big whirlwind tour of like, because given we have so many metadata tables in the short time of the talk. I’m going to try to go over examples of commonly asked questions and see how we can use metadata tables to answer these questions. So start off with partition questions. And these are actually questions I think came up on the Iceberg dev list and some users as well has asked me this. So common questions about partitions are how many files you have per partition, what’s the total file size of each partition, and what’s the last update time per partition? So let’s just go over them one by one. How many files per partition? Like, how would you answer that using the metadata table? So if you remember from my first metadata table, like the partition’s table, that was easy. It already has a field called file count.

And so the answer here is just you just select that table and that field and you get the answer how many files per partition. For the total size of each partition, this is a good example where the partition’s table in Iceberg, we don’t have this field pre aggregated for the user. It’s pretty easy to implement, but we just don’t have it at the moment. So here’s how a user would be able to just use the underlying files table to get the answer to do the aggregation themselves. So each file again has a partition annotation, so you just group files by partition. Now you have different groups of files and for each group, I’m just summing up the file size and bites. So that’s going to be the total file size per partition.

Still kind of easy to understand. The last query on this slide is the biggest one and it looks complex, but I promise you this is probably the most complex one in this slide deck. So the idea here is that underneath files, there’s like a bigger thing called entries table, which has even more information. And there it has information about snapshot, like with snapshot added this file, because files update time in Iceberg, a file is not actually considered added to a table until it’s part of a snapshot. So we actually have to join the entries table with the snapshot table to find the files, commit time because that’s going to be the same for all the files in that snapshot. So that’s what we’re doing here.

We’re joining entries with the snapshots table, but then we’re doing the same thing. We’re grouping by the partitions, so for each partition group we go through and find the maximum snapshot commit time, and that translates really to the last update time per partition. I would say if you understand this query, you’re already like an expert in the Iceberg metadata tables and can answer pretty much any question someone throws at you about files and partitions.

Closer Look at Snapshots

I’ll explain this last query a little bit. So the concept of Snapshot ID also sometimes confuses the users. I think I’ve gotten this question because snapshots in Iceberg typically refer to a state of a table at some point in time, like example, snapshot Fu points to like 10,000 files. But snapshot is sometimes also referred to as an operation on the Iceberg table.

For example, snapshot Fu has added like five files that weren’t part of the previous snapshot. And that’s really what we track in the entries for each data file. We track what snapshot has added it and what Snapshot has deleted it. Because if you think about it, files live in an Iceberg table, it’s added once and then it goes to many snapshots where it doesn’t touch it and then it gets deleted once. So the point here is that entry snapshot ID tracks which snapshot added and deleted it, and then the status will be what’s the operation.

Snapshot Questions

And so to illustrate that a little bit, I’ll pose two questions that look similar but are slightly different versus what files have been added by Snapshot Fu. So the answer to this is that the previous slide entries table has snapshot ID, so you just query all the entries with Snapshot Fu and then status equal one, meaning added by Snapshot Fu. So that’s pretty easy. Status equal two obviously will be what files were deleted by Snapshot Fu. And then the second one, what files are all referenced by Snapshot Fu. Again, this will be a much bigger list than the first one. So this one, instead of that entries table, we’ll actually be using time travel. So if you know time travel, like this is the exact same syntax I think Iceberg uses for data time travel. And I’m just showing you can do the same thing with metadata time travel. So keep that in mind. You can select from files table version as of Fu and then get all the files as of time Fu, I think that’s sometimes helpful to do snapshot analysis.

FAQ: Disk Usage and Expired Snapshots

So, that’s like commonly asked questions. Now I’m going to talk about how systems can use Iceberg or metadata tables to know when to optimize or maintain Iceberg tables. I think there’s already talks in this Subsurface about table maintenance operations. This is going to be more at like the which the metadata tables to use level. One very common one is expired snapshots. I mean obviously it’s great that Iceberg keeps the old metadata, old data around for time travel purposes, but they do come at the cost of disc usage, right? So often you’ll have customers or users like first time users like deleting data and compacting data, but still not seeing the disc usage decrease. So the tables I like to query in this situation are going to be those all underscore tables that I talk about a few slides back.

And then compare that with the non all versions, the current snapshot versions. So these is actually some queries I like to graph in my own dashboard, like periodically just to visualize it. You can just query the file size of all files and all manifest compare with the files in manifest. And if the former is much greater than the latter, that means that you have a lot of data stuck in old snapshots and not necessarily the current snapshot. So you’re actually going to gain a lot by expiring snapshots in this case.

FAQ: When to Optimize Metadata

Another good operation to do from time to time is optimizing metadata. So this will really help the query planning time and even one thing is it will even help the runtime of metadata table quarries themselves, right? It’s like maybe if you have a show partition’s query taking a long time you would want to do this operation.

So the key table in this scenario to know when to execute this is the manifest table. Basically the simplest way is just to get accounts of the manifest files. If you know you have as many manifests as you have data files and something is quite wrong, you don’t want too many small manifest files, you actually want each manifest to have maybe a thousand, ten thousand or some big number of data files that it points to. That way you get a balanced hierarchical treat. You can even, in the manifest table, graph out for each manifest how many files they point to to see is there a point in time where something went wrong. And for very advanced users, you can even do things like sorting of manifest files. And so what does that mean, sorting manifest files.

So each manifest points to maybe a lot of data files and each of those data files have partition. So you could actually sort the manifest files so that each has like a discreet bounds of the partitions. And that way when you get like pruning in the partition you can prune away entire trees, sub trees quite easily. This is more of an advanced technique, I would say.

FAQ: Optimizing Data

Optimizing data. So this is also very important obviously for query performance. Small file problem has been the bane of query performance forever, right? Just due to the overhead it takes to open each file. Your friend here is the files table. So similar to the manifest table, the best way is just to get a count of files and the size and I think files table, you can actually group by the partitions.

And what I’m doing here is I’m getting the file count, which is just the count star, and then the average size, which is just again, the total size and bites divided by the count star. And so if you see your files each have only a few kilobytes, then that’s really too small and that’s going to be a really bad query to run on that table. You want to have bigger, maybe something bigger like HDFS block size, some tried and true values, things like that. And then the next one is like you can optimize the data files as well by sorting them. This is a column I added very recently, readable metrics column. I think it’s quite useful. In my previous talk on this subject, before this was available, it was returning a byte array, which was not very parsable.

With readable metrics, we can actually now drill down on each column and then get things like lower bound and upper bound for the metrics we have for that data file. And then that will come as that type, that column. So in my example, I have a string column and now I can visualize for each data file lower and upper bounds and again, see if I have the data file sorted, right? Like if they do have discreet lower and upper bounds. If they do, then obviously when you have a predicate pushdown on that column, you can prune away a lot of files that way. Yeah, I don’t have that much time left, so I just have a few more slides.

Ingest Monitoring

This is an example of a use case where I had a little bit to build something a little beyond Iceberg using what Iceberg provided. So in our architecture, we had an ingest pipeline of events flowing from like Kafka through Flink into Iceberg. And we wanted to just measure like the ingest pipeline, the performance of it in terms of data latency and data completeness. If you note, this is usually a hard problem to measure an ingest, but with Iceberg it was very easy. For data completeness, the trick here is just to make sure the source and target are bucketed by the same event time definition, and then compare the size of both, right? And so in our data model we just bucketed each event by the event time in Iceberg, which was our target. And then using the partition’s table just took a record count of each partition and then compared that to the each event time corresponding bucket in the source to see data completeness, like how many events were dropped or made it or did not make it to Iceberg.

Data Latency

Data latency, this is measuring how long events take to make it into Iceberg. And this is actually very reminiscent of that first query that was quite hard. The last update, time per partition. So if you remember over there we were trying to find in each partition, the maximum update time, the last commit time, right? So here we’re doing almost the same thing, but instead of that, finding the max difference between the commit time and then the event time, when that event was created. But how do we know in the metadata tables, what’s the least event time of a file? Well, again, we have lower bounds for that column, so we can just take the max difference of that lower bound event time. And then the commit time, which we get again by joining the snapshot table to get the data latency. So for each bucket how long it takes from the events, the maximum time it takes for an event source to land into Iceberg.

Data Quality

Yeah, just one more slide. So data quality I think is something that can really benefit from metadata tables. We have a lot of statistics again for each data file, for every column. And some interesting things are, for example, we can graph how many non-values we have, not a number of values which will indicate error. You can create alerts on that, maybe no values if they exceed the threshold, maybe lower and upper bounds if they have something you don’t expect is an interesting feel for data quality.

Future

On that topic, I think there’s a lot more metadata tables coming in the future. One interesting direction the Iceberg community is heading, very exciting, is the puffin files. I think it’s introduced and being used already. This is going to track even more advanced statistics like data sketches, maybe possibly boom filters. And the point here is hopefully there’ll be metadata tables on that and we can, as users, query them to get interesting things as well, like maybe histograms or even distinct value counts for different columns. So I think I’ll end here. I’ll leave some time for questions. But again, thanks everyone for attending this talk and hope you found it useful.