May 2, 2024
Iceberg Development at Apple
Apache Iceberg has evolved from a niche project to a major industry player, boasting a global community of skilled engineers from companies like Amazon, Alibaba, Cloudera, Dremio, Tabular, and notably Apple. Apple, a significant open-source supporter, has been actively involved with Iceberg since its inception, contributing five members to the PMC. Our teams are pivotal in driving Iceberg’s development, enhancing features like metadata tables, Z-ordering, and vectorized reads. In our presentation, we’ll discuss how Iceberg aligns with our needs and outline future plans for tackling upcoming challenges.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Russell Spitzer:
I’m here to talk a little bit about Iceberg development at Apple. Just for a quick introduction of who I am, I am currently a member of the Apache Iceberg PMC. I’m also an engineering manager at Apple. Our team basically focuses on how to use Iceberg technology as part of a lakehouse at Apple and make that available to all of the engineers and users within our system. Now, it’s been a long journey for Apple to move to Iceberg, but we basically use it all over the place now. And what I’d like to talk about in this presentation is how we got to this position. So we’re talking about Iceberg development at Apple. My goal right now is to tell you about all of the things that drew Apple as an organization to working with Iceberg, what we’ve done to improve the project, and what we’d like to do in the future. So let’s get started.
So high-level overview, we’ll start off by talking about how Apache Iceberg is actually used at Apple. What are we using it for and why did we choose to use it? We’ll talk about that question a lot: why Apache Iceberg? Why not other technologies? Why is this the thing that we’ve decided to invest development energy into? Then, of course, what are the things we’ve done for the Apache Iceberg project? I know a lot of people don’t know this, but Apple actually has significant investments in open-source technology. And Apache Iceberg is one of those technologies where we’ve been contributing not just to the Apache Iceberg project, but to a lot of other partner projects that work with Apache Iceberg. We’ve been trying to improve the entire ecosystem around Apache Iceberg so that we can use it internally, and, of course, so those improvements can be used across the industry. And last up, I’m going to talk a little bit about what we’re excited about working on next in the Apache Iceberg project, because we are not done with our investments in Apache Iceberg. We still have a lot to do and a lot of exciting work coming ahead.
How Apache Iceberg is Being Used Right Now at Apple
So let’s go straight into how Apache Iceberg is being used right now at Apple. To remind everyone about what Apple is, which I probably don’t have to do, but we are an extremely large company dealing with a huge number of use cases and an almost unprecedented scale of data. And, of course, a variety of use cases. Now, unfortunately, I can’t go into a lot of discrete details on the things that we’re working on, but I can go at a very high level over the sorts of things that we’re doing with Iceberg at Apple. The first thing to let you know is Iceberg is ubiquitous. We’re seeing Iceberg in all sorts of different use cases across all different parts of Apple. We’re talking about tables that go from the hundreds of megabytes up into the many petabytes. In all of these use cases, Iceberg has found its place as the analytics choice when you’re making new tables. We often have folks coming to us and asking how they can keep moving all of their old data sets into Iceberg, because everyone has been seeing that once you’ve gotten into the Iceberg format, there are a lot of benefits and you become interoperable with a whole lot of other systems.
And, of course, we’re also dealing with a lot of different ingestion modes. So at Apple, we have a lot of different modes of ingesting data. We have a lot of real-time streaming, micro-batches, and, of course, just straight-up ETL batch workload. So for all of these different things, we’re using Iceberg. So we’re not just using it for one kind of ingestion, one scale, or one particular type of use case. It is going across the entire spread of Apple use cases.
Now, the thing that I’d like to say is that we’re not just using Iceberg a lot. We’re also investing a lot of time into the development of Iceberg. So if you aren’t aware, at Apple, we have open-source developers on a variety of projects, and all of them have been helping out in this Iceberg effort. So, for example, we have Iceberg developers working on Spark, Flink, and Trino, and the integrations of all of those things together. We’re constantly collaborating both inside of Apple and outside of Apple to try to make sure that performance and usability are at the level that is required to run at Apple scale, basically.
So as I said, one of our goals here is to make sure that the project becomes an industry standard, that everyone is using it, because the more people that use this project, the more people that are contributing, the more likely it is that all of our investments here will also pay off in the long run. So why did we choose Iceberg as the technology to make this big investment in, to have everyone working on, to be migrating use cases to? There are obviously a lot of different technologies we could have chosen from and a lot of different vendors offering data lakes, data warehouses, and other platforms. But it comes down to a couple of specific things: why did we care about open-source software so much, and why choose Iceberg specifically out of the open-source options? So I’ll walk through our key reasons for choosing Iceberg. One is that it is a truly open-source platform. Open-source software is extremely valuable to all of our work because of a couple of key benefits. One, all of the feature work ends up being in the open. Whenever anyone makes improvements to the software package, we get those improvements. And whenever we make improvements, we know not only is everyone going to be able to take advantage of them, but they’ll be maintained by other folks as well. The more eyes on the code, the better it is, the more bugs are found, the more the system improves over time. We really want to make sure that when we’re working with a project, it’s something that we can contribute to and that doesn’t have any specific lockouts for parts of the code base. That’s one of the things we worry about with systems that are partially open-source, where some of the work happens in an open-source platform, but some of it is hidden away.
The next thing that we thought was really compelling about the Iceberg project was the ACID guarantees. We’ve actually had many use cases at Apple that have tried to invent their own ACID guarantees on top of Hive or on top of file-based systems. And this ends up pushing the complexity of keeping a consistent system out of the infrastructure technology and into user application code. We wanted to make sure that those kinds of guarantees were baked into the table format itself so that when people are building their own applications, they aren’t thinking about, “Well, how do I actually write a serializable transaction that modifies, like, 30 rows in my table?” We wanted that to be a built-in part of the table format.
And, of course, high performance. We’re talking at Apple about tables on the scales, as I said, of many, many petabytes very frequently. We needed a system that would be able to scale up to that volume of data and be able to handle updates on that volume of data. I’ll talk a little bit more about the things we’ve done to help that out in the future, but we basically compared this to other systems and found that Iceberg delivered the performance we needed at the scale and volumes that we required.
And then, last but not least, we wanted a system that was relatively simple, didn’t have a lot of server components, and was very easy for someone to develop a new reader or a new implementation for. And the reason for this is that we have developers who want to use Iceberg from a variety of different programming languages. It’s really important that the standard and spec of the format itself be strongly defined and very simple, so that we could have various writers and readers in different programming languages and have them all work on the exact same table at the exact same time.
So, there are obviously other choices we could have made, but going back to the story: when we started looking at Apache Iceberg, we had a couple of other things we were considering. Obviously, we had a lot of data warehousing technologies, we had a lot of Hive systems, and we had a lot of file-based tables. Now, in all of these cases, there were key problems with those systems. As I said, Hive basically drove the complexity of doing ACID transactions into user space rather than handling it as part of the table format. Data warehouses are often extremely expensive, and the cost usually scales with the amount of data that you put inside of them. And of course, file-based tables on their own are not at all ACID-compliant and, again, push a lot of maintenance onto the user side. So, when looking through the different technology options we had at the time, there was really only one that was simple, highly performant, and truly open-source, and that was Apache Iceberg. That led us to start our journey of not only integrating it into our actual use cases, but also doing our own development in the Apache Iceberg project. And that’s what I’d like to talk about next: what have we actually done as developers to help out the Apache Iceberg project?
How Apple Has Contributed to Apache Iceberg
So, what have we done for Apache Iceberg since it was in that pre-incubating phase, rather than the industry juggernaut it is today? Well, it kind of boils down to the fact that we were worried about certain use cases at Apple and how we would be able to solve them with Apache Iceberg. So, at a very high level, these are three of the things we were thinking about. First was regulatory compliance. With big data systems, you often have to maintain certain things for regulatory compliance, like the DMA or GDPR, where you have to go into an existing data set and modify records on a row basis, not on a partition basis. Now, in systems like Hive, a partition-level update is extremely easy. You can basically wipe out the entire partition and replace it with all new data. But the kind of updates we need to do to handle new privacy regulations require going in and eliminating maybe a single row in a single file within a partition. If we have to update the entire partition, that ends up being extremely expensive. So, we decided we were going to make a significant investment in making sure that row-level operations and row-level updates worked well inside of Apache Iceberg.
Following that, we also have a lot of streaming use cases, cases in which the data within the table is being changed constantly. Now, with Apache Iceberg, that meant that we needed to have ways to perform maintenance operations in an efficient way that scaled for all of our different use cases. So, for that, we worked on a lot of different maintenance procedures, which are now part of the Apache Iceberg code base. And last but not least is scale. We wanted to make sure that Apache Iceberg didn’t just work for those smaller tables. We wanted to make sure it would scale up to the largest tables we have. So, we were constantly looking for places that were bottlenecks in the Apache Iceberg code and trying to figure out ways that we could improve them so that we could actually handle every single table at Apple.
Solving the Problems Apple Ran Into
So, let’s talk a little bit in detail about how we solved some of these problems and the kind of changes we made to Apache Iceberg to make that happen. So, for row-level operations, the key work has been done in a couple of different sub-projects. One is getting copy-on-write done. So, copy-on-write is a type of row-level operation that makes it possible to do updates, deletes, and merge operations that affect single files rather than an entire partition in a table. So, like I said before, in old systems you could do updates, but only at a partition level. Copy-on-write suddenly made it so that you could start doing these operations and modifying single files within a single command. So, suddenly you can do an update or an overwrite that affects only a single file in a partition rather than the entire partition. So, this was really good for use cases where the changes we were making were extremely dense or affected most of the rows in a single data file. This was actually pretty common for use cases where all of the data was sorted and we were taking care of large blocks of contiguous data at the same time and trying to rewrite them.
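To make the idea concrete, here is a toy Python sketch of file-level copy-on-write. The file layout and function names are invented for illustration, not Iceberg’s actual API; the point is just that an update rewrites only the data files that contain affected rows, and every other file is carried over untouched.

```python
# Toy model: a partition is a list of "data files", each a list of row dicts.
# A copy-on-write update rewrites only files holding rows that match the
# predicate; unmatched files are referenced as-is in the new table state.

def copy_on_write_update(data_files, predicate, update_fn):
    """Return (new file list, number of files rewritten)."""
    new_files = []
    rewritten = 0
    for f in data_files:
        if any(predicate(row) for row in f):
            # Rewrite the whole file, applying the update to matching rows.
            new_files.append([update_fn(row) if predicate(row) else row
                              for row in f])
            rewritten += 1
        else:
            new_files.append(f)  # untouched file carried over unchanged
    return new_files, rewritten

# Three "data files" in one partition; only the file holding id 5 is rewritten.
files = [[{"id": 1}, {"id": 2}], [{"id": 5}, {"id": 6}], [{"id": 9}]]
updated, n = copy_on_write_update(files, lambda r: r["id"] == 5,
                                  lambda r: {**r, "flag": True})
```

With a partition-level model, all three files would have been rewritten; here only one is.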
But it turns out that that doesn’t actually do very well in other use cases, use cases where the deletes and updates are very diverse, very spread out amongst the entire data set, and probably only one or two records are affected in every data file. For that, merge-on-read was implemented. So, merge-on-read is a system of doing updates inside of Apache Iceberg with the creation of what we call delete files. A delete file sits alongside a data file and marks specific rows which have been removed. By doing this, we don’t actually have to rewrite an entire data file when we want to apply a delete. Instead, we write a new delete file that sits next to it, and when we read that file later, we read the data file and the delete file at the same time and ignore all of the records which have been deleted. So, in this way, we’re able to successfully do deletes that are extremely sparse without writing a whole lot of new information or rewriting a lot of old information.
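The read side of merge-on-read can be sketched in a few lines. This is a simplification of Iceberg’s real positional-delete format (which is itself a file with richer contents); the takeaway is that deleted rows are filtered out at read time instead of being rewritten on disk.

```python
# Toy model of a positional delete file: a set of row positions within one
# data file that a later snapshot marked as deleted. The reader merges the
# two at scan time, skipping deleted positions.

def read_with_deletes(data_file_rows, deleted_positions):
    """Yield live rows of one data file, skipping positions in the delete file."""
    return [row for pos, row in enumerate(data_file_rows)
            if pos not in deleted_positions]

data_file = ["alice", "bob", "carol", "dave"]
delete_file = {1, 3}  # rows 1 and 3 were deleted, data file left untouched
live_rows = read_with_deletes(data_file, delete_file)
```

Deleting two sparse rows cost one tiny delete file rather than a rewrite of the whole data file.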
In particular, some of the exciting things that we’ve done within this part of Apache Iceberg is distributed planning and delete file read improvements. So, if you look through the code base, you’ll see that recently we’ve added a whole lot of changes that help you plan the delete application in a distributed fashion, as well as more efficiently read those delete files. Because, as we started using this technology internally, we found out a lot of the bottlenecks that were happening when we started getting up to those many petabyte tables. So, of course, we made fixes to improve that performance and shared those back with the community.
Now, those are just kind of half of the solution. Copy-on-write and merge-on-read are both ways of effectively rewriting a table without having to rewrite every single data file in that table or rewrite entire partitions. But one of the key improvements that we’ve made is a dual-project improvement, which is storage-partitioned joins. Now, this is an improvement which not only affected Apache Iceberg, but also affected Apache Spark. So, engineers at Apple worked on both projects at the same time to make sure that this fix would go in and be applicable to anyone who wanted to take advantage of it. Now, storage-partitioned joins made it so that the actual lookup of which records needed to be modified no longer required a shuffle on the Spark side. So, previously, having to do either copy-on-write or merge-on-read would first require identifying which records needed to be modified, which in Spark basically meant an expensive shuffle.
Now, obviously, this was a big pain and a big performance bottleneck. Storage-partitioned joins now enable us to look up all of the data within two partitions by aligning them first rather than having to do a full shuffle across partitions. So, basically, this is a way of getting away from shuffles while still being able to perform semantically the exact same join. So, again, these are all improvements that not only did we make for our own internal use cases, but we shared them back with the project in general and made sure they would be open to anyone who wanted to use them.
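The intuition behind a storage-partitioned join can be sketched like this. The partition layout and function are invented for illustration (this is not the Spark/Iceberg API): because both tables are already stored with matching partitioning, each partition pair can be joined locally with no shuffle at all.

```python
# Toy model: each table is a dict of partition name -> rows. Because the
# layouts align, we join partition i of the left table against partition i
# of the right table; no row ever moves to another partition (no shuffle).

def storage_partitioned_join(left_parts, right_parts, key):
    """Inner-join two tables partition-by-partition on `key`."""
    assert left_parts.keys() == right_parts.keys(), "partition layouts must align"
    out = []
    for part in left_parts:  # one purely local join per partition
        right_index = {r[key]: r for r in right_parts[part]}
        for l in left_parts[part]:
            if l[key] in right_index:
                out.append({**l, **right_index[l[key]]})
    return out

left = {"p0": [{"id": 1, "a": "x"}], "p1": [{"id": 2, "a": "y"}]}
right = {"p0": [{"id": 1, "b": 10}], "p1": [{"id": 3, "b": 30}]}
joined = storage_partitioned_join(left, right, "id")
```

A shuffle join would first repartition both sides by the join key; here the storage layout already did that work.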
Addressing Streaming and Ingestion Issues
Now, next up, I want to talk a little about streaming and ingestion issues and the work we’ve done around there. So, obviously, all the things I just talked about are really great for when you need to modify existing data, but what about getting that data into Iceberg in the first place? One thing that we quickly realized is we have a lot of users who are writing data really frequently into their Iceberg tables, at a high enough frequency that the amount of metadata that was building up was becoming, again, a bottleneck on their performance. So, one of the first things that we started doing was writing new distributed versions of all of the maintenance procedures. So, if you look inside of the Spark implementation inside of Iceberg, for example, we have distributed versions of rewrite data files, rewrite metadata, rewrite delete files, all of these things built in now, all written, or at least mostly written, by engineers at Apple. We had users who were basically using the original implementations of these maintenance procedures when they just started out, and we realized that for them something like rewrite metadata would take something like two hours, whereas when we switched to a version that utilized Apache Spark, suddenly we were able to bring that down to several minutes. So, these are the kinds of things that we’ve been investing a lot in to make sure that these ingestion procedures can be handled and that maintenance can successfully keep Iceberg tables usable over time as they scale up in size and in number of files.
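One reason these procedures parallelize so well is that compaction planning naturally breaks into independent groups. Here’s a toy sketch of that planning step (the grouping heuristic and names are illustrative, not Iceberg’s actual rewrite-data-files implementation): small files are bin-packed into near-target-size groups, and each group can then be rewritten by a separate task in a distributed engine.

```python
# Toy model of compaction planning: greedily pack small files into groups
# whose total size approaches a target file size. Each resulting group is an
# independent unit of work, so a distributed engine can rewrite them all in
# parallel instead of serially on one machine.

def plan_rewrite_groups(file_sizes_mb, target_mb=512):
    """Return lists of file sizes; each list is one rewrite task's input."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)          # close the full group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Sixteen 64 MB files collapse into a couple of near-512 MB rewrite tasks.
groups = plan_rewrite_groups([64] * 16, target_mb=512)
```

With thousands of small files, the win comes from fanning these groups out across a cluster rather than from the grouping itself.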
In addition to that, we’ve done another dual-project kind of thing, which is write distribution modes. So, write distribution modes caused us to make changes, again, to Apache Spark and to Apache Iceberg to make sure that when you’re inserting and appending data into your Iceberg tables, we can align it correctly with the partitioning of the Iceberg table before writing. Now, those of you who are familiar with the early days of Apache Iceberg might remember those days where when you did an append, you always had to sort your information first before writing it into your table, and you would have to sort it or rearrange it in a way that matched the partitioning of your Iceberg table. If you didn’t do this, each Spark task would end up writing a new file for every partition it touched.
Now, if you had randomly distributed data, this would end up basically being a multiplying effect. For every partition in your Iceberg table, you would multiply that by the number of Spark tasks that you were using to write, and that would be the number of files you would produce in your append job. What we did with write distribution modes is we basically gave Spark the ability to ask a data source, when it’s writing, how the data is actually supposed to be laid out, and then Spark could automatically take that information and shuffle it or rearrange it in a way that would ideally align with the table that you were writing to. Now, this is another one of those improvements that not only did we make in open source, we made it in such a way that any data source that integrates with Apache Spark can take advantage of this interface and just declare its own distribution modes to efficiently write data.
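The multiplying effect is easy to demonstrate with a toy model (the task assignment below is invented for illustration, not the Spark API): with randomly scattered rows, each task writes one file per table partition it sees, while distributing rows by partition value first collapses that to one file per partition.

```python
# Toy model: rows carry a table-partition value, and each write task emits
# one output file per (task, partition) pair that receives any rows.

def files_written(task_assignments):
    """Count distinct (task, partition) pairs -> number of output files."""
    return len({(task, row["part"]) for task, rows in task_assignments.items()
                for row in rows})

rows = [{"part": p} for p in range(4) for _ in range(10)]  # 4 table partitions

# 'none' distribution: rows scattered round-robin across 5 write tasks.
scattered = {t: rows[t::5] for t in range(5)}
# 'hash' distribution: rows routed to a task by partition value before writing.
clustered = {t: [r for r in rows if r["part"] % 5 == t] for t in range(5)}

many = files_written(scattered)   # every task sees every partition
few = files_written(clustered)    # one writer per partition
```

At real scale (thousands of tasks, thousands of partitions) the scattered case produces millions of tiny files, which is exactly the metadata explosion the maintenance procedures then have to clean up.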
So again, open source work in Apache Spark and in Apache Iceberg that not only enables new use cases and easier writing to tables, but also opens that up to other technologies that might want to use this kind of algorithm. And then finally, one of the things that we’ve been focused on a lot is Flink improvements. We actually have a lot of Flink use cases at Apple, and we’ve been working really hard with our Flink team and the Apache Iceberg folks that work with me to make sure that Flink is continually improving. We want to make sure that Flink is always able to be used as a source for writing to Iceberg tables, and eventually, and I’ll talk about this a little more later, we want to make sure that Flink can basically act as a standalone Iceberg application, doing its own compaction, its own maintenance, and its own ingestion all at the same time without the need for some kind of ancillary system like Spark to do all those maintenance procedures that I talked about above.
Working at Scale
So obviously, we’ve done all of these sorts of things for ingestion, but lastly, I wanted to bring up some of the changes that we’ve made to address the scale at which we’re working. So one of the key things that we’ve hit multiple times is trying to figure out how to work with all of this Iceberg technology on extremely large-scale tables. Now, for this, we’ve often had to think about new ways of approaching old problems, or just completely alternative implementations. For example, we’ve been working a lot on distributed planning of delete files, as I’ve talked about previously, because we’ve ended up with many tables where the metadata itself has to be read in a distributed fashion, because there’s just so much of it.
In addition, we’ve done a lot of work on the vectorized readers, because for us, even getting something like a 50% boost in performance really multiplies out over a long period of time because of how large our tables are. And then, of course, there are enhanced pushdowns. One of the key things we were often asked when we originally brought Apache Iceberg in was: we already have all of this metadata about what’s in the files and what’s in the partitions, so why do we have to look at any files at all when we’re making queries that aggregate data already described by that metadata? So in order to make that better, we’ve had engineers on the Trino side, on the Spark side, and on the Iceberg side all working together again to make sure that when you do an aggregate query, it will get pushed down to your Trino engine, or pushed down through Spark all the way down to Iceberg, returning information without having to do any data file scanning at all. We’ve had some users that recently put this into production and took queries that used to run for over an hour down to the range of seconds, because they no longer have to touch data files at all. In addition, we’ve worked hard to make sure that we have pushdown of the new Spark V2 transforms. This is basically one of those new technology leaps that lets you start querying Iceberg’s internal transforms from Spark.
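Here’s a toy sketch of why metadata-only aggregates are possible at all. Iceberg tracks per-file statistics such as row counts and column bounds; the stats layout below is invented for illustration, but it shows how COUNT, MIN, and MAX can be answered from manifests without opening a single data file.

```python
# Toy model of per-file metadata: each entry describes one data file with a
# row count and per-column min/max bounds, the way Iceberg manifests do.

file_stats = [
    {"rows": 1_000, "min": {"ts": 100}, "max": {"ts": 199}},
    {"rows": 2_500, "min": {"ts": 200}, "max": {"ts": 450}},
    {"rows":   500, "min": {"ts": 451}, "max": {"ts": 460}},
]

def pushdown_aggregates(stats, col):
    """Answer COUNT(*), MIN(col), MAX(col) from file metadata alone."""
    return {
        "count": sum(f["rows"] for f in stats),
        "min": min(f["min"][col] for f in stats),
        "max": max(f["max"][col] for f in stats),
    }

agg = pushdown_aggregates(file_stats, "ts")  # no data files touched
```

Note this only works cleanly when no deletes complicate the counts; real engines fall back to scanning when the metadata alone can’t give an exact answer.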
And of course, one of the last things I wanted to quickly talk about is zero-copy migrations. As I said, Iceberg is taking over Apple pretty rapidly. And one of the key things we’re always asked is, we already have tables in Hive, we already have a file-based table, how do we make sure that table gets into an Iceberg table? So one of the very first things we wrote and started contributing back into the Apache Iceberg project was the snapshot and migration utilities. We wanted to make sure that we would be able to take any user who wanted, and take their data from their pre-existing system, and move it into Iceberg without doing any copy of data files. Again, when we’re working at the kind of scales that we are at Apple, we need to do that without rewriting data files. If we had to rewrite our data files every single time we wanted to migrate a table, it would be a significant cost. Now, of course, this is a lot of work. We’ve done a lot of different things here. But you might say, isn’t that enough? Haven’t you gotten enough into the project that you’ve accomplished all of your goals? And I’d say this is actually just the beginning of the work we’re doing.
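The essence of a zero-copy migration can be sketched like this (the metadata shapes and names are illustrative, not the real snapshot/migrate procedures): the migration walks the source table’s existing files and records their paths and stats in new table metadata, so no data bytes ever move.

```python
# Toy model: migrating a file-based table to an "Iceberg-like" table by
# writing metadata that references the existing files in place.

def snapshot_migrate(source_files):
    """Build new-table metadata that simply points at the existing files."""
    manifest = [{"path": path, "size_bytes": size, "copied": False}
                for path, size in source_files.items()]
    return {"format": "iceberg-like-toy", "manifest": manifest}

# Existing Hive-style layout: path -> file size. Contents are never read.
hive_files = {"/warehouse/t/p=1/f1.parquet": 1024,
              "/warehouse/t/p=2/f2.parquet": 2048}
meta = snapshot_migrate(hive_files)
bytes_copied = 0  # the whole migration is a metadata-only operation
```

Because the cost is proportional to the number of files rather than the number of bytes, this is what makes migrating petabyte tables feasible.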
What’s Coming Next
So what’s coming next for Apple and the Iceberg project? The things that we’ve been looking at most frequently are new use cases. And those new use cases are almost always focused on performance. Suddenly, people are beginning to say, “Well, you were able to do all these things with row-level operations, and that’s pretty amazing. But we see all of these other proprietary features in other systems, like secondary indexing and materialized views. Can we make those a part of Apache Iceberg as well?” So we’re devoting a lot of time right now to trying to make those things core parts of the Apache Iceberg project so that we can start, again, enabling use cases that previously were unthinkable or just basically impossible to do even with proprietary technologies. In addition, we’ve been looking a lot at how to better incorporate statistics information.
Currently, you might be aware that Trino is able to generate Puffin files with NDV stats, but those stats are not usable by Spark. So one of the things we’re looking at is how we can improve that on the Spark side and have those incorporated as well. And finally, one of the new areas that we’re really excited about is geotypes. So geotypes and geopartitioning are widely in use at Apple, and we would really like to make them, again, a first-class feature of Apache Iceberg so we can get even more adoption for those use cases without the need for users to build their own geotypes or their own abstractions over Apache Iceberg tables. And then last but not least is encryption. As you may know, Apple is an extremely privacy-focused company. We think it’s a very high priority for ourselves and for our users, and we want to make sure that you can use native encryption with your Iceberg tables so that even if your storage layer is compromised, your Iceberg data will still be safe.
So again, this is all things we’re working on now. Encryption is actually almost done, and I’m really hoping it’ll be in the next release of Iceberg. We’ll see how that goes. But one of the key things I wanted to bring up about all of these projects and all of these future directions is that we’re really interested in having all of these work with the community. We want everyone in the community to help us out with this because we know any future we build in collaboration is going to last.
So my last call-out to everyone here is that we would love to work with you. Talk with us on the Apache Iceberg Slack. If you heard about any of these projects and think they’re interesting, reach out to us. We’d love to collaborate. We’d love to improve this project together and make sure that Apache Iceberg is successful, not just for Apple, but for everyone in the industry.