March 1, 2023
10:10 am - 10:40 am PST
Migrating from Delta Lake to Iceberg
In this session, Scott Teal from Snowflake will describe advantages of Apache Iceberg over Delta Lake as a table format, and walk through approaches and considerations for migrating tables.
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Thanks, everyone, for tuning into this session. I'm excited whenever I get the chance to talk about Iceberg. Thanks to the Dremio team for having me here today. At this conference, the Subsurface Data Lakehouse Conference, you're going to hear lots about the data lakehouse and table formats, specifically about Apache Iceberg. And the chances are that you've probably come across, heard of, or are even using another table format in the form of Delta Lake. My goal is that after this introductory-level session, you're going to walk away with a basic understanding of the different approaches to converting from Delta Lake to Apache Iceberg, and some ideas to consider for each.
We should be able to see the slide. Here's an overview of how we're going to go about doing that today. First, a quick recap. Most of you probably know about Delta Lake and Iceberg, but in case you don't, we'll do a quick summary of the two. If you're interested in doing a deeper dive on either of the table formats, there's lots of information and documentation online, and also other sessions at this conference. We'll go through a summary of the different strategies you can take to convert Delta tables into Iceberg format. I'll share a simple hands-on demo of these different strategies, and I'll save a few minutes at the end for some live Q&A.
Before we dive in, the first thing I want to cover is why we're talking about these things, and specifically about converting Delta Lake into Iceberg. Delta Lake and Iceberg are both really interesting technologies. Each of them has advantages and disadvantages, and each is great for different use cases. Ultimately, you should be choosing the tools, the table formats, and the architecture that best serve your needs and use cases. This session is going to be most valuable for people who are learning about Iceberg, find it intriguing, and are curious what a potential migration could look like, or those who are thinking about using Delta Lake but also want to make sure that you have an exit strategy in the future if needed. First, let's walk through the two different table formats.
Two Different Table Formats
So, like I said, some of this stuff you might already know, but just to level set on how Delta Lake works. Every modification to a table (inserts, updates, deletes) essentially adds and removes files, updates metadata, and generates what's called a new commit to keep track of how the table is changing. Those commits are stored in what's called a Delta log, in JSON file format. After a few commits, Delta Lake summarizes this into what's called a checkpoint file, in Parquet file format. All of these commits are essentially metadata describing the changes to the actual underlying data files, which are stored as Parquet. In an architecture without a catalog like Glue or Hive Metastore, engines rely on the folder structure to locate and address tables in order to operate on them with reads and writes.
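The commit mechanics described above can be sketched in a few lines of plain Python. This is just a toy model of the idea, not Delta Lake's actual file layout; the file names here are made up:

```python
# Toy sketch of a Delta log: each commit is a list of add/remove
# actions, and replaying the commits in order yields the table's
# current ("live") set of data files.
commits = [
    # commit 0: initial write adds two Parquet files
    [{"add": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0001.parquet"}}],
    # commit 1: an overwrite removes one file and adds another
    [{"remove": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0002.parquet"}}],
]

def replay(log):
    """Replay commits in order to compute the current set of data files."""
    live = set()
    for actions in log:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

print(sorted(replay(commits)))  # → ['part-0001.parquet', 'part-0002.parquet']
```

The point is just that the table is defined by metadata, not by whatever files happen to sit in the folder.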
Iceberg is similar to Delta Lake in that both are trying to create an abstraction in order to provide ACID transactions, time travel, pruning, and all the metadata operations. Let me just describe what Iceberg does and maybe how it's a little bit different. With Iceberg, there are multiple layers of metadata. At the top, in this figure on the right here, there's the catalog to keep track of tables and their locations, to make sure that engines know where tables are located and how to address them. Then there are metadata files that keep track of the table schema, partitioning, properties, and snapshots. There's the manifest list, which keeps track of the various manifests, the partition stats, and file counts for pruning and query execution. Then there's the next layer down, which is manifest files, which are essentially just lists of data files, partitions, and summary stats.
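A toy model of those layers (made-up names and a heavily simplified structure, not Iceberg's real file formats) shows why the manifest list matters for pruning:

```python
# Toy model of Iceberg's metadata layers:
# metadata file -> manifest list -> manifests -> data files.
table_metadata = {
    "schema": ["id"],
    "current_snapshot": {
        "manifest_list": [
            # The manifest list carries partition stats per manifest,
            # so whole manifests can be skipped without opening them.
            {"partition_range": (0, 4), "manifest": ["a.parquet", "b.parquet"]},
            {"partition_range": (5, 9), "manifest": ["c.parquet"]},
        ]
    },
}

def files_for_partition(meta, value):
    """Use manifest-list stats to prune, then list matching data files."""
    files = []
    for entry in meta["current_snapshot"]["manifest_list"]:
        lo, hi = entry["partition_range"]
        if lo <= value <= hi:  # skip manifests whose stats rule them out
            files.extend(entry["manifest"])
    return files

print(files_for_partition(table_metadata, 7))  # → ['c.parquet']
```

A query for partition value 7 never touches the first manifest or its data files; that is the pruning the manifest-list stats enable.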
Then at the lowest level are the actual data files; if you were to query a table, these hold the actual fields and rows and columns and values. But, like I said, Delta Lake and Iceberg are both similar in that they use metadata to create some abstraction from the data files into tables. That's how they're able to provide ACID transactions, time travel, and schema evolution. That said, in my opinion there are a few advantages to using Iceberg. One is that Iceberg doesn't depend on the table's physical layout. Iceberg tables can evolve partition schemes over time through metadata-only operations. Another is that Iceberg keeps track of relationships between partitions, columns, and transformations.
You can filter by a timestamp value even though the table is partitioned by date; this functionality is what the Iceberg community has come to know as hidden partitioning. And lastly, Iceberg is an example of a really thriving open source project. If your use case calls for an open source table format, you probably want to implement an architecture that you can trust for the foreseeable future, right? The ecosystem around Iceberg is already quite large and rapidly growing. It's a signal of the shelf life this project is going to have, and the longevity it's going to have, as a meaningful and important part of your overall data architecture. For that reason, so much credit is owed to the founders, the Iceberg community, the contributors, and the ASF for supporting this project.
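To illustrate the idea behind hidden partitioning (a simplified stand-in, not Iceberg's actual implementation), a transform like `days` maps a timestamp to a day ordinal, so a filter on the raw timestamp column can be translated into a partition filter without the user ever naming a partition column:

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def days(ts):
    """Simplified stand-in for Iceberg's `days` partition transform:
    map a timestamp to its day ordinal since the Unix epoch."""
    return (ts - EPOCH).days

# The user filters on the timestamp itself; the engine derives the
# matching partition value, so the partitioning stays "hidden".
query_ts = datetime(2023, 3, 1, 10, 10, tzinfo=timezone.utc)
partition_value = days(query_ts)
print(partition_value)  # the day-ordinal partition the planner prunes on
```

Because Iceberg records the column-to-transform relationship in metadata, the transform (and thus the partition scheme) can change later without breaking queries.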
Converting From Delta Lake Table to Iceberg
That's just an overview of the two different table formats. Let's walk through why and how you might convert from a Delta Lake table to Iceberg. Credit here to Alex Merced, who's on the Dremio team; I've seen some of his content and think he does a really good job of summarizing a lot of these concepts. Here's a high-level process of what a conversion would look like. Phase one is the status quo: if you're using Delta Lake today, you're reading and writing to Delta tables. In phase two, you create an Iceberg target table so that all the new writes go to both the Delta and the Iceberg tables to get them synchronized. For the time being, you're still using your Delta table for the production reads.
Phase three is essentially flipping the switch: both of the tables, the Delta and the Iceberg table, are synced up, and all the reads now point to the Iceberg table, but underneath you're still writing to both the Delta and the Iceberg tables, checking for quality assurance. The Delta table is there in case there's any need to roll back from the new Iceberg table. In phase four, everything looks good. You only need to write to the Iceberg table now; all of your reads are already happening on the Iceberg table. Let's take a closer look at phase two specifically, for how to actually create the Iceberg table from the existing Delta table. There are a couple of different strategies for this table conversion. Each method has its trade-offs, and it's important that you decide what matches the needs of your use case.
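Here's a toy, in-memory simulation of that four-phase cutover. Plain lists stand in for the real tables, and every name here is made up for illustration:

```python
# Toy simulation of the dual-write cutover: lists stand in for
# the real Delta and Iceberg tables.
delta_table, iceberg_table = [], []

def write(row, phase):
    """Phase 1 writes Delta only; phases 2-3 dual-write; phase 4 Iceberg only."""
    if phase in (1, 2, 3):
        delta_table.append(row)
    if phase in (2, 3, 4):
        iceberg_table.append(row)

def read(phase):
    """Reads point at Delta until the phase-3 switch, then at Iceberg."""
    return delta_table if phase in (1, 2) else iceberg_table

write({"id": 1}, phase=1)          # phase 1: status quo, Delta only
iceberg_table.extend(delta_table)  # phase 2 starts by backfilling Iceberg
write({"id": 2}, phase=2)          # dual-write, reads still on Delta
write({"id": 3}, phase=3)          # dual-write, reads now on Iceberg
print(read(3))                     # reads are served from the Iceberg side
```

Keeping the Delta table written through phase 3 is what makes the rollback path cheap: flipping reads back costs nothing because both copies stayed in sync.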
Different Strategies for Table Conversions
The first, in the top row, is where you're actually copying and rewriting data. Dan Weeks, one of the Iceberg co-creators, has called this the brute force method, and Alex Merced calls it the shadow migration: you're basically reading the entire Delta table and writing it out as a brand new Iceberg table. Now, it may seem less than ideal to pay for the compute involved in writing all those new data files, all those Parquet files being rewritten into the Iceberg table, but as part of that, you get all of the column IDs that are generated. What that provides is all the awesome benefits of Iceberg: you get clean, reliable schema evolution, partition evolution, hidden partitioning. You get all that stuff very easily. Overall, this one is a very clean approach.
The second is what Alex calls the in-place migration, where you're not copying the underlying data files and you're just using the Parquet files as they are from the Delta table. Pretty obviously, the nice thing is that you don't have to rewrite all those Parquet files, but the downside is also that you didn't rewrite all those Parquet files. You saved some of that money upfront on the compute, but it could lead to headaches and pains later without the optimizations that Iceberg can provide. Plus, you don't have that clean column ID mapping, which is what enables safe schema and partition evolution and all those metadata operations with Iceberg. This has mainly focused on the conversion of data files, but like I covered in the Iceberg diagram earlier, there's also that catalog component, right?
Let's talk quickly about catalogs in this migration. I'm not going to do a deep dive on the different catalogs and compare which one to use when. I simply want to provide a high-level, summarized list of the different options, all of which is to illustrate that there are lots of options with Iceberg, and you have options with Delta Lake as well. If you're using Hive Metastore with Delta Lake, you can stick with Hive Metastore with Iceberg. Or if you're using Glue Catalog with Delta, you can stick with Glue Catalog with Iceberg. If you aren't using a catalog with Delta, there are lots of options to choose from in terms of which catalog to implement. I will also call out that the Hadoop catalog with Iceberg essentially relies on the file system.
It's about as close as you can get to having no catalog, like with Delta Lake, which also relies on the file system; if you're not using one of these catalogs, Hadoop is probably the closest on the Iceberg side. With all of these Iceberg catalogs, implementations can be inconsistent from one to another, and you want to make sure that accessing Iceberg tables from different clients follows the same rules. To address this, the Iceberg community has created what's called a REST catalog implementation. More to come on this. I know that the community is hard at work on it, and I would expect to see the Iceberg community focus on this into the future to get that consistent experience.
Delta Lake Table Into an Iceberg Table: CTAS Method
Now we're at the demo portion. I will play a quick little demo showing these different conversion strategies in action. In this demo, I'm going to show you how to convert a Delta Lake table into an Iceberg table. Let's first take a look at what tools I'm using. First, you see here I have an S3 bucket located in AWS us-west-2. Like I mentioned before, there are some pretty handy procedures with Apache Spark and Iceberg. I'm going to be using EMR, an EMR cluster version 6.9.0, running Spark version 3.3.0. I've configured my cluster to have Delta and Iceberg enabled, using their default properties, or sorry, their default versions. Those versions happen to be Delta Lake 2.1.0 and Iceberg 0.14.1. And I'm going to be using the Iceberg SQL extensions.
I've disabled Delta's safety check, which will allow me to vacuum the entire Delta table to get only the latest version of the Parquet files; you'll see where that comes into play later. For simplicity, I'm just going to be using a catalog of type Hadoop. I'm calling the catalog demo. I've specified that the warehouse for the Iceberg tables is located at this S3 path. With that, let's jump over to the notebook, where I'm actually going to be running the Spark jobs. I already have the Spark application running, so we don't have to wait on that. As you probably noticed just a moment ago, this S3 bucket is empty. I don't currently have a Delta table to convert into an Iceberg table.
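Spelled out, a session configuration along these lines would look roughly like the following. These are the standard Spark conf keys for Iceberg's Spark catalog, the SQL extensions, and Delta's retention check, but the warehouse path here is a placeholder, not the demo's actual bucket:

```python
# Sketch of the Spark session configuration described above.
# The warehouse path is a placeholder.
spark_conf = {
    # Enable both the Iceberg and Delta SQL extensions
    "spark.sql.extensions": (
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "io.delta.sql.DeltaSparkSessionExtension"
    ),
    # Register an Iceberg catalog named "demo" backed by the file system
    "spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.demo.type": "hadoop",
    "spark.sql.catalog.demo.warehouse": "s3://my-bucket/iceberg-tables/",
    # Disable Delta's safety check so VACUUM can use a 0-hour retention
    "spark.databricks.delta.retentionDurationCheck.enabled": "false",
}
print(spark_conf["spark.sql.catalog.demo.type"])
```

On EMR these would typically be set through the cluster's Spark configuration rather than in code.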
Let's just create a simple little Delta table, using the range method to create a dataframe called data. Then I'm writing the contents of that dataframe out to a table, saving it in Delta format to this delta-table prefix. Spark is working away; it created that dataframe and wrote it to a Delta table at that location. Now, let's just read the table. Here I am reading the table: I'm specifying that, hey, this table's in Delta format and it's at this location, and let's see what the contents look like. Simple little table, one column, a few integers. Great, just like we expected. In case you didn't believe me before, here I just refreshed the bucket. Here we have all the data and metadata: a couple of Parquet files, and the transaction log for the Delta table.
Great. As you'll see later in method two with add_files, we'll need to get to the latest version of the Delta table. Just for the sake of simulating a new transaction, or multiple transactions, to a Delta table, I'm again going to be using the range method. Let's actually modify this to range from 0 to 10. I have the range method, I'm putting that into a dataframe, and I am writing the contents of this dataframe to the Delta table, same S3 location, and I'm specifying to overwrite the table.
While this is working away: as you saw a moment ago, there was just one JSON file, and now there are two, which means that there's a new transaction. If I go back, there are more Parquet files, just as expected. Just so you can see that I can actually read the table, let's see what the table looks like. There you have it, same table, a few more integers. Great. We have the Delta table in S3. Now we want to convert it into an Iceberg table. We already saw that there are multiple commits in the Delta log, two JSON files, and we already read the table. What we're going to do is read the Delta table into a dataframe that I'm just calling df_delta. Then I'm writing out df_delta to the table at this location, which is the demo catalog, a database just called db, and a table named iceberg_ctas.
I need to create that table, so I'm using the create method. Like I described earlier, this is reading the full Delta table in its latest version and writing it out to a brand new table in Iceberg format. Just to show that it looks exactly the same as when we were reading in Delta format: now I'm reading from this location, but in Iceberg format. Just to prove it to you, if I go back to the bucket, there's this new iceberg-tables prefix. If I click through, I have the db database and the iceberg_ctas table, and there are my data files in Parquet format, and there are my metadata files. Great.
I just showed you in the S3 user interface, but I could also query this table's files location and basically read out all of the file paths for the Parquet files for the respective Iceberg table. Here we have it: there's iceberg-tables, db the database, iceberg_ctas the table, and those are the paths of all the Parquet files for that table. To summarize that method: I created a few Parquet files and created a table in Delta format. I read the contents of that entire Delta table, and I wrote them out to a brand new Iceberg table, with a brand new set of Parquet files in this location.
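Condensed, the CTAS-style flow from this demo comes down to a few lines of PySpark. This is only a sketch: the function is defined but not run here, it assumes a Spark session configured with a demo catalog as described earlier, and the table and path names are placeholders:

```python
def ctas_migrate(spark, delta_path, iceberg_table="demo.db.iceberg_ctas"):
    """Shadow migration sketch: read the full Delta table and rewrite
    it as a brand new Iceberg table (new Parquet files, new column IDs).

    Assumes `spark` is a session with Delta and an Iceberg catalog
    named "demo" configured; names are placeholders.
    """
    # Read the latest version of the Delta table
    df_delta = spark.read.format("delta").load(delta_path)
    # Write it out as a brand new Iceberg table via DataFrameWriterV2
    df_delta.writeTo(iceberg_table).using("iceberg").create()
    # Read back through the Iceberg catalog to verify the contents
    return spark.table(iceberg_table)
```

On a configured cluster you would call it as `ctas_migrate(spark, "s3://my-bucket/delta-table/")`.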
Add Files to an Iceberg Table
Now, what if I didn't want to write out brand new Parquet files? What if I was more cost conscious, or cared less about schema and partition evolution in the future?
What if I just wanted a simple method to say, hey, I already have Parquet files from my Delta table; I would love it if I could just layer Iceberg metadata on top of that. To do that, first, like I said, I need to have those Parquet files as part of the Delta table. I still have that df_delta dataframe, and I'm going to use that dataframe to write a new table. I'm calling it iceberg_add_files. I'm creating a new table and basically just reading the header row; I'm essentially copying the schema as defined from this dataframe and, from that, creating a basically empty Iceberg table. I can even show it to you if I do a spark.read.format("iceberg").load at this location that I just specified.
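That schema-copy step can be sketched like so. Again, the function is only defined, not run; it assumes the same configured session, the names are placeholders, and `limit(0)` is one way (an assumption, not necessarily the demo's exact code) to carry over just the schema:

```python
def create_empty_iceberg(spark, delta_path,
                         iceberg_table="demo.db.iceberg_add_files"):
    """Create an empty Iceberg table whose schema is copied from the
    Delta table, ready to have existing Parquet files added to it.

    Assumes `spark` has Delta and a "demo" Iceberg catalog configured;
    names are placeholders.
    """
    df_delta = spark.read.format("delta").load(delta_path)
    # limit(0) keeps the schema but drops every row, so the create
    # writes metadata only and no data files
    df_delta.limit(0).writeTo(iceberg_table).using("iceberg").create()
```

The payoff comes next: because the table exists but is empty, add_files can point its metadata at the existing Parquet files.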
Just an empty table, no records, just an empty table with a field called id. If you recall, from the Delta table we have a few transactions; if I go into the Delta log, there are a few transactions. Imagine if I had lots and lots of transactions, which means that Delta is writing more and more Parquet files. Well, I want to make sure that I'm not adding old Parquet files into the new Iceberg table. What I need to do is vacuum the Delta table, and I am going to vacuum at this location. I'll go ahead and kick this off. I'm specifying a retention period of zero, which is basically telling Spark, hey, I only need the latest version of the table, so clean out all the other Parquet files that aren't part of that latest version.
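The vacuum itself is a one-line Spark SQL statement; this little helper just builds it (the path is a placeholder, and actually running it requires Delta's retention check to be disabled, as configured earlier):

```python
def vacuum_stmt(delta_path, retain_hours=0):
    """Build the Delta VACUUM statement that removes every Parquet file
    not referenced by the latest version of the table."""
    return f"VACUUM delta.`{delta_path}` RETAIN {retain_hours} HOURS"

stmt = vacuum_stmt("s3://my-bucket/delta-table/")
print(stmt)  # pass the statement to spark.sql(...) on the cluster
```

Note that a zero-hour retention destroys the ability to time travel to older Delta versions, which is exactly the trade being made here: only the latest version's files should survive.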
Later, when I'm adding those files to the Iceberg metadata, when I'm pointing Iceberg to those Parquet files, this makes sure that I don't have any extraneous Parquet files in there. Here I vacuumed the table. Now let's go back and check out what is probably half the number of Parquet files, which is a good sign; it's an indicator that the vacuum worked like I wanted it to. Now we have this procedure called add_files that I described earlier. I'll go ahead and kick this off. I'm telling this procedure that the table I want to add files to is at this location, that empty table I created earlier called iceberg_add_files, and that the source of the files I want to add is at this S3 location, in Parquet format.
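The call the demo kicks off follows the documented shape of Iceberg's add_files Spark procedure; this helper just builds the statement, with placeholder catalog, table, and path names:

```python
def add_files_stmt(target_table, parquet_path, catalog="demo"):
    """Build the CALL for Iceberg's add_files procedure, which lays
    Iceberg metadata over existing Parquet files without rewriting them.

    All names here are placeholders for illustration.
    """
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{target_table}', "
        f"source_table => '`parquet`.`{parquet_path}`')"
    )

stmt = add_files_stmt("db.iceberg_add_files", "s3://my-bucket/delta-table/")
print(stmt)  # pass the statement to spark.sql(...) on the cluster
```

The `source_table` argument is what tells the procedure to treat the bare Parquet files at that path as the input, which is why vacuuming away stale files first matters so much.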
That procedure ran, and then I also just did a simple SQL select on this add-files table, which shows me the results. Here are the results, just like they looked from that df_delta dataframe, in the latest version of that dataframe, not the initial version where we just had a few integers. Now, I know it's maybe not terribly obvious and you're just trusting me that the Parquet files are still in the Delta path and that we didn't write out new Parquet files. We can verify that in a couple of ways. One, let's go check out the UI. So, in S3 I have iceberg-tables, and there's my db database. Let's check out iceberg_add_files. Notice that there's only metadata; there's not a data folder like there was in my other Iceberg table, iceberg_ctas.
There's not a data folder, right? I can also verify that by doing a select on the files themselves, just like I did earlier, where I had these Parquet files at this location. What I would expect to see here is that the location for these files is actually, as expected, the Delta table location. I'm reading the table in Iceberg format, but I did not rewrite the underlying Parquet files at all; I'm using the files at these locations. There you have it. There's a quick, very simple demo where we showed the two methods: one where we did the CTAS method, reading the entire Delta table and writing it out as an Iceberg table, and a second method where I didn't rewrite any of the Parquet files and just added files to an Iceberg table. Now let's jump back over to the slides to close this out. All righty, so that was a fun demo, and that's pretty much all we have time for today.