March 1, 2023

9:35 am - 10:05 am EST

How to Migrate Your Tables to Apache Iceberg

An Iceberg-based data lakehouse has benefits not just in query performance and cost, but in scalability, consistency, and more. Migrating your existing data lake or data lakehouse to Iceberg doesn’t have to be difficult, but there are some important decisions and considerations. This presentation aims to give you a high-level roadmap for planning a successful Apache Iceberg migration journey:
– Why migrate to Apache Iceberg
– Which Catalog should I use
– How to migrate data in-place
– How to migrate by restating my data
– How to plan the phases of my migration

Topics Covered

Iceberg


Transcript

Alex Merced:

I’m going to be presenting today on the topic of “How to Migrate Your Tables to Apache Iceberg.” As you can see, here at Dremio I wear many hats, being both the moderator and the speaker right now. What I want to talk about today is how to migrate your tables to Apache Iceberg. Who am I? My name is Alex Merced. I’m a developer advocate at Dremio. I create a lot of content around things like Apache Iceberg, Apache Arrow, and the data lakehouse in general. You’ve seen me in our weekly broadcast, Gnarly Data Waves, which, if you haven’t subscribed on iTunes or Spotify, you should. Also, as you heard, I’m one of the authors of the upcoming book, Apache Iceberg: The Definitive Guide, so look forward to that.

Today we’re going to be talking about several things: migrating to Apache Iceberg, not just the how, but the why, and where we’re coming from. Yes, we want to go to Apache Iceberg, but why, and where are we coming from? We’re going to have that discussion today. We’ll discuss what the old standard was, which was Apache Hive. We’ll talk about what the problems with that old standard were, why we want to migrate to Iceberg, and then how to do it. When we talk about how to do it, we’ll talk about how to do it from anywhere, not just from Apache Hive. If you have Delta Lake tables you want to turn into Iceberg, or Hudi tables you want to turn into Apache Iceberg, you will learn how by the end of this presentation.

Before the Modern Era of the Data Lakehouse

First, let’s just discuss where we were before. Before the modern era of the data lakehouse, you had data lakes, and you had this need to interact with your data on the data lake as individual tables. You may have a dataset that’s a thousand Parquet files or a thousand ORC files. You didn’t want to have to sit there and say “these files are the dataset” every single time you wanted to run a query. So, what would you do? You needed a table format, and that’s where Hive came in. They came up with a way to be able to say that these thousand files, these 10,000 files, whatever number of files, are one singular dataset. That was the Apache Hive format. The way they did it, though, was by taking a directory and saying, hey, all the files in this directory, well, that’s a table.

That worked out really well. You could sit there and say, okay, I want to query this table, and it knows that all the files in this folder are a table. This did come with several problems, along with some pros. The pros are that this was the best approach they had at the time, and it became the de facto standard. You see Apache Hive tables everywhere. It did enable practices like partitioning on the data lake, which was nice. It was file format agnostic, and you could atomically update whole partitions. These were all good things, and it did give us consistency. It gave us one answer to the question of what the table is. But when it came to more granular changes, like updating a row, you’d have to update the whole partition. Smaller updates were inefficient. Concurrent updates were really difficult. Having to list all these directories and subdirectories for partitioning would really slow down queries. There were all these imperfections with that approach.

We flash forward to today. We’ve moved to the era of the modern data format. You’ve heard of Apache Iceberg, Apache Hudi, and Delta Lake, and at a high level, they all give you the same main benefits: the ability to recognize a group of files as a table on your data lake, ACID transactions so that you can transact atomically on your data lake, and the ability to do time travel.

What is Apache Iceberg

There are reasons you might specifically want to migrate to Apache Iceberg. Again, what Apache Iceberg does differently is that it really breaks away from that directory-based format. It doesn’t matter where your files are with Apache Iceberg; what your table is, is defined by a tree of metadata, which has three layers. You have the top layer, which defines the table; those are the metadata files. You have the middle layer, the snapshot files or manifest lists, which define individual snapshots of the table. Then every snapshot is made up of many manifests, which are groups of files that make up the table at that point in time. Each of these layers carries metadata that allows engines to query that data more efficiently and faster. That’s how it achieves its performance. Because the manifests list exactly where each file is, it doesn’t matter whether they’re all in the same folder. That provides huge benefits. We’ll talk about some of the benefits that are unique to Apache Iceberg now.
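To make that layering a bit more concrete, here is a minimal sketch of inspecting the metadata tree from Spark using Iceberg’s metadata tables. The catalog name my_catalog and the table db.events are hypothetical, and it assumes a Spark session already configured with the Iceberg runtime for that catalog.

```python
# Minimal sketch: peeking at Iceberg's metadata layers via its metadata tables.
# Assumes an existing Spark session configured with an Iceberg catalog named
# "my_catalog" and a hypothetical table my_catalog.db.events.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Snapshot layer: one row per table snapshot (each points at a manifest list).
spark.sql("SELECT snapshot_id, committed_at, operation FROM my_catalog.db.events.snapshots").show()

# Manifest layer: the manifest files that group data files for each snapshot.
spark.sql("SELECT path, added_data_files_count FROM my_catalog.db.events.manifests").show()

# File layer: individual data files, with per-file stats engines use to prune.
spark.sql("SELECT file_path, record_count FROM my_catalog.db.events.files").show()
```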

Unique Benefits to Apache Iceberg

This enables certain features that can’t be enabled in formats that rely on that directory structure, such as partition evolution, because we don’t care what folder the files are in for them to be partitioned. All the partitioning data is actually tracked in the metadata. It just says, okay, hey, this column with this transform applied, that’s how this table is partitioned. It doesn’t matter what folders the files are in, so you can actually have partition evolution, meaning you can change the way the table is being partitioned anytime you want. You might have data that was partitioned by year, and now you want to partition it by month. You can change that, and all data going forward is partitioned by month without having to rewrite the old data, which was partitioned by year. You don’t have to rewrite petabytes of data just because you decided, hey, my partitioning needs have changed. That’s pretty awesome. That’s enabled because, again, the metadata does the work, not the physical structure or the physical location of the files.
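As a hedged sketch of what partition evolution looks like in practice: with the Iceberg Spark SQL extensions enabled, it is a metadata-only ALTER TABLE. The table and column names below are hypothetical, and it assumes the table is currently partitioned by year.

```python
# Minimal sketch of partition evolution: change how NEW data is partitioned
# without rewriting old files. Assumes a Spark session with the Iceberg SQL
# extensions enabled (spark.sql.extensions =
# org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions) and a
# hypothetical table my_catalog.db.events currently partitioned by years(ts).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stop partitioning incoming data by year...
spark.sql("ALTER TABLE my_catalog.db.events DROP PARTITION FIELD years(ts)")

# ...and start partitioning it by month. Existing year-partitioned files stay
# where they are; only the table metadata changes.
spark.sql("ALTER TABLE my_catalog.db.events ADD PARTITION FIELD months(ts)")
```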

Compatibility mode. One of the issues when you have your data on object stores like S3 is that you might get throttling when too many requests hit files under the same prefix or folder. That can cause a slowdown, because you’re getting throttled requests. Well, you avoid that with Apache Iceberg, because it has a compatibility mode that, instead of distributing your files over folders for each partition, creates a bunch of hashed folders. The files for individual partitions might be spread across ten, twenty, thirty folders or prefixes, which reduces that partition throttling. You don’t have that issue.
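If I’m reading the speaker right, this compatibility mode corresponds to Iceberg’s object storage layout, which is switched on with a table property. A minimal, hedged sketch, with a hypothetical table name:

```python
# Minimal sketch: enable Iceberg's object storage layout so data file paths
# include a hashed component, spreading a partition's files across many S3
# prefixes and reducing prefix throttling. Table name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('write.object-storage.enabled' = 'true')
""")
```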

You also have hidden partitioning. The metadata is what tracks the partitioning, and the metadata can track a function or transform that transforms the column data, so you don’t necessarily have to create an additional field. In the past, if you wanted to partition the data by day, you would have to create an additional field beyond your timestamp columns, and then you would have to actually include those columns in your queries. If you’re the end consumer and you don’t know that the table is architected in this way, you could accidentally do a full table scan, and that would be a much slower query. So you have these features that are unique to Apache Iceberg. Not only are the features reasons to migrate, but there’s also the tooling. Apache Iceberg has really been embraced by a lot of tools. You have the ability to read Apache Iceberg tables, and the ability to write Apache Iceberg tables, from a lot of different tools.
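Here is a minimal sketch of hidden partitioning, using hypothetical table and column names: the partition is declared as a transform of the timestamp column, and consumers filter on that column directly.

```python
# Minimal sketch of hidden partitioning: no extra "partition column" for
# consumers to remember. Assumes an Iceberg-enabled Spark session; names
# are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE my_catalog.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    )
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# Filter on the original timestamp column; Iceberg maps the predicate to the
# day partitions, so there is no accidental full table scan from forgetting a
# separate date column.
spark.sql("""
    SELECT count(*) FROM my_catalog.db.events
    WHERE ts >= TIMESTAMP '2023-03-01 00:00:00'
""").show()
```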

Apache Iceberg Support Growing

Pretty much name the tool. It pretty much supports Iceberg at some level today, and that’s only growing. You have that wide ecosystem of tools you can use with Apache Iceberg. As you saw in today’s keynote, there’s the whole idea of data as code. You have the Dremio Arctic catalog, which is powered by the open source Project Nessie technology, and that’s going to allow you to do branching and merging and all this kind of isolation and version control with your Iceberg tables. That’s versioning the Iceberg catalog. This gives you a whole other level of benefit that you don’t necessarily get right now, at the catalog level, with the other formats. Then there’s just the openness of the project itself. The Apache Iceberg project is super transparent. They have community meetings every other week that anyone can join.

If you want to know the direction of the project or what features they’re developing, you can just sit in on those meetings or read the notes, and the actual notes or minutes from those meetings are in a document that’s easily accessible. There’s a Slack channel where you can actually see the discussions on development, and discussions about helping people implement Apache Iceberg. Everything is very, very visible. You have a wide variety of contributors. There’s no particular company that controls the project; Apple, Netflix, Tabular, and Dremio are all contributing to the project, to everyone’s benefit. It’s a real community project. You have the features, you have the tooling, you have the openness. These are all reasons why, in particular, you would want to migrate to Apache Iceberg. Now that we know what Apache Iceberg is and why we’d want to migrate to Apache Iceberg, let’s talk about how.

Choosing Your Iceberg Catalog

The first step in your migration journey is choosing an Iceberg catalog. Because essentially, for any tool that you’re going to use to access your Iceberg tables, generally its first stop is whatever your catalog is. It’s going to go to your catalog and say, hey, where is this particular table located? Each catalog has its different pros and cons. Let’s talk about those pros and cons.

There’s Project Nessie, the open source project that actually powers Dremio Arctic’s data-as-code features, like the whole branching and merging. Some pros of using Project Nessie as your Iceberg catalog: one, you get that Git-like functionality. You have the ability to do branching and merging, which enables all sorts of cool things like multi-table transactions and the ability to roll back your entire catalog instead of having to roll back each table individually. The cool thing about it is that it’s really easy to adopt. You can just go on the Dremio Cloud platform, and in a couple of clicks you have an Arctic catalog. Literally, it’s just a couple of clicks and you have a catalog. It can connect via Spark. It can connect via Flink. It can connect via Dremio. You can connect via Presto, and hopefully soon Trino; there’s a pull request pending on the Trino project. You can use all your favorite tools to access your Iceberg catalog and get those Git-like features. You have a cloud managed service, making it really easy to deploy.

The only con is that if you’re not using Dremio Arctic, then you would have to stand up your own Project Nessie server. That’s also an option. It’s an open source project. You can stand up your own server and get those features.
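For a sense of what adopting Nessie as the catalog looks like from an engine, here is a hedged sketch of wiring a Spark session to a Nessie-backed Iceberg catalog. The catalog name, URI, and warehouse path are hypothetical, and the exact jars you need depend on your Spark, Iceberg, and Nessie versions.

```python
# Minimal sketch: a Spark session using Project Nessie as its Iceberg catalog.
# URI, bucket, and catalog name are hypothetical; assumes the Iceberg Spark
# runtime and Nessie client are on the classpath for your versions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-catalog-sketch")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "https://nessie.example.com/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")  # the Nessie branch to work on
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Tables created under this catalog are tracked, branched, and versioned by Nessie.
spark.sql("CREATE TABLE IF NOT EXISTS nessie.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
```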

Hive Metastore

Another option is Hive Metastore. Now, Hive Metastore makes a lot of sense if you already use Hive Metastore. In that case, you don’t have to stand up another service or worry about another service. You can just use your existing Hive Metastore as your catalog for your Iceberg tables. The con is that if you don’t have Hive Metastore, then you’d actually have to set it up, in which case you might as well just use Project Nessie if you’re going to stand up a server anyway. But it is an option. Then there’s AWS Glue.

Several Catalog Options

AWS Glue is basically a catalog built into AWS services. The awesome thing about using AWS Glue as a catalog is that it interoperates very well with all of AWS’s other services. If you are using Athena, if you’re using Redshift, if you’re using all these other AWS tools, then AWS Glue makes a lot of sense as your Iceberg catalog. If you’re not using a lot of AWS tools, then maybe not, and maybe one of the other options is a better choice. There are several other options, such as JDBC, where you can use any database as your catalog, HDFS, DynamoDB, and the new REST catalog. The only issue, always, is engine compatibility. Just because a tool supports Iceberg doesn’t necessarily mean it supports connecting via whatever catalog you use. Generally, you want to choose one of the more widely supported catalogs.

That way you make sure that as many tools are available to you as possible. Every day more tools support more catalogs. Eventually, we’ll hit a point where most tools support most catalogs and this becomes more of an afterthought. It is, though, the first major decision you need to make in that migration journey, and it will determine which tools you can use.
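To illustrate that switching catalogs is largely a configuration swap for the engine, here is the same kind of sketch pointed at AWS Glue instead. Names and paths are hypothetical, and the AWS-side jars and credentials are assumed to be set up per the Iceberg AWS documentation.

```python
# Minimal sketch: a Spark session using AWS Glue as its Iceberg catalog.
# Bucket and namespace names are hypothetical; assumes the Iceberg Spark
# runtime plus the AWS bundle are on the classpath and credentials are set.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-catalog-sketch")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-bucket/warehouse")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Iceberg tables in this catalog appear in Glue, where other AWS services
# that understand Iceberg can pick them up.
spark.sql("SHOW TABLES IN glue.db").show()
```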

Moving Data – In Place Migration

Now, once you’ve decided what your catalog is going to be, the next matter is to start moving all your data to tables in that catalog. How do I do that? Well, there are generally two strategies for doing this. You can do an In Place Migration, which just means taking all those Parquet files you already have and reusing them. No need to write new ones; just add them to an Iceberg table. Which is great, because there’s a lot less work involved. You don’t have to rewrite any data. That’s the beauty of Iceberg: because Iceberg doesn’t depend on the file location, you can just point to the files that are already somewhere else as part of your table, without having to rewrite them into a different folder. Which is really nice.

Moving Data – Shadow Migration

Now, the other option is a Shadow Migration, where you rewrite all those tables. Which is a fine option, but again, it’s going to cost more, because if you’re rewriting petabytes of data, that’s a lot of compute. But there are some benefits. We’ll talk about these different options and the different ways you can go about them.

Now, in an In Place Migration, you’re reusing the files you already have. We’re going to assume that all your data is in Parquet files. First scenario: you have a Hive table and you want to migrate that. That’s already built into the Iceberg libraries. There’s a procedure called migrate. What you can do is just say, hey, I have this Hive table right here.

Then it will just turn that Hive table into an Iceberg table in your catalog. It will not rewrite any of the files. It’ll just write new metadata across the files you already have. You avoid all the rewriting of data. Super easy. The only downside, or the only thing to be careful of, is that when you run the migrate procedure, it does destroy the Hive table; the Hive table won’t be accessible anymore. Generally, you want to test it out first. The way you test it out is you use the snapshot procedure, and these are all Spark procedures that are part of the Iceberg Spark libraries. As Dremio announced this morning, there are new features in Dremio, like COPY INTO and VACUUM, that allow you to do similar things for different types of maintenance.

These are specifically Spark procedures in the Spark libraries for Iceberg. Migrate will turn that Hive table into an Iceberg table. Before you do that, you want to use the snapshot procedure. That will create a read-only version without destroying the Hive table. That way you can see if the output is right. Like, hey, is this actually what I expected? Then, if it all looks good, you go back and run the migrate procedure. That way you can test it out before you make that leap, and you don’t accidentally destroy your Hive table and realize, oh, this didn’t work out the way I expected. That would be bad. That would be the way for a Hive table. That is pretty straightforward. What if it’s not a Hive table? What if I have a whole bunch of Parquet files that are not using any table format?
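Here is a minimal sketch of that snapshot-then-migrate flow with the Iceberg Spark procedures. It assumes the Hive table db.sales (a hypothetical name) is reachable through an Iceberg-enabled session catalog (for example, org.apache.iceberg.spark.SparkSessionCatalog) with the Iceberg SQL extensions turned on.

```python
# Minimal sketch: test an in-place migration with snapshot, then commit with
# migrate. Table names are hypothetical; assumes spark_catalog is configured
# as an Iceberg SparkSessionCatalog over the existing Hive Metastore and the
# Iceberg SQL extensions are enabled.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Dry run: create a read-only Iceberg table from the Hive table's files,
#    leaving the original Hive table untouched.
spark.sql("CALL spark_catalog.system.snapshot('db.sales', 'db.sales_iceberg_test')")

# Check that the snapshot table looks the way you expect before committing.
spark.sql("SELECT * FROM db.sales_iceberg_test LIMIT 10").show()

# 2. If it looks right, migrate in place: the Hive table is replaced by an
#    Iceberg table that reuses the existing data files (no rewrites).
spark.sql("CALL spark_catalog.system.migrate('db.sales')")
```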

Add_Files Procedure In Place Migration

What if I’m using Hudi? What if I’m using Delta Lake? Well, you have what’s called the add_files procedure. The add_files procedure just takes whatever files you point to and writes metadata to include them in your table. It doesn’t matter where the files are located, and it doesn’t matter what table format they might have been associated with before, as long as your Iceberg table has a matching schema. First, you would create an empty Iceberg table with the matching schema. Then you can just run this procedure and say, hey, this folder of Parquet files, add that to this table. Going forward, those files are part of the table. The metadata is written; no data files are rewritten. The only thing to keep in mind is that if you’re talking about a Delta Lake or a Hudi table, then what you’d probably want to do is make sure that you only have the current snapshot in the folder.

You would probably expire all previous snapshots and then run an operation to clean up all the other Parquet files, so that the only Parquet files left are the current snapshot. That way, when you add them, you have the current state of the table. This is going to get a lot easier, because just recently they merged something new.
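A minimal sketch of the add_files flow described above, with hypothetical table names, schema, and paths, assuming the Iceberg SQL extensions are enabled for a catalog named my_catalog:

```python
# Minimal sketch: in-place migration of loose Parquet files with add_files.
# Names, schema, and S3 path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1. Create an empty Iceberg table whose schema matches the existing files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.db.sales (
        id BIGINT,
        ts TIMESTAMP,
        amount DOUBLE
    )
    USING iceberg
""")

# 2. Register the existing Parquet files with the new table. Only Iceberg
#    metadata is written; the data files stay exactly where they are.
spark.sql("""
    CALL my_catalog.system.add_files(
        table => 'db.sales',
        source_table => '`parquet`.`s3a://my-bucket/legacy/sales/`'
    )
""")
```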

New Delta Lake Module

They just merged this Delta Lake module into the Iceberg code base. Going forward, there’s actually a built-in way to sit there and say, hey, this Delta Lake table, make it an Iceberg table. This should probably show up in the next Iceberg release, since it’s been merged into the repository. Essentially, right now you just have this function that you can see here at the bottom. You just pass it the Spark session, you pass it the identifier of your new Iceberg table, and then you pass it the location of your Delta Lake table.

Then there it is; it just takes the current snapshot and makes it your Iceberg table. I think they’re going to continue adding to it so that eventually you can capture the whole history as well. Right now, it’ll just take that current snapshot and turn it into an Iceberg table.

Upcoming Hudi Module

But, even cooler, guess what? They’re also doing the same thing for Hudi, with the exact same API. It’s going to take the exact same arguments. This one is currently a work in progress, but eventually, that’ll just be baked into the Iceberg libraries. Whether you’re coming from a Hive table, a Hudi table, or a Delta Lake table, you’ll be able to just run a function and move it over. That’s going to get easier very, very soon. That is how you would do an In Place Migration, taking the existing files and moving them over.

Shadow Migration

How would I do a Shadow Migration, where I just rewrite the data? Why would I want to do a Shadow Migration if I don’t have to rewrite all the files? Well, maybe the files you have now aren’t the files you want. Maybe the schema is not exactly what you wanted. Maybe you would rather re-sort or re-cluster the data so that it’s more performant in the new table. If you have to rewrite the data, or you choose to rewrite the data, you might as well change the partitioning and the schema to make it exactly what you want. That’s the nice thing about a Shadow Migration.

Migration by Restating the Data

Since you’re rewriting the data, you can make those kinds of changes. This would typically be done through a CTAS, or CREATE TABLE…AS, statement. You would just do a CREATE TABLE…AS statement like the one I have here. What I’m doing here is basically changing the partitioning. Since I’m rewriting the table, I’m going to partition using a 16-way bucket on one column, partition by days on the timestamp column, and also partition by a truncate transform on the last name column, based on the first two characters. I can do that because I’m rewriting the data anyway, so I might as well take advantage of it. That’s one nice thing about a Shadow Migration, but the catch is, you are rewriting all the data.
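Here is a minimal sketch of that kind of CTAS, with hypothetical table and column names. It assumes the old table is readable from the same Iceberg-enabled Spark session; note that the argument order of the truncate transform has varied across Spark/Iceberg versions, so check the docs for yours.

```python
# Minimal sketch: shadow migration via CTAS, restating the data with new
# partitioning. Names are hypothetical; truncate argument order may differ
# by version.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE my_catalog.db.sales_iceberg
    USING iceberg
    PARTITIONED BY (
        bucket(16, id),          -- hash id into 16 buckets
        days(ts),                -- partition by day of the timestamp
        truncate(2, last_name)   -- partition by the first two characters of last_name
    )
    AS SELECT * FROM legacy_db.sales
""")
```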

Best Practices

Now that we’ve seen the two options, you can either keep all the data you have and do an In Place Migration, with all those different options for how you’re going to pull that off, or you can do a Shadow Migration and rewrite the data through a CTAS statement. But in general, I now need to design the actual steps of doing that. I’ve seen how to do the actual movement, but I need to create a strategy for how I’m going to phase in from the old table to the new table.

When To Use Which Approach

Let’s talk a little bit about that. First off, we’ll just reiterate the different options you have. First, you have the migrate option. That’s going to be best if you have a Hive table, because it’s currently designed for Hive tables. Basically, if you have a Hive table, it’ll just move it right over. Right now, if you have anything else, add_files is going to be your best option for an In Place Migration. You just take the Parquet files, regardless of whether they’re in no table format at all or any other table format, and add them manually to your Iceberg table. Use CTAS if you’re doing a Shadow Migration. The benefit of In Place, again, is that you don’t have to rewrite data files, but you can’t change the schema or anything like that when you use the same data files. The CTAS benefit is that you can change your schema and your partitioning, but you are rewriting your data files.

So again, you have the cost of doing that. One other thing I want to point out is that migration is not just about migrating from where you are now to Iceberg. As Iceberg catalogs proliferate, there’s also going to be the question of what if I want to switch catalogs? Right now it might make sense for you to use Hive because you have an existing Hive Metastore, but you see things like Dremio Arctic and Project Nessie and you think, eventually I would like to use that, I want those data-as-code features. How would I do that later? Well, there’s this register table operation that’s built into the Catalog class in the Java API. What it does is you just say, hey, I have this catalog; I would just create an instance of my Project Nessie catalog and say, hey, I have this table in my Hive catalog.

Here’s the location of the metadata file, and it will basically write it as a new table in there. No files get rewritten. It just takes that same table and literally registers it with that other catalog. You can now use it with that other catalog just as you would if you had created it in that catalog in the first place. They’re making that even easier. Instead of having to write a script in Java (and that one is hard to say, because “Java script” kind of implies something else), in the future you’ll be able to use a CLI tool. It’s actually being written by someone at Dremio. You’ll be able to do all of that from a CLI tool, to move your tables from one catalog to another catalog: from Hive to Project Nessie, from Hive to Glue, or even to Hive, whatever you want to do.
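The transcript describes the Java Catalog.registerTable API; as a hedged alternative sketch, newer Iceberg releases also expose a register_table Spark procedure that does the same thing. The catalog name, table name, and metadata file path below are hypothetical, and this assumes your Iceberg version ships that procedure.

```python
# Hedged sketch: registering an existing Iceberg table with a second catalog
# (here, a Nessie-backed catalog named "nessie") by pointing it at the table's
# current metadata file. All names and the path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CALL nessie.system.register_table(
        table => 'db.sales',
        metadata_file => 's3a://my-bucket/warehouse/db/sales/metadata/00005-example.metadata.json'
    )
""")
# No data or metadata files are rewritten; the table is simply known to the
# new catalog from this point on.
```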

This whole catalog question becomes less and less of an issue as all these technologies mature. We’ll come back to questions in just a moment. Our last piece here is that you’re going to want to schedule this in phases. Before the migration starts, you read from your old table, because you know your old table is still working, and you still write to your old table. Then we begin the migration process. The first step is to create that new table, and every time new data comes in, write to both the old and the new table, because you still want the old table until you know you can move off of it. So in phase two, all your reads, all your readers, should still be reading the old table, whether that’s the Hive table, the Delta Lake table, or the Hudi table, but all your writes should be writing to both the old table and the Iceberg table.

Once you’re pretty comfortable that they’re in sync, and the writers have kept these things in sync, you would move into phase three. Now you switch over all your readers to only read from the Iceberg table, but you’d still write to both tables, so if any issues come up, you still have that backup, that old table, just in case. Once you’re a hundred percent sure everything’s running smoothly, you can support it, and it’s all good, then you can pull the plug on that old table, move on, and basically save all that storage and get rid of all the stuff you don’t need anymore. Generally, you want to think in those phases and think about how you’re going to execute each of them.

Recap

To recap. The first thing you want to do is decide, I want to migrate to Iceberg. Two, you need to decide which catalog you’re going to use. Once you decide which catalog you’re going to use, you can decide how you’re going to add your tables to that catalog, either In Place or Shadow. Then you put together your step-by-step plan to actually go through these phases. So that’s sort of the journey and the thought process of migrating to Iceberg. With that, thanks. Now it’s time for questions. Again, my name is Alex Merced, that’s my email. If I can’t get to every question today, just make sure to email me and also follow me on Twitter @amdatalakehouse.
