Option 1: In-Place Data Migration of Hive Tables
In-place data migration avoids rewriting the data. Instead, you write new Apache Iceberg tables that comprise the already existing data files in your data lake.
by Alex Merced, Developer Advocate at Dremio
So, you’ve heard about the trend of data lakes migrating to Apache Iceberg as their table format. You’ve heard how Apache Iceberg can enable ACID transactions, partition evolution, schema evolution, time travel, version rollback, concurrent writing, and more on your data lake. If you aren’t familiar with Apache Iceberg, read this great deep dive into its architecture.
This all sounds great, but all your existing data lake tables are currently in the legacy Hive format, and you’d like to know how best to migrate this data. In this article, I hope to illuminate the differences between the two migration options, in-place migration and shadow migration, along with the benefits and drawbacks of each approach:
After reading this article laying out the principles of migration, make sure to check out this article with a hands-on exercise to see these steps in action.
To allow for this kind of migration, Iceberg has several procedures built into its Spark extensions. Which procedure you should use depends on whether you are using a Spark catalog.
Use the migrate procedure to write a new Iceberg table from the source table’s data files. The new Iceberg table uses the same data files as the source Hive table. Ideally, you’d first create a temporary test table using the snapshot procedure, then run the migrate procedure when you’re ready to migrate. The source data files must be Parquet, Avro, or ORC.
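As a rough sketch, the two procedures are invoked like this from Spark SQL (the catalog name `spark_catalog` and table names `db.sales` and `db.sales_iceberg_test` are placeholders for your own):

```sql
-- First, a non-destructive trial run: snapshot creates a temporary
-- Iceberg test table over the Hive table's existing data files.
CALL spark_catalog.system.snapshot(
  source_table => 'db.sales',
  table       => 'db.sales_iceberg_test'
);

-- Once validated, migrate converts the Hive table itself to Iceberg,
-- reusing its data files rather than rewriting them.
CALL spark_catalog.system.migrate('db.sales');
```

Note that migrate replaces the source table in the catalog, which is why a snapshot-based dry run first is the safer path.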
Use the add_files procedure. This procedure registers the existing data files of the source Hive table in an existing Iceberg table, creating a new snapshot that includes those files. The source Hive table and the destination Iceberg table will both refer to the same data files.
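A minimal invocation looks like the following, again with placeholder table names; the destination Iceberg table must already exist:

```sql
-- Register the Hive table's existing data files in an Iceberg table.
-- No data is copied or rewritten; a new snapshot records the files.
CALL spark_catalog.system.add_files(
  table        => 'db.sales_iceberg',
  source_table => 'db.sales'
);
```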
Option 2: Shadow Data Migration of Hive Tables
With a shadow migration, you are not only creating a new Iceberg table and all the metadata that comes with it, but also restating all the data files. This new table is known as the shadow. It becomes the main table once all data is in sync. For larger datasets, you may want to restate them in batches. This approach has its own set of pros and cons.
For smaller datasets, restating a table in whole can easily be done with CTAS statements. For larger datasets, you may want to restate an initial batch of data with a CTAS statement and follow up with batched inserts until the entire source is ingested into the new table.
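The CTAS-plus-batched-inserts pattern can be sketched as follows; the table names and the `sale_date` batching column are hypothetical stand-ins for your own schema:

```sql
-- Restate an initial batch of data into a new Iceberg table.
CREATE TABLE db.sales_iceberg
USING iceberg
AS SELECT * FROM db.sales
WHERE sale_date < DATE '2023-01-01';

-- Follow up with batched inserts until the entire source is ingested.
INSERT INTO db.sales_iceberg
SELECT * FROM db.sales
WHERE sale_date >= DATE '2023-01-01'
  AND sale_date <  DATE '2023-07-01';
```

Batching by a partition-friendly column like a date keeps each write small and makes it easy to track and retry failed batches.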
Whether you restate the data with a shadow migration or avoid restating the data with an in-place migration, the actual migration process needs careful architecture.
In general, it should go through a four-phase process:
As you progress through these phases, be sure to check for consistency, log errors, and log progress. These checks help you manage quality, troubleshoot problems, and measure how long the migration takes.
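One simple consistency check, assuming the placeholder table names used above, is comparing row counts between the source and destination after each batch:

```sql
-- Row counts should match once the tables are fully in sync;
-- a mismatch flags a batch to investigate before cutting over.
SELECT
  (SELECT COUNT(*) FROM db.sales)         AS source_count,
  (SELECT COUNT(*) FROM db.sales_iceberg) AS target_count;
```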
While Iceberg’s procedures and CTAS statements give you a simple path to migrating existing Hive tables, the true art is in architecting your migration plan to minimize or eliminate disruption. Follow the advice in this article to have a solid migration plan and you’ll be enjoying the benefits of Apache Iceberg on your data lake in no time.