
21 minute read · August 8, 2024

Migration Guide for Apache Iceberg Lakehouses

Alex Merced · Senior Tech Evangelist, Dremio

When optimizing our data infrastructure for cost-efficiency, ease of use, and business value, we often find that adopting new, valuable patterns is challenging because of migration obstacles. These challenges include:

  • How do I move my data from format/location A to format/location B?
  • How do I make these movements with minimal disruption to consumers?

In this guide, we will provide an in-depth look at designing your approach for migrating to an Apache Iceberg Lakehouse. Here are links to a few of our previous migration articles for reference.

Why Migrate to An Apache Iceberg Data Lakehouse?

Data Lakehouse architecture transforms your data lake storage into a tool similar to a database or data warehouse with SQL-enabled tables and semantics. This is achieved by taking datasets in analytics-optimized file types like Apache Parquet and writing metadata to group those files into singular tables. This metadata layer is known as a data lakehouse table format, a category that includes Apache Iceberg, Apache Hudi, Apache Paimon, and Delta Lake.

By migrating to a data lakehouse architecture with any table format, you gain:

  • A database-like table that multiple tools can access
  • ACID guarantees
  • Time travel
  • Schema evolution

With Apache Iceberg specifically, you receive additional valuable benefits such as hidden partitioning, partition evolution, and a broad ecosystem of engines and tools that support the format.

Apache Iceberg has been growing by leaps and bounds as an open-source project, gaining significant industry mindshare, and is quickly becoming the industry standard for data lakehouse architecture.

First Step - Choosing an Apache Iceberg Catalog

Before we begin discussing how to move your data into Iceberg tables, you need to decide what you will use to catalog those tables. An Iceberg catalog essentially works as a table lookup for your favorite tools, matching table names with references to the location of the table's latest metadata, ensuring consistency and concurrency. Here are a few articles on Apache Iceberg catalogs to help you better understand this concept.
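
To make this concrete, below is a minimal Spark SQL sketch. The catalog name iceberg_catalog and the table names are illustrative, and it assumes a Spark session already configured with an Iceberg catalog; the catalog is what resolves the table name to its latest metadata file, and Iceberg's metadata tables let you inspect that history.

CREATE TABLE iceberg_catalog.db.sales (
  id BIGINT,
  region STRING,
  amount DOUBLE
) USING iceberg;

-- The catalog maps "iceberg_catalog.db.sales" to the table's latest metadata file;
-- the metadata_log_entries metadata table lists the metadata files written over time.
SELECT timestamp, file
FROM iceberg_catalog.db.sales.metadata_log_entries;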

Some of the primary options available for reading and writing Apache Iceberg tables across multiple tools include Hive Metastore, AWS Glue, Nessie, and REST-based catalog services.

Your initial research should involve hands-on experience with these different catalog options to determine which best fits your workflows, security needs, and desired ergonomics. Switching catalogs later is not a significant hurdle, as the Nessie project has created a catalog migration CLI tool that can migrate your list of tables to a new catalog with a single command.

Two Approaches to Moving Your Data into Apache Iceberg Tables

When migrating data into Apache Iceberg tables, you have two main approaches: using your existing data files, known as an "in-place" migration, or rewriting all the data files, called a shadow migration.

Whichever approach you take, using a "blue/green" migration strategy executed in phases is advisable. A blue/green migration strategy is a deployment approach that minimizes downtime and reduces risks during the migration process. In this strategy, two identical environments are maintained, referred to as the "blue" and "green" environments. The blue environment is the current production environment that users interact with, while the green environment is a duplicate setup where the new changes or data migrations are implemented. By running both environments in parallel, you can thoroughly test the new configuration in the green environment without affecting the live production system.

Once the green environment is fully validated and all tests have passed, traffic is gradually redirected from the blue environment to the green environment. This can be done incrementally, allowing for close monitoring and quick rollback to the blue environment if any issues arise. This approach ensures a seamless transition with minimal end-user disruption, as they experience no noticeable downtime. By carefully managing the migration in phases, you can maintain system stability and identify and resolve any potential issues before the green environment becomes the new production standard.

 +----------------+      +----------------+
 | Blue (Current) |      | Green (New)    |
 |  Environment   |      |  Environment   |
 +----------------+      +----------------+
          |                     |
          |                     |
          V                     V
 +---------------------------------------+
 |               Traffic                 |
 |                Split                  |
 +---------------------------------------+
          |                     |
          |                     |
          V                     V
 +----------------+      +----------------+
 | Blue (Current) |      | Green (New)    |
 |  Environment   |      |  Environment   |
 +----------------+      +----------------+
          |                     |
          |                     |
          V                     V
 +---------------------------------------+
 |          Monitor & Validate           |
 +---------------------------------------+
          |                     |
          |                     |
          V                     V
 +----------------+      +----------------+
 | Blue (Current) |      | Green (New)    |
 |  Environment   |      |  Environment   |
 +----------------+      +----------------+
          |                     |
          |                     |
          V                     V
 +---------------------------------------+
 |       Switch Traffic to Green         |
 +---------------------------------------+
                          |
                          |
                          V
 +----------------+
 | Green (New)    |
 |  Environment   |
 +----------------+

Now, let's examine how we will move the data.

In-Place Migration

An in-place migration allows you to avoid rewriting your existing Parquet files and instead add the existing files to an Apache Iceberg table. This can be done in a few different ways:

Using Spark and the "add_files" Procedure

When migrating data into Apache Iceberg tables, one method is to use the add_files procedure in Spark. This approach allows you to add existing Parquet files directly to an Iceberg table without rewriting them, making it an efficient solution for in-place migrations.

The add_files procedure works by incorporating files from a Hive or file-based table into a specified Iceberg table. Unlike other procedures, add_files can target specific partitions and does not create a new Iceberg table. Instead, it generates metadata for the new files while leaving the files in their original location. This means the procedure will not validate the schema of the files against the Iceberg table, so ensuring schema compatibility is essential to avoid issues.

One of the key benefits of using the add_files procedure is that it allows the Iceberg table to treat the newly added files as part of its dataset. This enables subsequent Iceberg operations, such as expire_snapshot, to manage these files, including their potential deletion. It's a powerful feature, but one that requires careful consideration, especially if schema validation is not enforced.

Practical Examples

To illustrate, let's consider a scenario where you need to add files from a Hive or Spark table registered in the session catalog to an Iceberg table. Suppose you only want to add files within specific partitions where region equals NorthWest:

CALL iceberg_catalog.system.add_files(
  table => 'db.sales',
  source_table => 'sales_legacy',
  partition_filter => map('region', 'NorthWest')
);
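
After running add_files, you can sanity-check what was registered by querying Iceberg's metadata tables (a rough sketch, reusing the db.sales name from the example above). The files metadata table lists the data files the table now tracks, even though those files remain in their original location:

-- Data files now tracked by the Iceberg table
SELECT file_path, record_count, file_size_in_bytes
FROM iceberg_catalog.db.sales.files;

-- The snapshot created by the add_files operation
SELECT snapshot_id, operation, summary
FROM iceberg_catalog.db.sales.snapshots;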

Using Spark and the "migrate" Procedure

One method for doing this is through the migrate procedure, which replaces an existing table with an Iceberg table, using the source table's data files. This procedure ensures that your table schema, partitioning, properties, and location are all preserved during the migration process.

The migrate procedure is designed to facilitate migration by copying over the original Hive table's metadata and creating a new Iceberg table with all the necessary mappings. This process ensures that the original data files are integrated into the Iceberg table, allowing seamless access and functionality. By default, the original table is kept as a backup, renamed with the suffix _BACKUP_, providing a safety net during the migration process. This is particularly useful for maintaining data integrity and allowing for a rollback if needed.

Practical Examples

To migrate a table named db.sales in Spark's default catalog to an Iceberg table and add a custom property, the command would look like this:

CALL iceberg_catalog.system.migrate('db.sales', map('migrated_by', 'legacy_hive'));
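
As a rough post-migration sanity check (assuming the default backup naming described above), you can compare row counts between the migrated Iceberg table and the backup of the original table:

-- Row counts should match between the migrated table and its backup
SELECT 'iceberg' AS source, COUNT(*) AS row_count FROM db.sales
UNION ALL
SELECT 'backup' AS source, COUNT(*) AS row_count FROM db.sales_BACKUP_;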

Using Apache XTable

When it comes to enhancing interoperability between different table formats without moving or copying the underlying data files, Apache XTable™ (Incubating) is an available tool. Apache XTable allows you to seamlessly sync your source tables into various target formats. This enables you to expose a table written in one format, such as Delta Lake, as an Iceberg table while maintaining a similar commit history for accurate point-in-time queries.

Using XTable involves creating a configuration file like the one below (this is an example; in practice, you would also pass your catalog information):

sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: file:///tmp/delta-sales-dataset/sales
    tableDataPath: file:///tmp/delta-sales-dataset/sales/data
    tableName: sales

Then, run Apache XTable against that table:

java -jar xtable-utilities/target/xtable-utilities-0.1.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

This would output Apache Iceberg metadata to enable querying the table as an Apache Iceberg table.
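
If the Iceberg metadata is written to storage rather than synced directly into your catalog, one way to make the table visible to your tools is Iceberg's register_table Spark procedure. This is a hedged sketch; the catalog, table name, and metadata file path are illustrative:

-- Register existing Iceberg metadata with a catalog
CALL iceberg_catalog.system.register_table(
  table => 'db.sales',
  metadata_file => 'file:///tmp/delta-sales-dataset/sales/metadata/v1.metadata.json'
);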

A Consideration with In-Place Migration

One major consideration with an in-place migration is that the original files were not written with an Apache Iceberg Parquet writer. This is significant because Apache Iceberg assigns a unique ID to every column, and these ID mappings are typically embedded within the Parquet files. This ensures consistent column references, even if columns are renamed or evolve over time.

When you add files that were not written with these IDs, you may encounter varying levels of read support from different engines. This is because some engines may rely on the presence of these column IDs to correctly map data file columns to the metadata columns. Engines that do not have built-in capabilities to fall back on using column names for this mapping might experience issues. Therefore, it's crucial to understand how the engines in your ecosystem handle this fallback mechanism to ensure seamless data operations.
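
One mitigation worth knowing about is Iceberg's schema.name-mapping.default table property, which gives engines an explicit name-to-field-ID mapping to fall back on when data files lack embedded column IDs. The sketch below uses illustrative column names and IDs:

-- Provide a fallback name mapping for files written without Iceberg field IDs
ALTER TABLE iceberg_catalog.db.sales SET TBLPROPERTIES (
  'schema.name-mapping.default' = '[{"field-id": 1, "names": ["id"]}, {"field-id": 2, "names": ["region"]}, {"field-id": 3, "names": ["amount"]}]'
);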

Shadow Migration

In a shadow migration, you rewrite the data, allowing you to use any engine that supports writing to Apache Iceberg tables with a simple CREATE TABLE AS (CTAS) statement. This approach enables you to migrate data from any source that the engine supports reading from. For instance, in Dremio, you can use CTAS to migrate data from databases, data lakes, and data warehouses into Apache Iceberg tables. By rewriting the data, you have the opportunity to rethink your schema, clustering, and partitioning to ensure that the new table is perfectly optimized from the start.
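
As a hedged illustration (catalog, table, and column names here are hypothetical), a shadow migration can be as simple as a CTAS statement that reads from the old source and writes a new, optimized Iceberg table, shown here in Spark SQL:

-- Rewrite the data into a new Iceberg table, choosing partitioning up front
CREATE TABLE iceberg_catalog.db.sales_iceberg
USING iceberg
PARTITIONED BY (months(sale_date))
AS SELECT * FROM legacy_db.sales;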

An added benefit of using Dremio for shadow migration is its ability to migrate data from JSON and CSV files effortlessly. By utilizing the COPY INTO command, Dremio ensures that all data is correctly mapped into the proper schema. This feature, along with other Dremio Iceberg ingestion patterns, is covered in the following articles, providing you with comprehensive guidance on effective data migration.
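
As a rough sketch of that pattern (the table name, source name, and path are hypothetical), COPY INTO loads files from a lake location into an existing Iceberg table, mapping the data into the table's schema:

-- Load CSV files from a lake location into an existing Iceberg table in Dremio
COPY INTO sales_iceberg
FROM '@sales_landing/raw/sales/'
FILE_FORMAT 'csv';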

How Dremio Can Make Your Migration Journey Easier

Dremio not only facilitates the migration process by providing tools for writing to Apache Iceberg tables but also serves as a uniform interface between your old and new systems. This minimizes friction and change management with consumers, as they can continue using the same tool to access data before and after the migration, provided the old system is compatible with Dremio.

Steps to Streamline Migration with Dremio

  1. Connect the Old System to Dremio: Begin by connecting your existing data sources to Dremio. Dremio supports a wide range of data sources, including traditional databases, data lakes, and data warehouses, allowing you to integrate your old system seamlessly.
  2. Get Consumers Used to Dremio's Interface: Introduce your consumers to Dremio's intuitive and easy-to-use interface. This step is crucial as it helps them become familiar with Dremio's functionalities, reducing the learning curve and easing the transition.
  3. Connect the New System to Dremio: Once consumers are comfortable with Dremio, connect your new system, which uses Apache Iceberg tables, to Dremio. This ensures that the same interface is used to access data from both the old and new systems, providing a consistent user experience.
  4. Progressively Migrate Workloads to the New System: Start migrating workloads to the new system incrementally. This progressive approach allows for careful monitoring and troubleshooting, ensuring that any issues are addressed without major disruptions to the overall workflow.
  5. Retire the Old System Upon Completion: After successfully migrating all workloads and verifying that the new system operates smoothly, you can retire the old system. By this point, consumers will be fully accustomed to accessing data through Dremio, making the final transition seamless.

Benefits of Using Dremio for Migration

By leveraging Dremio as a uniform interface, you minimize the complexities and disruptions typically associated with migration projects. Consumers benefit from a consistent user experience, while the underlying data infrastructure transitions smoothly from the old system to the new Apache Iceberg-based system. This approach not only simplifies change management but also ensures continuity in data access and operations.

Conclusion

Migrating to an Apache Iceberg Lakehouse enhances data infrastructure with cost-efficiency, ease of use, and business value, despite the inherent challenges. By adopting a data lakehouse architecture, you gain benefits like ACID guarantees, time travel, and schema evolution, with Apache Iceberg offering unique advantages. Selecting the right catalog and choosing between in-place or shadow migration approaches, supported by a blue/green strategy, ensures a smooth transition. Tools like Dremio simplify migration, providing a uniform interface between old and new systems, minimizing disruptions and easing change management. Leveraging Dremio's capabilities, such as CTAS and COPY INTO, alongside Apache XTable, ensures an optimized and seamless migration process, maintaining consistent user experience and robust data operations.

Contact us today and let Dremio help you plan your Apache Iceberg migration!
