9 minute read · August 8, 2024

Apache XTable: Converting Between Apache Iceberg, Delta Lake, and Apache Hudi

Alex Merced · Senior Tech Evangelist, Dremio

Dipankar Mazumdar · Staff Data Engineering Advocate (Onehouse)

Data lakehouse architectures, embodying the best of data lakes and warehouses, have emerged as a pivotal strategy for enterprises aiming to harness their data's full potential. Central to these architectures are table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, which introduce structured metadata layers on top of the underlying data files, typically stored in formats like Parquet. These metadata layers enable sophisticated data management features like schema evolution, time travel, and transactional integrity, which are crucial for reliable data analytics and machine learning workflows. 

However, organizations often face challenges when the tools they want to use do not support the table format their data is stored in, which creates barriers to accessing and analyzing that data efficiently. At the core of all these formats are individual Parquet files, which are inherently agnostic to the metadata layer imposed on them.

This is where Apache XTable (incubating) offers a solution: it translates the metadata layer between these table formats. Here is a description of the project from Dipankar Mazumdar, one of the project's primary evangelists: “An open-source project that facilitates omnidirectional interoperability among various lakehouse table formats. This means it is not limited to one-directional conversion but allows you to initiate conversion from any format to another without needing to copy or rewrite data.”

Apache XTable’s omnidirectional approach to table format interoperability differs from Delta Lake’s “UniForm” (Universal Format) feature, which provides one-way, read-only access to tables written in Delta Lake as Iceberg or Hudi.

How It Works

Converting a table with XTable starts with a Java archive (jar) file containing the utilities that perform the metadata translation. Users then configure a settings file that specifies the source and target table formats and the paths to the table data. Here is an example configuration file for converting from Delta Lake to Apache Iceberg:

sourceFormat: DELTA # convert from delta
targetFormats:
  - ICEBERG
datasets:
  -
    tableBasePath: s3://path/to/source/data
    tableName: 2023_sales
    partitionSpec: partitionpath:department

This configuration defines the conversion, guiding Apache XTable in translating the metadata from one format to another without altering the underlying Parquet files. The data stays intact while the metadata layer is translated, which enables migration across different data management systems. Once the configuration is ready, you run the conversion using a jar built from XTable’s source code:

java -jar xtable.jar --datasetConfig my_config.yaml
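
If you convert many tables, it can help to generate the configuration and the command from a script rather than editing YAML by hand. The following is a minimal sketch, not part of XTable itself: the helper names render_xtable_config and build_xtable_command are hypothetical, and the jar and file names are simply the ones used in the example above.

```python
import shlex
from typing import List, Optional


def render_xtable_config(source_format: str, target_formats: List[str],
                         table_base_path: str, table_name: str,
                         partition_spec: Optional[str] = None) -> str:
    """Render a dataset config using the same keys as the example above."""
    lines = [f"sourceFormat: {source_format}", "targetFormats:"]
    lines += [f"  - {fmt}" for fmt in target_formats]
    lines += [
        "datasets:",
        "  -",
        f"    tableBasePath: {table_base_path}",
        f"    tableName: {table_name}",
    ]
    if partition_spec:  # unpartitioned tables simply omit the key
        lines.append(f"    partitionSpec: {partition_spec}")
    return "\n".join(lines) + "\n"


def build_xtable_command(jar_path: str, config_path: str) -> List[str]:
    """Argument list for the java invocation shown above."""
    return ["java", "-jar", jar_path, "--datasetConfig", config_path]


config_text = render_xtable_config(
    "DELTA", ["ICEBERG"], "s3://path/to/source/data",
    "2023_sales", "partitionpath:department",
)
command = build_xtable_command("xtable.jar", "my_config.yaml")

print(config_text)
print(shlex.join(command))
# To actually run the conversion, write config_text to my_config.yaml and
# call subprocess.run(command, check=True).
```

Returning the command as a list rather than a single string keeps it safe to pass to subprocess.run without a shell.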

This utility only writes the converted metadata as files; it doesn’t automatically register that metadata with any catalog. You are free to register the resulting Apache Iceberg metadata with the catalog you prefer (the table is also written with a version-hint.text file, so it can be read using the file system catalog). Let’s try registering a table with the Dremio Lakehouse Platform’s integrated catalog and with the open source Nessie catalog, both of which can track Apache Iceberg tables and enable cutting-edge DataOps on the data lakehouse.
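
Registering the table requires the path to its current metadata file. The sketch below shows one way to derive that path from the version-hint.text file mentioned above. It assumes the file-system (Hadoop) catalog layout, where version-hint.text holds an integer N and the current metadata file is metadata/vN.metadata.json; the helper name current_metadata_path is hypothetical, and it is worth listing the metadata folder to confirm the layout before relying on it.

```python
def current_metadata_path(table_base_path: str, version_hint: str) -> str:
    """Build the metadata file path from the contents of version-hint.text.

    Assumption: file-system catalog layout (metadata/v<N>.metadata.json).
    """
    version = int(version_hint.strip())  # raises ValueError if not a number
    if version < 1:
        raise ValueError(f"unexpected version hint: {version}")
    return f"{table_base_path.rstrip('/')}/metadata/v{version}.metadata.json"


# Example with the hint text already read from <table>/metadata/version-hint.text
print(current_metadata_path("s3://path/to/source/data", "3\n"))
```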

Registering a Table with Dremio or Nessie

The simplest way to register an existing Apache Iceberg table with a catalog is to use Apache Iceberg’s “register_table” Spark procedure against the desired catalog. Since this is a lightweight operation, you can do it at zero cost by running an Apache Spark Docker container locally. (There is also a CLI Catalog Migration Tool.)

You can run this Spark 3.5 + notebook image I created to get up and running very quickly:

docker run -p 8888:8888 --env AWS_REGION=us-east-1 --env AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXX --env AWS_SECRET_ACCESS_KEY=xxxxxxx --env TOKEN=xxxxxx --env WAREHOUSE=xxxxxx --env CATALOG_URI=xxxxxx --name spark-notebook alexmerced/spark35notebook

Make sure to swap out the environment variables with the proper values:

  • AWS_ACCESS_KEY_ID: AWS access key
  • AWS_SECRET_ACCESS_KEY: AWS secret key
  • TOKEN: Your Dremio/Nessie authentication token
  • WAREHOUSE: S3 address to warehouse new tables
  • CATALOG_URI: The URI for your Dremio or Nessie catalog
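
A quick check in the first notebook cell can catch a forgotten variable before Spark starts. This is a minimal sketch under the assumption that all five variables listed above are needed; the helper name missing_env_vars is hypothetical, and for a Nessie catalog without authentication you could drop TOKEN from the list.

```python
import os
from typing import List, Mapping

# Variables from the list above; adjust for your setup
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "TOKEN",
    "WAREHOUSE",
    "CATALOG_URI",
]


def missing_env_vars(environ: Mapping[str, str],
                     required: List[str] = REQUIRED_VARS) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


missing = missing_env_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set")
```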

Once you start the container, the terminal output will show the URL for accessing the notebook. Copy that URL into your browser and create a new notebook with the following code.

For a Dremio catalog:

import pyspark
from pyspark.sql import SparkSession
import os


## DEFINE SENSITIVE VARIABLES
CATALOG_URI = os.environ.get("CATALOG_URI") ## Nessie Server URI
TOKEN = os.environ.get("TOKEN") ## Authentication Token
WAREHOUSE = os.environ.get("WAREHOUSE") ## S3 Address to Write to


conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        # Packages (the Iceberg runtime for Spark 3.5 is only published for Iceberg 1.4.0 and later)
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        # SQL extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
        # Catalog named "dremio", backed by the Nessie catalog implementation
        .set('spark.sql.catalog.dremio', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.dremio.uri', CATALOG_URI)
        .set('spark.sql.catalog.dremio.ref', 'main')
        .set('spark.sql.catalog.dremio.authentication.type', 'BEARER')
        .set('spark.sql.catalog.dremio.authentication.token', TOKEN)
        .set('spark.sql.catalog.dremio.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
        .set('spark.sql.catalog.dremio.warehouse', WAREHOUSE)
        .set('spark.sql.catalog.dremio.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

## Register Table with your Dremio Catalog
spark.sql("CALL dremio.system.register_table(table => 'db.tbl', metadata_file => 's3://path/to/metadata/file.json')").show()

For a Nessie catalog without authentication:

import pyspark
from pyspark.sql import SparkSession
import os

## DEFINE SENSITIVE VARIABLES
CATALOG_URI = os.environ.get("CATALOG_URI") ## Nessie Server URI
WAREHOUSE = os.environ.get("WAREHOUSE") ## S3 Address to Write to


conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        # Packages (the Iceberg runtime for Spark 3.5 is only published for Iceberg 1.4.0 and later)
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        # SQL extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
        # Catalog named "nessie"; every key below must use that same name
        .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.nessie.uri', CATALOG_URI)
        .set('spark.sql.catalog.nessie.ref', 'main')
        .set('spark.sql.catalog.nessie.authentication.type', 'NONE')
        .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
        .set('spark.sql.catalog.nessie.warehouse', WAREHOUSE)
        .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

## Register Table with your Nessie Catalog
spark.sql("""
CALL nessie.system.register_table(
  table => 'db.tbl',
  metadata_file => 's3://path/to/metadata/file.json'
)
""").show()
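
Both notebooks end with the same CALL statement, differing only in the catalog name. If you register several tables, a small helper can build the statement and avoid quoting mistakes in the nested string. This is a sketch only: register_table_sql is a hypothetical helper, and it refuses values containing quotes rather than attempting to escape them.

```python
def register_table_sql(catalog: str, table: str, metadata_file: str) -> str:
    """Build the register_table CALL statement used in the notebooks above."""
    for value in (catalog, table, metadata_file):
        if "'" in value or '"' in value:
            raise ValueError(f"quotes are not allowed in: {value!r}")
    return (
        f"CALL {catalog}.system.register_table("
        f"table => '{table}', "
        f"metadata_file => '{metadata_file}')"
    )


statement = register_table_sql(
    "nessie", "db.tbl", "s3://path/to/metadata/file.json"
)
print(statement)
# In the notebook: spark.sql(statement).show()
```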

Conclusion

Apache XTable offers a way to convert your existing data lakehouse tables to the format of your choice without having to rewrite all of your data. This, along with robust Iceberg DML support from Dremio, offers an additional way to easily migrate to an Apache Iceberg data lakehouse along with the catalog versioning benefits of the Dremio and Nessie catalogs.

Learn more about migrating to an Iceberg Lakehouse with this article, get hands-on experience with this tutorial, and contact us to help you start your migration today!
