Data lakehouse architectures, embodying the best of data lakes and warehouses, have emerged as a pivotal strategy for enterprises aiming to harness their data's full potential. Central to these architectures are table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, which introduce structured metadata layers on top of the underlying data files, typically stored in formats like Parquet. These metadata layers enable sophisticated data management features like schema evolution, time travel, and transactional integrity, which are crucial for reliable data analytics and machine learning workflows.
However, organizations often face challenges when the tools they wish to use do not support their data's specific table format, creating barriers to accessing and analyzing their data efficiently. At the core of all of these formats are individual Parquet files, which are inherently agnostic to the metadata layer imposed on them.
This is where Apache XTable (incubating) offers a solution: it translates the metadata layer between these table formats. Dipankar Mazumdar, one of the project's primary evangelists, describes it as "An open-source project that facilitates omnidirectional interoperability among various lakehouse table formats. This means it is not limited to one-directional conversion but allows you to initiate conversion from any format to another without needing to copy or rewrite data."
Apache XTable's omnidirectional approach to table format interoperability differs from Delta Lake's UniForm, which provides one-way, read-only access to tables written in Delta Lake as Iceberg or Hudi.
How It Works
Format conversion with XTable involves downloading a Java archive (jar) file containing the utilities needed to perform the metadata translation. Users then write a configuration file specifying the source and target table formats and the paths to the data. Here is an example of a configuration file for converting from Delta Lake to Apache Iceberg:
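A minimal configuration might look like the following sketch; the bucket, path, and table name below are placeholders, and the exact schema may vary between XTable releases, so check the project documentation for your version:

```yaml
# Convert a Delta Lake table's metadata into Apache Iceberg metadata
sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/path/to/my_table   # hypothetical location of the Delta table
    tableName: my_table
```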
This configuration defines the transformation process, guiding Apache XTable in converting the metadata from one format to another without altering the underlying Parquet files. This method ensures that the data remains intact while the metadata layer is seamlessly transitioned, enabling migration across different data management systems. Once the configurations are ready, you'll run the process using a jar you built from XTable’s source code.
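Running the conversion is a single invocation of the bundled jar against that configuration file. The jar name below is illustrative; the actual artifact name depends on the XTable version you built from source:

```shell
# Run the metadata sync described in the config file
java -jar xtable-utilities-bundled.jar --datasetConfig my_config.yaml
```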
This utility only converts metadata files and doesn't automatically register that metadata with any catalog, leaving you free to register the resulting Apache Iceberg metadata with your preferred catalog (although the table is written with the version-hint.text file so it can be read using the file system catalog). Let's try registering a table with the Dremio Lakehouse Platform's integrated catalog and the open source Nessie catalog, both of which can track Apache Iceberg tables and enable cutting-edge DataOps on the data lakehouse.
Registering a Table with Dremio or Nessie
The simplest way to register an existing Apache Iceberg table with a catalog is Apache Iceberg's "registerTable" Spark procedure. Since this is a lightweight operation, you can do it at zero cost by running an Apache Spark Docker container locally. (There is also a CLI Catalog Migration Tool.)
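The procedure takes the table identifier and the location of the table's latest Iceberg metadata file, which XTable will have written under the table's metadata directory. A sketch, assuming a Spark catalog configured under the name `nessie` and hypothetical table and metadata paths:

```sql
-- Register the XTable-produced Iceberg metadata with the catalog
CALL nessie.system.register_table(
  table => 'db.my_table',
  metadata_file => 's3://my-bucket/path/to/my_table/metadata/v2.metadata.json'
)
```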
You can run this Spark 3.5 + notebook image I created to get up and running very quickly:
Make sure to swap out the environment variables with the proper values:
AWS_ACCESS_KEY_ID: Your AWS access key
AWS_SECRET_ACCESS_KEY: Your AWS secret key
TOKEN: Your Dremio/Nessie authentication token
WAREHOUSE: The S3 path where new tables will be written
CATALOG_URI: The URI of your Dremio or Nessie catalog
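Putting those variables together, starting the container might look like the following; the image name is a placeholder for whichever Spark 3.5 + notebook image you are using, and the example values are hypothetical:

```shell
docker run -it --rm -p 8888:8888 \
  -e AWS_ACCESS_KEY_ID=<your-access-key> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-key> \
  -e TOKEN=<your-dremio-or-nessie-token> \
  -e WAREHOUSE=s3://my-bucket/warehouse/ \
  -e CATALOG_URI=https://nessie.example.com/api/v1 \
  <spark-notebook-image>
```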
Once you start the container, you'll find the URL for accessing the notebook in the terminal output. Copy it into your browser and create a new notebook with the following code.
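The notebook code boils down to two steps: build a Spark session wired up to the Nessie (or Dremio) catalog, then call the registerTable procedure. The sketch below assumes the environment variables from above are set and uses hypothetical package versions, table names, and paths; adjust them to match your environment:

```python
import os

import pyspark
from pyspark.sql import SparkSession

# Configure Spark with the Iceberg and Nessie extensions so the "nessie"
# catalog is available for SQL procedures. Package versions are examples.
conf = (
    pyspark.SparkConf()
    .setAppName("register_xtable_output")
    .set(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,"
        "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:0.77.1",
    )
    .set(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    .set("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .set("spark.sql.catalog.nessie.uri", os.environ["CATALOG_URI"])
    .set("spark.sql.catalog.nessie.authentication.type", "BEARER")
    .set("spark.sql.catalog.nessie.authentication.token", os.environ["TOKEN"])
    .set("spark.sql.catalog.nessie.warehouse", os.environ["WAREHOUSE"])
    .set("spark.sql.catalog.nessie.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Point register_table at the metadata file XTable wrote for the Iceberg copy.
spark.sql("""
  CALL nessie.system.register_table(
    table => 'db.my_table',
    metadata_file => 's3://my-bucket/path/to/my_table/metadata/v2.metadata.json'
  )
""").show()
```

After this runs, the table should appear in the catalog and be queryable from any engine connected to it, without the Parquet data files ever having been copied.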