October 3, 2024
Using Nessie’s REST Catalog Support for Working with Apache Iceberg Tables
Senior Tech Evangelist, Dremio
Project Nessie’s recent support for the Iceberg REST Catalog spec marks a significant improvement in how we interact with Iceberg catalogs across environments and languages. Historically, using a catalog required a client-side implementation for each language (Java, Python, Rust, Go), resulting in disparities in catalog support across tools and languages. With the REST catalog spec, a single client implementation per language works with every catalog that supports the spec.
For example, before the REST catalog was introduced, setting up the Nessie catalog in an Apache Spark environment meant configuring storage and catalog credentials on the client. Here's how that configuration might look in PySpark without REST catalog support:
import pyspark

# SPARK_MASTER, ARCTIC_URI, TOKEN, AWS_ACCESS_KEY, and AWS_SECRET_KEY are defined elsewhere
conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        .setMaster(SPARK_MASTER)
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.76.0,software.amazon.awssdk:bundle:2.17.178')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
        .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.nessie.uri', ARCTIC_URI)
        .set('spark.sql.catalog.nessie.ref', 'main')
        .set('spark.sql.catalog.nessie.authentication.type', 'BEARER')
        .set('spark.sql.catalog.nessie.authentication.token', TOKEN)
        .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
        .set('spark.sql.catalog.nessie.warehouse', 's3a://my-bucket/path/')
        .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
        .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
As you can see, this setup requires several steps and hardcoding of credentials, which can make it cumbersome to manage across different environments and languages.
Enter the REST Catalog
With the introduction of Nessie’s REST catalog support, the server handles many of these configuration details, offloading complexity from the client side. Instead of configuring credentials and catalog implementation details within each client, the REST catalog consolidates that logic on the server, including the handling of storage credentials.
Now, the interaction becomes much more straightforward, like in the following example with PySpark:
from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
        .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
        .config("spark.sql.catalog.nessie.type", "rest")
        .getOrCreate()
)
In this example, the complexity of managing storage credentials has shifted to the server. The client only needs the catalog URI, warehouse, and catalog type, making development faster, simpler, and more secure.
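The same endpoint isn’t limited to Spark, either. As a rough sketch, a Python client such as PyIceberg can attach to any REST-spec catalog with little more than the URI and catalog type; this assumes pyiceberg is installed and that a Nessie server is reachable at the illustrative localhost address below:

from pyiceberg.catalog import load_catalog

# Connect to the Nessie REST endpoint; the URI is illustrative and assumes a local server
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg/main/",
    },
)

# List namespaces to confirm the connection works
print(catalog.list_namespaces())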
Hands-on Exercise with Docker Compose
To illustrate how Nessie’s REST catalog simplifies the setup, let's go through an exercise using Docker Compose. We'll spin up a basic environment with Nessie, MinIO (for object storage), and Spark, and walk through configuring the catalog to work with PySpark.
Here’s a simple docker-compose.yml configuration:
version: '3'

services:
  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    container_name: nessie
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    networks:
      nessie-rest:

  minio:
    image: quay.io/minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/my-bucket --ignore-existing;
      tail -f /dev/null"
    networks:
      nessie-rest:

  spark:
    image: alexmerced/spark35nb:latest
    ports:
      - 8888:8888
      - 8080:8080
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
    networks:
      nessie-rest:

networks:
  nessie-rest:
Once this environment is running, you can open a Jupyter notebook at localhost:8888 and begin writing PySpark code. Notice that the storage credentials are configured in the Nessie service's environment variables, not in the client.
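Before starting Spark, you can optionally confirm that the Nessie service is reachable from the notebook. This is a minimal sketch, assuming the requests package is available in the notebook image and that the notebook container resolves the nessie hostname on the Compose network:

import requests

# Hit Nessie's configuration endpoint to verify the server is up
resp = requests.get("http://nessie:19120/api/v2/config")
resp.raise_for_status()
print(resp.json())  # should report the default branch, typically "main"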
For the full list of server-side options, see the Nessie Server Configuration Settings documentation.
PySpark Example with REST Catalog
After you’ve spun up the Docker environment, run the following code in your Jupyter notebook to create a Spark session and start interacting with the Nessie catalog:
from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
        .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
        .config("spark.sql.catalog.nessie.type", "rest")
        .getOrCreate()
)

# Create a namespace in Nessie
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo").show()

# Create a table in the `nessie.demo` namespace using Iceberg
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS nessie.demo.sample_table (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
).show()

# Insert data into the sample_table
spark.sql(
    """
    INSERT INTO nessie.demo.sample_table VALUES
    (1, 'Alice'),
    (2, 'Bob')
    """
).show()

# Query the data from the table
spark.sql("SELECT * FROM nessie.demo.sample_table").show()

# Stop the Spark session
spark.stop()
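Because the table now lives behind a standard REST endpoint, other clients can read it with the same catalog URI. The following is a rough sketch, assuming pyiceberg, pyarrow, and pandas are installed in the notebook environment; the s3.* entries are PyIceberg FileIO properties that mirror the MinIO endpoint and credentials from the Compose file above:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg/main/",
        # MinIO settings for reading the data files, matching the Compose file above
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1",
    },
)

# Load the table created by Spark and print its rows
table = catalog.load_table("demo.sample_table")
print(table.scan().to_arrow().to_pandas())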
Conclusion
With the introduction of the REST catalog, managing and interacting with Apache Iceberg catalogs has become much simpler. This shift from client-side configuration to server-side management offers many benefits, including better security, easier maintenance, and improved scalability.
By following this hands-on exercise, you’ve seen firsthand how much easier it is to configure and use Nessie’s REST catalog support with Spark and MinIO for object storage. The simplified architecture ensures that your data platform can scale easily, reducing complexity and improving your ability to manage data across different environments.
This is an exciting step forward for data engineers and architects working with data lakehouses, and it’s clear that Nessie is setting the standard for how metadata and catalogs are managed in the world of modern data platforms.