October 3, 2024
Using Nessie’s REST Catalog Support for Working with Apache Iceberg Tables
Senior Tech Evangelist, Dremio
Project Nessie’s recent support for the Iceberg REST Catalog spec marks a significant improvement in how we interact with Iceberg catalogs across environments and languages. Historically, using a catalog required a client-side implementation for each language (Java, Python, Rust, Go), resulting in disparities in catalog support across tools and languages. With the REST catalog spec, a single client implementation per language works with every catalog that supports the spec.
For example, before the REST catalog was introduced, setting up the Nessie catalog in an Apache Spark environment meant configuring storage and catalog credentials on the client. Here's how that configuration might look in PySpark without REST catalog support:
import pyspark

# SPARK_MASTER, ARCTIC_URI, TOKEN, AWS_ACCESS_KEY, and AWS_SECRET_KEY are defined elsewhere
conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        .setMaster(SPARK_MASTER)
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.76.0,software.amazon.awssdk:bundle:2.17.178')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
        .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.nessie.uri', ARCTIC_URI)
        .set('spark.sql.catalog.nessie.ref', 'main')
        .set('spark.sql.catalog.nessie.authentication.type', 'BEARER')
        .set('spark.sql.catalog.nessie.authentication.token', TOKEN)
        .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
        .set('spark.sql.catalog.nessie.warehouse', 's3a://my-bucket/path/')
        .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
        .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
As you can see, this setup requires several steps and hardcoding of credentials, which can make it cumbersome to manage across different environments and languages.
Enter the REST Catalog
With the introduction of Nessie’s REST catalog support, the server handles many of these configuration details, offloading complexity from the client side. Instead of configuring credentials and catalog implementation details within each client, the REST catalog consolidates that logic on the server, including the handling of storage credentials.
Now, the interaction becomes much more straightforward, like in the following example with PySpark:
from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
        .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
        .config("spark.sql.catalog.nessie.type", "rest")
        .getOrCreate()
)
In this example, the complexity of managing storage credentials has shifted to the server. The client only needs the catalog URI, warehouse, and catalog type, making development faster, simpler, and more secure.
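The same endpoint isn’t limited to Spark, either. As a rough sketch, a Python client such as PyIceberg can attach to any REST-spec catalog with little more than the URI and catalog type; this assumes pyiceberg is installed and that a Nessie server is reachable at the illustrative localhost address below:

from pyiceberg.catalog import load_catalog

# Connect to the Nessie REST endpoint; the URI is illustrative and assumes a local server
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg/main/",
    },
)

# List namespaces to confirm the connection works
print(catalog.list_namespaces())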
Hands-on Exercise with Docker Compose
To illustrate how Nessie’s REST catalog simplifies the setup, let's go through an exercise using Docker Compose. We'll spin up a basic environment with Nessie, MinIO (for object storage), and Spark, and walk through configuring the catalog to work with PySpark.
Here’s a simple docker-compose.yml configuration:
version: '3'

services:
  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    container_name: nessie
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    networks:
      nessie-rest:

  minio:
    image: quay.io/minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/my-bucket --ignore-existing;
      tail -f /dev/null"
    networks:
      nessie-rest:

  spark:
    image: alexmerced/spark35nb:latest
    ports:
      - 8888:8888
      - 8080:8080
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
    networks:
      nessie-rest:

networks:
  nessie-rest:
Once this environment is running, you can open a Jupyter notebook at localhost:8888 and begin writing PySpark code. Notice that the storage credentials are configured in the Nessie service's environment variables, not in the client.
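Before starting Spark, you can optionally confirm that the Nessie service is reachable from the notebook. This is a minimal sketch, assuming the requests package is available in the notebook image and that the notebook container resolves the nessie hostname on the Compose network:

import requests

# Hit Nessie's configuration endpoint to verify the server is up
resp = requests.get("http://nessie:19120/api/v2/config")
resp.raise_for_status()
print(resp.json())  # should report the default branch, typically "main"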
For the full list of server-side options, see the Nessie Server Configuration Settings documentation.
PySpark Example with REST Catalog
After you’ve spun up the Docker environment, run the following code in your Jupyter notebook to create a Spark session and start interacting with the Nessie catalog:
from pyspark.sql import SparkSession

# Initialize SparkSession with Nessie, Iceberg, and S3 configuration
spark = (
    SparkSession.builder.appName("Nessie-Iceberg-PySpark")
        .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0,software.amazon.awssdk:bundle:2.24.8,software.amazon.awssdk:url-connection-client:2.24.8')
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/iceberg/main/")
        .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/")
        .config("spark.sql.catalog.nessie.type", "rest")
        .getOrCreate()
)

# Create a namespace in Nessie
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.demo").show()

# Create a table in the `nessie.demo` namespace using Iceberg
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS nessie.demo.sample_table (
        id BIGINT,
        name STRING
    ) USING iceberg
    """
).show()

# Insert data into the sample_table
spark.sql(
    """
    INSERT INTO nessie.demo.sample_table VALUES
    (1, 'Alice'),
    (2, 'Bob')
    """
).show()

# Query the data from the table
spark.sql("SELECT * FROM nessie.demo.sample_table").show()

# Stop the Spark session
spark.stop()
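Because the table now lives behind a standard REST endpoint, other clients can read it with the same catalog URI. The following is a rough sketch, assuming pyiceberg, pyarrow, and pandas are installed in the notebook environment; the s3.* entries are PyIceberg FileIO properties that mirror the MinIO endpoint and credentials from the Compose file above:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg/main/",
        # MinIO settings for reading the data files, matching the Compose file above
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1",
    },
)

# Load the table created by Spark and print its rows
table = catalog.load_table("demo.sample_table")
print(table.scan().to_arrow().to_pandas())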
Conclusion
With the introduction of the REST catalog, managing and interacting with Apache Iceberg catalogs has become much simpler. This shift from client-side configuration to server-side management offers many benefits, including better security, easier maintenance, and improved scalability.
By following this hands-on exercise, you’ve seen firsthand how much easier it is to configure and use Nessie’s REST catalog support with Spark and MinIO for object storage. The simplified architecture ensures that your data platform can scale easily, reducing complexity and improving your ability to manage data across different environments.
This is an exciting step forward for data engineers and architects working with data lakehouses, and it’s clear that Nessie is setting the standard for how metadata and catalogs are managed in the world of modern data platforms.