
October 22, 2024

Hands-on with Apache Iceberg Tables Using PyIceberg, Nessie, and MinIO

Alex Merced · Senior Tech Evangelist, Dremio

Flexibility and simplicity in managing metadata catalogs and storage solutions are key to efficient data platform management. Nessie’s REST Catalog Implementation brings this flexibility by centralizing table management across multiple environments in the cloud and on-prem, while PyIceberg provides an accessible Python implementation for interacting with Iceberg tables.

In this blog, we’ll walk through setting up a local environment using Docker Compose to integrate Nessie’s REST Catalog, MinIO for object storage, and a Jupyter Notebook environment. We’ll then demonstrate connecting to the catalog and performing basic table operations using PyIceberg. By the end of this guide, you’ll have a local setup that allows you to explore Iceberg tables and manage metadata in a simple, reproducible way.

Section 1: Prerequisites

Before we dive into setting up the environment, ensure that you have the following tools installed on your system:

  1. Docker: Docker allows us to create and manage containers for the various services we’ll be using. If you don’t have Docker installed, you can download it here.
  2. Docker Compose: Docker Compose enables the management of multi-container Docker applications. Typically, Docker Compose is bundled with Docker Desktop, but you can also install it separately by following the official instructions.
  3. Basic Knowledge of PyIceberg, Nessie, and MinIO:
    • PyIceberg: A Python implementation of Iceberg, a table format for large analytics datasets. It enables you to work with Iceberg tables without needing a Java Virtual Machine (JVM).
    • Nessie: A metadata management service providing Git-like operations on data tables, with REST Catalog support to simplify catalog configurations.
    • MinIO: A high-performance object storage solution compatible with S3, which we will use as the storage backend.

With these tools in place, let’s move on to setting up the environment using Docker Compose.

Section 2: Docker Compose Setup

To create a local data environment using Nessie’s REST Catalog, MinIO for storage, and Jupyter for data exploration, we will leverage Docker Compose. This setup allows us to easily manage and orchestrate all the necessary services, ensuring a seamless workflow for working with Iceberg tables.

Step 1: Creating the Docker Compose File

First, we need to define our docker-compose.yml file, which will configure and launch the necessary services. This file will include:

  • Nessie REST Catalog: To manage metadata and Iceberg tables.
  • MinIO: As the object storage service where our Iceberg tables will reside.
  • Jupyter Notebook: To interact with Nessie and perform table operations using PyIceberg.

Below is an example docker-compose.yml configuration:

version: '3'

services:
  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    container_name: nessie
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    networks:
      - nessie-net

  minio:
    image: quay.io/minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/my-bucket --ignore-existing;
      tail -f /dev/null"
    networks:
      - nessie-net

  notebook:
    image: alexmerced/datanotebook:latest
    ports:
      - "8888:8888"
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
    networks:
      - nessie-net

networks:
  nessie-net:
 

Key Points in the Configuration:

  • Nessie Service: The Nessie REST server runs on port 19120 and is configured to store Iceberg table data in the s3://my-bucket/ path on MinIO.
  • MinIO Service: MinIO runs on ports 9000 and 9001 and is initialized with a bucket called my-bucket for storing table files.
  • Jupyter Notebook Service: This service uses the alexmerced/datanotebook image, providing a Python environment with all the necessary libraries pre-installed, including PyIceberg.

Step 2: Running Docker Compose

Once the docker-compose.yml file is ready, we can bring up the environment by running the following command:

docker-compose up

This command will start all the services defined in the Compose file. You should see output logs as each service starts up. Once everything is running, you can verify the services are working as expected:

  • MinIO Console: Access the MinIO console at http://localhost:9001 to view the object storage and confirm that the bucket my-bucket has been created.
  • Nessie REST Catalog: Nessie will be available at http://localhost:19120 (No UI).
  • Jupyter Notebook: The notebook environment can be accessed by navigating to http://localhost:8888 in your browser.

Step 3: Accessing Jupyter Notebook

When you open http://localhost:8888, you will be greeted by the Jupyter Notebook interface. No token is required to log in, as token authentication is disabled in this setup for simplicity. From here, you can start a new Python notebook and begin interacting with Nessie and MinIO through PyIceberg.
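
Before moving on, you can run a quick sanity check from a notebook cell to confirm that the notebook container can reach Nessie over the Docker network. This is a minimal sketch that assumes the requests library is available in the notebook image and uses Nessie's v2 configuration endpoint:

import requests

# Inside the Docker network, Nessie is reachable by its Compose service name
resp = requests.get("http://nessie:19120/api/v2/config")
print(resp.status_code)
print(resp.json())

A 200 response with the default branch information means the two containers can talk to each other.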

Section 3: Connecting to Nessie with PyIceberg

Now that our local environment is up and running, it's time to connect to Nessie's REST Catalog using PyIceberg from within our Jupyter notebook. This section will guide you through installing PyIceberg in the notebook environment and configuring it to interact with the Nessie catalog and MinIO storage.

Step 1: Installing PyIceberg in the Notebook

To use PyIceberg in your notebook, we need to install the necessary dependencies. Since our Jupyter environment is already running inside the alexmerced/datanotebook Docker container, open a new Python notebook and run the following command to install PyIceberg with support for S3 and SQL-based catalogs:

!pip install "pyiceberg[s3fs,sql-sqlite]"

This installs PyIceberg along with the s3fs library, which allows PyIceberg to interact with MinIO’s S3-compatible object storage, plus the SQLite-backed SQL catalog extra in case you want to experiment with SQL-based catalogs later.
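
To confirm the installation worked, you can print the installed version from the same notebook (a quick sanity check; the exact version will depend on when you run it):

import pyiceberg

# Print the installed PyIceberg version
print(pyiceberg.__version__)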

Step 2: Configuring the Connection

Now that PyIceberg is installed, the next step is to configure the connection to Nessie's REST Catalog. We will use the PyIceberg API to set up a connection to the REST catalog and point it to MinIO for storage.

Below is the Python code to configure the Nessie catalog using PyIceberg:

from pyiceberg.catalog import load_catalog

# Set up the connection to Nessie's REST Catalog
catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main/",
    }
)

# Verify connection by listing namespaces
namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)

Key Configuration Points:

  • uri: The Nessie Iceberg REST endpoint. Inside the Docker network, the notebook reaches Nessie by its service name, so the URI is http://nessie:19120/iceberg/main/, where main is the Nessie branch the catalog operates against.
  • warehouse: The S3 bucket path where Iceberg tables are stored (s3://my-bucket/). In this setup it is configured on the Nessie server via nessie.catalog.default-warehouse, so it does not need to be repeated on the client.
  • S3 Credentials: Since MinIO mimics S3 storage, the MinIO credentials (admin and password) and the endpoint (http://minio:9000) are likewise supplied through Nessie's catalog configuration in docker-compose.yml.
  • py-io-impl: PyIceberg uses PyArrow as its default file IO implementation to read and write data in the object store; it only needs to be set explicitly if you want a different implementation.

Once this is set up, running the code will connect your notebook to the Nessie REST Catalog and print the available namespaces, confirming the connection.
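
Because the warehouse location and MinIO credentials are configured on the Nessie server, the client configuration above stays minimal. If you prefer to supply the storage settings from the client side instead, one possible sketch (using PyIceberg's standard s3.* and py-io-impl properties, with values taken from this docker-compose setup) looks like this:

from pyiceberg.catalog import load_catalog

# Client-side storage configuration instead of relying on Nessie's server-side settings
catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main/",
        "warehouse": "s3://my-bucket/",
        "s3.endpoint": "http://minio:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.region": "us-east-1",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
    }
)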

Step 3: Basic Table Operations

With the connection established, let's walk through some basic table operations using PyIceberg. We’ll create a namespace, a new table, and then insert data into that table.

Create a Namespace:

We will create a new namespace called demo where our tables will reside:

catalog.create_namespace("demo")

Create a Table:

Next, we create a table within the demo namespace. For this example, let's define a simple schema with two columns: id (integer) and name (string).

from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType

# Define the schema for the table
schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "name", StringType(), required=False)
)

# Create the table in the `demo` namespace
catalog.create_table("demo.sample_table", schema)

Insert Data into the Table:

Let's insert some simple data into the sample_table that we just created.

import pyarrow as pa

# Sample data; the Arrow schema mirrors the Iceberg schema
# (non-nullable int32 id, nullable string name) so the append passes PyIceberg's type check
data = pa.Table.from_pydict(
    {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]},
    schema=pa.schema([
        pa.field("id", pa.int32(), nullable=False),
        pa.field("name", pa.string()),
    ]),
)

# Append the data to the Iceberg table
table = catalog.load_table("demo.sample_table")
table.append(data)

Query the Data:

Finally, we can query the data from the table using PyIceberg and PyArrow:

result = table.scan().to_arrow()
print(result)

This will display the contents of the sample_table that were inserted earlier.
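
Scans can also push down row filters and column projections before the data is materialized. As a small illustrative example against the same table, the following reads only the name column for rows with id greater than 1:

# Scan with a row filter and a column projection
filtered = table.scan(
    row_filter="id > 1",
    selected_fields=("name",),
).to_arrow()
print(filtered)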

Section 4: Exploring Data and Metadata with PyIceberg

Now that we’ve successfully created and interacted with a table in Nessie’s REST Catalog, let’s dive deeper into Iceberg’s capabilities by exploring the table’s metadata and file structure. This section will show you how to inspect the underlying data and metadata files that Iceberg manages, giving you insights into your table’s structure, partitioning, and snapshots.

Step 1: Exploring the Table’s Metadata

Iceberg maintains detailed metadata about tables, including schema, partitioning, and snapshots. We can use PyIceberg’s API to retrieve and inspect this metadata.

Start by loading the table and retrieving its metadata:

# Load the table from the catalog
table = catalog.load_table("demo.sample_table")

# Get the table schema
schema = table.schema()
print("Table Schema:", schema)

# Get the table properties
properties = table.properties
print("Table Properties:", properties)

This will print out the schema of the sample_table and any table-level properties, such as file format, compression settings, and any custom configurations you’ve applied to the table.

Step 2: Inspecting Table Snapshots

Iceberg keeps track of all changes made to a table via snapshots, allowing for time travel and tracking changes over time. Let’s inspect the snapshots created so far in the sample_table.

# Inspect the table's snapshots
snapshots = table.snapshots()
print("Table Snapshots:", snapshots)

Each snapshot contains information about the changes made during that operation (e.g., data append, overwrite, or delete). You can use this to explore how your table evolved over time and roll back to previous states if needed.
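
For a quick, human-readable view, you can loop over the snapshots and print a few of the fields PyIceberg exposes on each Snapshot object:

from datetime import datetime, timezone

# Print the id, commit time, and operation of each snapshot
for snap in table.snapshots():
    committed_at = datetime.fromtimestamp(snap.timestamp_ms / 1000, tz=timezone.utc)
    operation = snap.summary.operation if snap.summary else None
    print(snap.snapshot_id, committed_at, operation)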

Step 3: Inspecting Partitions and Files

If your table is partitioned, Iceberg makes it easy to explore partition information. Our sample_table isn’t partitioned yet, so the result will be largely empty, but here is how you would inspect partitions through PyIceberg’s inspect API:

# Inspect table partitions (returned as an Arrow metadata table)
partitions = table.inspect.partitions()
print("Table Partitions:", partitions)

Additionally, you can inspect the actual data files that Iceberg manages behind the scenes:

# Inspect the data files tracked by the table (returned as an Arrow metadata table)
files = table.inspect.files()
print("Data Files:", files)

This will show you the physical files in MinIO that contain the table data. Iceberg manages these files for you, ensuring that you don’t need to worry about how the data is distributed across the object store.
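
If you want to see those same objects directly in MinIO, you can browse the bucket with s3fs (installed earlier as a PyIceberg extra), using the same credentials and endpoint defined in docker-compose.yml:

import s3fs

# Connect to MinIO with the credentials from the compose file
fs = s3fs.S3FileSystem(
    key="admin",
    secret="password",
    client_kwargs={"endpoint_url": "http://minio:9000"},
)

# Recursively list everything under the warehouse bucket,
# including both data files and Iceberg metadata files
for path in fs.find("my-bucket"):
    print(path)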

Step 4: Time Travel with Snapshots

One of Iceberg’s powerful features is time travel, which allows you to query a table as it existed at a previous point in time. Let’s demonstrate how you can query data from a specific snapshot:

# Retrieve the first snapshot's ID
first_snapshot = snapshots[0].snapshot_id

# Perform a table scan using the snapshot ID (time travel)
result = table.scan(snapshot_id=first_snapshot).to_arrow()
print("Data from the first snapshot:", result)

This allows you to access the historical state of your data, making it easy to audit changes or roll back to earlier versions if needed.

Step 5: Managing Table Properties and Partitions

You can update your table properties and evolve the schema or partitioning strategy over time. Let’s walk through how to update the table’s schema by adding a new column and evolving the partition strategy.

Adding a New Column:

from pyiceberg.types import IntegerType

# Update the table schema by adding a new, optional integer column
with table.update_schema() as update:
    update.add_column("age", IntegerType(), doc="Age of the person")

Evolving Partitions:

You can modify the partitioning strategy to optimize query performance. Since id is an integer column, we’ll add a bucket partition on it (a day transform, which is common for timestamp columns, would not apply to an integer field):

from pyiceberg.transforms import BucketTransform

# Evolve the partition spec by adding a bucketed partition field on `id`
with table.update_spec() as update:
    update.add_field("id", BucketTransform(16), "id_bucket")

These operations allow your Iceberg table to evolve without breaking existing queries or applications, maintaining compatibility over time.
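
To confirm the changes took effect, you can print the table’s current schema and partition spec, which should now include the age column and the new partition field:

# Verify the updated schema and partition spec
print(table.schema())
print(table.spec())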

Conclusion

In this blog, we walked through setting up a local data environment using Nessie’s REST Catalog, MinIO for object storage, and a Jupyter Notebook environment, all orchestrated through Docker Compose. You learned how to connect to Nessie using PyIceberg, perform basic table operations like creating namespaces, inserting data, and querying tables, and explored Iceberg’s powerful metadata management features such as time travel, snapshots, and partitions.

By following this guide, you now have a local setup that allows you to experiment with Iceberg tables in a flexible and scalable way. Whether you're looking to build a data lakehouse, manage large analytics datasets, or explore the inner workings of Iceberg, this environment provides a solid foundation for further experimentation.

Turning Off the Environment

Once you're done experimenting, you can easily turn off your Docker-based environment to free up resources. To do this, simply run the following command in your terminal:

docker-compose down

This will stop and remove the Docker containers for Nessie, MinIO, and the Jupyter Notebook. If you want to preserve any data in MinIO or Nessie, make sure to back it up before running this command.

Next Steps

Now that you’ve successfully set up and interacted with Iceberg tables locally, here are a few ideas to further extend your learning:

  • Experiment with Schema Evolution: Add new columns, rename existing ones, or change partitioning strategies without disrupting existing data.
  • Try Time Travel Queries: Use Iceberg’s powerful snapshot capabilities to explore time travel for auditing and version control.
  • Connect to External Object Stores: Replace MinIO with a cloud-based S3 bucket to test how Iceberg scales in the cloud.
  • Explore Dremio or Spark: Connect this setup with tools like Dremio or Apache Spark for larger-scale data analytics and query acceleration.

With the flexibility provided by Docker Compose and the capabilities of Nessie and Iceberg, the possibilities are vast. Keep experimenting, and feel free to adapt this setup to meet your unique data needs!

Ready to Get Started?

Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.