October 22, 2024
Hands-on with Apache Iceberg Tables using PyIceberg, Nessie, and MinIO
Flexibility and simplicity in managing metadata catalogs and storage solutions are key to efficient data platform management. Nessie’s REST Catalog Implementation brings this flexibility by centralizing table management across multiple environments in the cloud and on-prem, while PyIceberg provides an accessible Python implementation for interacting with Iceberg tables.
In this blog, we’ll walk through setting up a local environment using Docker Compose to integrate Nessie’s REST Catalog, MinIO for object storage, and a Jupyter Notebook environment. We’ll then demonstrate connecting to the catalog and performing basic table operations using PyIceberg. By the end of this guide, you’ll have a local setup that allows you to explore Iceberg tables and manage metadata in a simple, reproducible way.
Section 1: Prerequisites
Before we dive into setting up the environment, ensure that you have the following tools installed on your system:
- Docker: Docker allows us to create and manage containers for the various services we’ll be using. If you don’t have Docker installed, you can download Docker Desktop from the Docker website.
- Docker Compose: Docker Compose enables the management of multi-container Docker applications. Typically, Docker Compose is bundled with Docker Desktop, but you can also install it separately by following the official instructions.
- Basic Knowledge of PyIceberg, Nessie, and MinIO:
  - PyIceberg: A Python implementation of Iceberg, a table format for large analytics datasets. It enables you to work with Iceberg tables without needing a Java Virtual Machine (JVM).
  - Nessie: A metadata management service providing Git-like operations on data tables, with REST Catalog support to simplify catalog configurations.
  - MinIO: A high-performance object storage solution compatible with S3, which we will use as the storage backend.
With these tools in place, let’s move on to setting up the environment using Docker Compose.
Section 2: Docker Compose Setup
To create a local data environment using Nessie’s REST Catalog, MinIO for storage, and Jupyter for data exploration, we will leverage Docker Compose. This setup allows us to easily manage and orchestrate all the necessary services, ensuring a seamless workflow for working with Iceberg tables.
Step 1: Creating the Docker Compose File
First, we need to define our `docker-compose.yml` file, which will configure and launch the necessary services. This file will include:
- Nessie REST Catalog: To manage metadata and Iceberg tables.
- MinIO: As the object storage service where our Iceberg tables will reside.
- Jupyter Notebook: To interact with Nessie and perform table operations using PyIceberg.
Below is an example `docker-compose.yml` configuration:
```yaml
version: '3'
services:
  nessie:
    image: ghcr.io/projectnessie/nessie:0.99.0
    container_name: nessie
    ports:
      - "19120:19120"
    environment:
      - nessie.version.store.type=IN_MEMORY
      - nessie.catalog.default-warehouse=warehouse
      - nessie.catalog.warehouses.warehouse.location=s3://my-bucket/
      - nessie.catalog.service.s3.default-options.endpoint=http://minio:9000/
      - nessie.catalog.service.s3.default-options.access-key=urn:nessie-secret:quarkus:nessie.catalog.secrets.access-key
      - nessie.catalog.service.s3.default-options.path-style-access=true
      - nessie.catalog.secrets.access-key.name=admin
      - nessie.catalog.secrets.access-key.secret=password
      - nessie.catalog.service.s3.default-options.region=us-east-1
      - nessie.server.authentication.enabled=false
    networks:
      - nessie-net
  minio:
    image: quay.io/minio/minio
    container_name: minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5;
      mc alias set myminio http://localhost:9000 admin password;
      mc mb myminio/my-bucket --ignore-existing;
      tail -f /dev/null"
    networks:
      - nessie-net
  notebook:
    image: alexmerced/datanotebook:latest
    ports:
      - "8888:8888"
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
    networks:
      - nessie-net
networks:
  nessie-net:
```
Key Points in the Configuration:
- Nessie Service: The Nessie REST server runs on port `19120` and is configured to store Iceberg table data in the `s3://my-bucket/` path on MinIO.
- MinIO Service: MinIO runs on ports `9000` and `9001` and is initialized with a bucket called `my-bucket` for storing table files.
- Jupyter Notebook Service: This service uses the `alexmerced/datanotebook` image, providing a Python environment with all the necessary libraries pre-installed, including PyIceberg.
Step 2: Running Docker Compose
Once the `docker-compose.yml` file is ready, we can bring up the environment by running the following command:
```bash
docker-compose up
```
This command will start all the services defined in the Compose file. You should see output logs as each service starts up. Once everything is running, you can verify the services are working as expected:
- MinIO Console: Access the MinIO console at `http://localhost:9001` to view the object storage and confirm that the bucket `my-bucket` has been created.
- Nessie REST Catalog: Nessie will be available at `http://localhost:19120` (there is no UI; it exposes a REST API).
- Jupyter Notebook: The notebook environment can be accessed by navigating to `http://localhost:8888` in your browser.
Step 3: Accessing Jupyter Notebook
When you open `http://localhost:8888`, you will be greeted by the Jupyter Notebook interface. No token is required to log in, as token authentication is disabled in this setup for simplicity. From here, you can start a new Python notebook and begin interacting with Nessie and MinIO through PyIceberg.
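Before moving on, it can help to confirm that the other containers are reachable from inside the notebook. Below is a minimal sketch that assumes the service names from the Compose file (`nessie`, `minio`) and two standard health endpoints (Nessie’s `/api/v2/config` and MinIO’s `/minio/health/live`); adjust if your setup differs.

```python
import urllib.request

# Quick reachability check from inside the notebook container.
# Service names resolve via the Docker network defined in docker-compose.yml.
for name, url in {
    "Nessie": "http://nessie:19120/api/v2/config",
    "MinIO": "http://minio:9000/minio/health/live",
}.items():
    with urllib.request.urlopen(url) as resp:
        print(f"{name}: HTTP {resp.status}")
```

If both services report HTTP 200, the notebook can reach Nessie and MinIO over the shared Docker network.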
Section 3: Connecting to Nessie with PyIceberg
Now that our local environment is up and running, it's time to connect to Nessie's REST Catalog using PyIceberg from within our Jupyter notebook. This section will guide you through installing PyIceberg in the notebook environment and configuring it to interact with the Nessie catalog and MinIO storage.
Step 1: Installing PyIceberg in the Notebook
To use PyIceberg in your notebook, we need to install the necessary dependencies. Since our Jupyter environment is already running inside the `alexmerced/datanotebook` Docker container, open a new Python notebook and run the following command to install PyIceberg with support for S3 and SQL-based catalogs:
```python
!pip install pyiceberg[s3fs,sql-sqlite]
```
This installs PyIceberg along with the S3FS library, which allows PyIceberg to interact with MinIO’s S3-compatible object storage. Additionally, we include support for SQL-based catalogs in case you want to experiment with SQL-backed catalogs later.
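If you want to confirm which release was installed (the APIs used later in this post assume a reasonably recent PyIceberg, roughly 0.7 or newer), you can print the version:

```python
import pyiceberg

# Print the installed PyIceberg version
print(pyiceberg.__version__)
```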
Step 2: Configuring the Connection
Now that PyIceberg is installed, the next step is to configure the connection to Nessie's REST Catalog. We will use the PyIceberg API to set up a connection to the REST catalog and point it to MinIO for storage.
Below is the Python code to configure the Nessie catalog using PyIceberg:
```python
from pyiceberg.catalog import load_catalog

# Set up the connection to Nessie's REST Catalog
catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main/",
    }
)

# Verify connection by listing namespaces
namespaces = catalog.list_namespaces()
print("Namespaces:", namespaces)
```
Key Configuration Points:
- `uri`: The URL of Nessie’s Iceberg REST endpoint. From inside the Docker network the notebook reaches it at `http://nessie:19120/iceberg/main/`, where `main` is the Nessie branch the catalog works against (from the host, the same endpoint is exposed on `localhost:19120`).
- Warehouse: The S3 bucket path where Iceberg tables will be stored, `s3://my-bucket/`, was configured on the Nessie service in the Docker Compose file, so it does not need to be repeated in the client configuration.
- S3 Credentials: Since MinIO mimics S3 storage, the MinIO credentials (`admin` and `password`) and the `http://minio:9000` endpoint are likewise configured on the Nessie service; the REST catalog passes the relevant storage settings to PyIceberg when tables are loaded, and the notebook container also has matching `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables set in the Compose file.
- File IO: PyIceberg picks a file IO implementation automatically (PyArrow, or fsspec/s3fs, depending on what is installed) to read and write data in the object store, so no explicit `py-io-impl` setting is required here.
Once this is set up, running the code will connect your notebook to the Nessie REST Catalog and print the available namespaces, confirming the connection.
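As an aside, the same settings can be supplied outside the notebook code. PyIceberg can read catalog configuration from a `.pyiceberg.yaml` file or from environment variables of the form `PYICEBERG_CATALOG__<NAME>__<PROPERTY>`; the sketch below assumes that convention and is intended to be equivalent to the `load_catalog` call above.

```python
import os

from pyiceberg.catalog import load_catalog

# Equivalent configuration via PyIceberg's environment-variable convention.
# The variable name follows the PYICEBERG_CATALOG__<NAME>__<PROPERTY> pattern.
os.environ["PYICEBERG_CATALOG__NESSIE__URI"] = "http://nessie:19120/iceberg/main/"

catalog = load_catalog("nessie")
print(catalog.list_namespaces())
```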
Step 3: Basic Table Operations
With the connection established, let's walk through some basic table operations using PyIceberg. We’ll create a namespace, a new table, and then insert data into that table.
Create a Namespace:
We will create a new namespace called `demo` where our tables will reside:
```python
catalog.create_namespace("demo")
```
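If you re-run the notebook, this cell will fail because the namespace already exists. One way to keep it re-runnable is to catch the corresponding exception (exception name taken from `pyiceberg.exceptions`; adjust if your version differs):

```python
from pyiceberg.exceptions import NamespaceAlreadyExistsError

# Create the namespace only if it does not already exist
try:
    catalog.create_namespace("demo")
except NamespaceAlreadyExistsError:
    print("Namespace 'demo' already exists")
```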
Create a Table:
Next, we create a table within the `demo` namespace. For this example, let's define a simple schema with two columns: `id` (integer) and `name` (string).
```python
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType

# Define the schema for the table
schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "name", StringType(), required=False),
)

# Create the table in the `demo` namespace
catalog.create_table("demo.sample_table", schema)
```
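At this point the table exists in the catalog but contains no data. You can confirm it was registered by listing the tables in the namespace:

```python
# Confirm the new table is registered under the demo namespace
print(catalog.list_tables("demo"))
```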
Insert Data into the Table:
Let's insert some simple data into the `sample_table` that we just created.
```python
import pyarrow as pa

# Create an Arrow table with sample data.
# The explicit Arrow schema (non-nullable int32 id, nullable string name)
# matches the Iceberg table schema defined above.
arrow_schema = pa.schema(
    [
        pa.field("id", pa.int32(), nullable=False),
        pa.field("name", pa.string(), nullable=True),
    ]
)
data = pa.Table.from_pydict(
    {"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]},
    schema=arrow_schema,
)

# Append the data to the Iceberg table
table = catalog.load_table("demo.sample_table")
table.append(data)
```
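Note that `append` adds these rows on top of whatever is already in the table, so re-running the cell will duplicate the data. If you want to replace the table contents instead, PyIceberg tables also expose an `overwrite` method; a minimal sketch:

```python
# Replace the current contents of demo.sample_table with `data`
table.overwrite(data)
```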
Query the Data:
Finally, we can query the data from the table using PyIceberg and PyArrow:
```python
result = table.scan().to_arrow()
print(result)
```
This will display the contents of the `sample_table` that were inserted earlier.
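Scans don't have to read the whole table. PyIceberg's scan API accepts a row filter and a column projection, which it uses when planning which files to read; for example:

```python
# Read only the names of rows with id greater than 1
filtered = table.scan(
    row_filter="id > 1",
    selected_fields=("name",),
).to_arrow()
print(filtered)
```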
Section 4: Exploring Data and Metadata with PyIceberg
Now that we’ve successfully created and interacted with a table in Nessie’s REST Catalog, let’s dive deeper into Iceberg’s capabilities by exploring the table’s metadata and file structure. This section will show you how to inspect the underlying data and metadata files that Iceberg manages, giving you insights into your table’s structure, partitioning, and snapshots.
Step 1: Exploring the Table’s Metadata
Iceberg maintains detailed metadata about tables, including schema, partitioning, and snapshots. We can use PyIceberg’s API to retrieve and inspect this metadata.
Start by loading the table and retrieving its metadata:
```python
# Load the table from the catalog
table = catalog.load_table("demo.sample_table")

# Get the table schema
schema = table.schema()
print("Table Schema:", schema)

# Get the table properties (a dict of table-level settings)
properties = table.properties
print("Table Properties:", properties)
```
This will print out the schema of the `sample_table` and any table-level properties, such as file format, compression settings, and any custom configurations you’ve applied to the table.
Step 2: Inspecting Table Snapshots
Iceberg keeps track of all changes made to a table via snapshots, allowing for time travel and tracking changes over time. Let’s inspect the snapshots created so far in the `sample_table`.
```python
# Inspect the table's snapshots
snapshots = table.snapshots()
print("Table Snapshots:", snapshots)
```
Each snapshot contains information about the changes made during that operation (e.g., data append, overwrite, or delete). You can use this to explore how your table evolved over time and roll back to previous states if needed.
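Each `Snapshot` object carries an ID, a commit timestamp, and an operation summary, so you can print a compact history like this:

```python
# Print one line per snapshot: id, commit time (ms since epoch), and summary
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)
```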
Step 3: Inspecting Partitions and Files
If your table is partitioned, Iceberg makes it easy to explore partition information. Our `sample_table` isn’t partitioned yet (we’ll add a partition field in Step 5), but here’s how you would inspect partitions once it is:
```python
# Inspect table partitions via the metadata tables API
partitions = table.inspect.partitions()
print("Table Partitions:", partitions)
```
Additionally, you can inspect the actual data files that Iceberg manages behind the scenes:
```python
# List the physical data files behind the table using the scan planning API
for task in table.scan().plan_files():
    print("Data File:", task.file.file_path)
```
This will show you the physical files in MinIO that contain the table data. Iceberg manages these files for you, ensuring that you don’t need to worry about how the data is distributed across the object store.
Step 4: Time Travel with Snapshots
One of Iceberg’s powerful features is time travel, which allows you to query a table as it existed at a previous point in time. Let’s demonstrate how you can query data from a specific snapshot:
```python
# Retrieve the first snapshot's ID
first_snapshot = snapshots[0].snapshot_id

# Perform a table scan using the snapshot ID (time travel)
result = table.scan(snapshot_id=first_snapshot).to_arrow()
print("Data from the first snapshot:", result)
```
This allows you to access the historical state of your data, making it easy to audit changes or roll back to earlier versions if needed.
Step 5: Managing Table Properties and Partitions
You can update your table properties and evolve the schema or partitioning strategy over time. Let’s walk through how to update the table’s schema by adding a new column and evolving the partition strategy.
Adding a New Column:
```python
from pyiceberg.types import IntegerType

# Update the table schema by adding a new, optional column
with table.update_schema() as update:
    update.add_column("age", IntegerType(), doc="Age of the person")
```
Evolving Partitions:
You can modify the partitioning strategy to optimize query performance:
```python
from pyiceberg.transforms import BucketTransform

# Evolve the partition spec by adding a new partition field.
# A day transform only applies to date/timestamp columns, so for the
# integer `id` column we bucket it instead.
with table.update_spec() as update:
    update.add_field("id", BucketTransform(16), "id_bucket")
```
These operations allow your Iceberg table to evolve without breaking existing queries or applications, maintaining compatibility over time.
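To confirm both changes took effect, reload the table and print its current schema and partition spec:

```python
# Reload the table metadata and show the evolved schema and partition spec
table = catalog.load_table("demo.sample_table")
print(table.schema())
print(table.spec())
```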
Conclusion
In this blog, we walked through setting up a local data environment using Nessie’s REST Catalog, MinIO for object storage, and a Jupyter Notebook environment, all orchestrated through Docker Compose. You learned how to connect to Nessie using PyIceberg, perform basic table operations like creating namespaces, inserting data, and querying tables, and explored Iceberg’s powerful metadata management features such as time travel, snapshots, and partitions.
By following this guide, you now have a local setup that allows you to experiment with Iceberg tables in a flexible and scalable way. Whether you're looking to build a data lakehouse, manage large analytics datasets, or explore the inner workings of Iceberg, this environment provides a solid foundation for further experimentation.
Turning Off the Environment
Once you're done experimenting, you can easily turn off your Docker-based environment to free up resources. To do this, simply run the following command in your terminal:
```bash
docker-compose down
```
This will stop and remove the Docker containers for Nessie, MinIO, and the Jupyter Notebook. If you want to preserve any data in MinIO or Nessie, make sure to back it up before running this command.
Next Steps
Now that you’ve successfully set up and interacted with Iceberg tables locally, here are a few ideas to further extend your learning:
- Experiment with Schema Evolution: Add new columns, rename existing ones, or change partitioning strategies without disrupting existing data.
- Try Time Travel Queries: Use Iceberg’s powerful snapshot capabilities to explore time travel for auditing and version control.
- Connect to External Object Stores: Replace MinIO with a cloud-based S3 bucket to test how Iceberg scales in the cloud.
- Explore Dremio or Spark: Connect this setup with tools like Dremio or Apache Spark for larger-scale data analytics and query acceleration.
With the flexibility provided by Docker Compose and the capabilities of Nessie and Iceberg, the possibilities are vast. Keep experimenting, and feel free to adapt this setup to meet your unique data needs!