Data lakehouses built with the Apache Iceberg table format are rapidly gaining popularity. A crucial component of an Iceberg lakehouse is the catalog, which tracks your tables and makes them discoverable by tools like Dremio, Snowflake, Apache Spark, and more. Recently, a new community-driven open-source catalog named Polaris has emerged at the forefront of open-source Iceberg catalog discussions.
In this blog, we'll discuss many of the technology's concepts and walk through the steps for getting hands-on with the current incarnation of Polaris.
Polaris Conceptually
Polaris is a cutting-edge catalog implementation for Apache Iceberg that uses the open-source Apache Iceberg REST catalog protocol to provide centralized, secure access to Iceberg tables across REST-compatible query engines like Snowflake, Apache Spark, Apache Flink, and Dremio. It supports internal and external catalogs, enabling organizations to efficiently manage and organize their Iceberg tables. With features such as namespace creation for logical grouping and metadata management for tables, Polaris becomes an open option for the Iceberg catalog that forms the foundational layer of your Iceberg lakehouse. The catalog ensures atomic operations and robust storage configurations for S3, Azure, or Google Cloud Storage.
Security and access control are pivotal to Polaris Catalog, employing a role-based access control (RBAC) model to manage permissions across all registered tables consistently. The catalog generates service principals to encapsulate credentials for query engine connections and uses credential vending to secure query execution. Additionally, storage configurations establish a trust relationship between cloud storage providers and Polaris Catalog. This comprehensive approach ensures that organizations can effectively manage, secure, and optimize their Iceberg data infrastructure.
Setup
You'll need git and Docker installed on your computer to follow the steps in this guide. Once you are ready, do the following:
NOTE: Polaris was only just released on July 30th, 2024, two days before this blog was written, so you may run into bugs. As with any open-source project, you can help accelerate development by contributing, or by filing issues for any bugs you run into at the Polaris GitHub repository.
Go to the Polaris GitHub repo and "fork" it to create your own copy of the repository in your GitHub account.
If you would rather use your local storage to experiment with Polaris, refer to this GitHub repository with directions on how to use it using the FILE storage type.
Make sure to update the environment variables for both services with your AWS credentials. Next, we will run this file with Docker, which will start the Polaris and Spark services; keep an eye on the output, as you'll need some details from it for both services.
docker compose -f dremio-blog-compose.yml up --build
The -f flag specifies the file to use, and the --build flag ensures that any images defined in the docker-compose file are built. Once the images are built and running, we have Polaris running in our local environment.
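For reference, here is a minimal sketch of what such a compose file might look like. The service names, build/image details, and port mappings are assumptions for illustration; match them to the actual file in the repository:

```yaml
version: "3.8"
services:
  polaris:
    # Placeholder: the real compose file builds Polaris from the repo source
    build: .
    ports:
      - "8181:8181"
    environment:
      AWS_REGION: us-west-2
      AWS_ACCESS_KEY_ID: <your-access-key>
      AWS_SECRET_ACCESS_KEY: <your-secret-key>
  spark-notebook:
    # Placeholder image for a Jupyter + PySpark environment
    image: jupyter/pyspark-notebook
    ports:
      - "8888:8888"
    environment:
      AWS_REGION: us-west-2
      AWS_ACCESS_KEY_ID: <your-access-key>
      AWS_SECRET_ACCESS_KEY: <your-secret-key>
```

Both services receive the same AWS credentials so that Polaris can vend storage access and the Spark notebook can read and write table data.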
Keep an eye out in the output for the Polaris root principal credentials in a line that looks like this:
polaris-1 | realm: default-realm root principal credentials: fa44645a04410a0e:f1b82a42de2295da466682d3cfdbb0f1
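If you want to capture those two values from the log line programmatically, shell parameter expansion is enough. This is a small sketch; the sample line below reuses the credentials shown above:

```shell
# The credentials line printed by the Polaris container looks like:
#   realm: default-realm root principal credentials: <client_id>:<client_secret>
CRED_LINE='realm: default-realm root principal credentials: fa44645a04410a0e:f1b82a42de2295da466682d3cfdbb0f1'

# Everything after the last space is the "<client_id>:<client_secret>" pair
CREDS="${CRED_LINE##* }"

# Split on the colon into the two values needed for the OAuth request below
CLIENT_ID="${CREDS%%:*}"
CLIENT_SECRET="${CREDS#*:}"

echo "client_id=$CLIENT_ID"
echo "client_secret=$CLIENT_SECRET"
```
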
Also keep an eye out for the URL of the Jupyter Notebook server for working with Spark, which appears in the output as well.
The documentation details what you'll need to access S3, Azure, and GCP, but for this tutorial we will use S3.
Traditionally, before the REST catalog specification, storage details were handled by the client. When you connected to a catalog with Spark, Flink, Dremio, etc., you had to pass not only catalog credentials to access the catalog but also storage credentials to access the storage layer holding your data, which led to tedious and confusing configuration for users.
The REST catalog specification allows storage credentials to be handled by the server, which means the end user doesn't have to worry about them; they only need to connect their preferred tools to the catalog, making it feel even more like a traditional database system where catalog and storage are coupled.
For Polaris, you can have multiple catalogs created on the server, and for S3, in each catalog's settings we need to pass an ARN for a role that has access to the S3 buckets you hope to write to. The docker-compose file sets the region for the bucket to us-west-2, so to keep things simple, make sure your bucket is in that region. Here is some example JSON showing what the access policy for that role might look like.
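Since the policy itself isn't reproduced above, here is a hedged sketch of an S3 access policy for that role. The bucket name is a placeholder, and you may want to trim the actions down to what your workload actually needs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket>",
        "arn:aws:s3:::<your-bucket>/*"
      ]
    }
  ]
}
```

Note that the `ListBucket` and `GetBucketLocation` actions apply to the bucket ARN itself, while the object actions apply to the `/*` resource, which is why both resource entries are present.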
The Python CLI in the Polaris repository is still under development, so we will use raw API calls to the Polaris management API to do the following:
Create a Catalog
Create a Catalog Role for that Catalog
Create Principal (user)
Create a Principal Role
Assign the Catalog Role to the Principal Role
Grant the Catalog Role the Ability to Manage Content in our Catalog
The benefit of this approach to RBAC (Role-based access controls) is that users and catalogs aren't directly connected but connected by the intermediate roles.
A principal role can have a variety of catalog roles for different catalogs
A catalog role can be assigned to many principal roles
A principal can be assigned a variety of principal roles that give varying levels of access to different catalogs
Getting our Authorization Token
Using the root credentials, we can get our authorization token for our subsequent authorization headers with the following request:
curl -i -X POST \
  http://localhost:8181/api/catalog/v1/oauth/tokens \
  -d 'grant_type=client_credentials&client_id=3308616f33ef2cfe&client_secret=620fa1d5850199bc7628155693977bc1&scope=PRINCIPAL_ROLE:ALL'
Make sure to update the client_id and client_secret with the credentials you got earlier. It will return a token you'll want to copy somewhere for reference.
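The response is a standard OAuth2 token payload, so it should look roughly like the following (the token value here is a placeholder; the exact fields and lifetime depend on the server):

```json
{
  "access_token": "<long-token-string>",
  "token_type": "bearer",
  "expires_in": 3600
}
```

The `access_token` value is what you will paste into the `Authorization: Bearer` header of subsequent management API requests.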
Make sure to pay special attention to the output of creating the principal, as it will return an access key and secret for that user that you'll need later on.
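The catalog and principal creation calls aren't reproduced above, but they follow the same pattern as the other management API calls. Here is a hedged sketch; the bucket, ARN, and account ID are placeholders, and the request body fields reflect the management API at the time of writing, so check the current spec if a call is rejected:

```shell
### CREATE A CATALOG (placeholders: bucket, role ARN)
curl -X POST "http://localhost:8181/api/management/v1/catalogs" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"catalog": {"name": "polariscatalog", "type": "INTERNAL", "readOnly": false, "properties": {"default-base-location": "s3://<your-bucket>/"}, "storageConfigInfo": {"storageType": "S3", "roleArn": "arn:aws:iam::<account-id>:role/<your-role>", "allowedLocations": ["s3://<your-bucket>/"]}}}'

### CREATE A PRINCIPAL (USER) - the response contains the user's credentials
curl -X POST "http://localhost:8181/api/management/v1/principals" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"principal": {"name": "polarisuser"}}'
```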
Creating our Catalog Role and Principal Role
### CREATE PRINCIPAL ROLE
curl -X POST "http://localhost:8181/api/management/v1/principal-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"principalRole": {"name": "polarisuserrole"}}'

### ASSIGN PRINCIPAL ROLE TO PRINCIPAL
curl -X PUT "http://localhost:8181/api/management/v1/principals/polarisuser/principal-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"principalRole": {"name": "polarisuserrole"}}'

### CREATE A CATALOG ROLE FOR OUR CATALOG
curl -X POST "http://localhost:8181/api/management/v1/catalogs/polariscatalog/catalog-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"catalogRole": {"name": "polariscatalogrole"}}'
So with this, we have:
Created the principal role "polarisuserrole" and assigned it to "polarisuser"
Created the catalog role "polariscatalogrole" for "polariscatalog"
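The last two steps from our list, assigning the catalog role to the principal role and granting the catalog role the ability to manage content, follow the same pattern. This is a sketch; the endpoints and the CATALOG_MANAGE_CONTENT privilege name reflect the management API at the time of writing:

```shell
### ASSIGN CATALOG ROLE TO PRINCIPAL ROLE
curl -X PUT "http://localhost:8181/api/management/v1/principal-roles/polarisuserrole/catalog-roles/polariscatalog" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"catalogRole": {"name": "polariscatalogrole"}}'

### GRANT THE CATALOG ROLE MANAGE CONTENT ON THE CATALOG
curl -X PUT "http://localhost:8181/api/management/v1/catalogs/polariscatalog/catalog-roles/polariscatalogrole/grants" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"grant": {"type": "catalog", "privilege": "CATALOG_MANAGE_CONTENT"}}'
```

With these calls, anyone holding the principal role "polarisuserrole" can manage content in "polariscatalog" through the catalog role.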
Note that in several of the URLs we pass the names of the catalog and principal that we set when we created them.
Keep in mind that the POLARIS_CREDENTIALS variable should equal the credentials of your new user, not the root principal, in the form "accesskey:secretkey". Reminder: Polaris has just been released, so additional troubleshooting or privilege granting may be needed to run the particular SQL you want, but the above should serve as a good template for getting things configured in your Spark notebooks.
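For reference, the Spark settings for an Iceberg REST catalog pointed at Polaris look roughly like the following. The catalog name, warehouse value, and credential placeholder are assumptions to illustrate the shape of the configuration; the header setting asks Polaris to vend storage credentials to the client:

```properties
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=http://polaris:8181/api/catalog
spark.sql.catalog.polaris.credential=<accesskey>:<secretkey>
spark.sql.catalog.polaris.warehouse=polariscatalog
spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL
spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials
```

With this in place, `USE polaris;` in Spark SQL should let you create namespaces and tables in the catalog you configured earlier.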
As mentioned in the Datanami article covering the announcement, some of the open-source Nessie catalog's code may find its way into Polaris, so below are some exercises to get hands-on with Nessie and learn what may be in store for the future of Polaris.
Here are Some Exercises for you to See Nessie’s Features at Work on Your Laptop
Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop