19 minute read · August 1, 2024
Getting Hands-on with Polaris OSS, Apache Iceberg and Apache Spark
Senior Tech Evangelist, Dremio
Data lakehouses built with the Apache Iceberg table format are rapidly gaining popularity. A crucial component of an Iceberg lakehouse is the catalog, which tracks your tables, making them discoverable by various tools like Dremio, Snowflake, Apache Spark, and more. Recently, a new community-driven open-source catalog named Polaris has emerged at the forefront of open-source Iceberg catalog discussions. Below are some links to content regarding the announcement of the open-sourcing of Polaris.
- Snowflake's Polaris Announcement
- Polaris Github Repo
- Datanami article on the Nessie and Polaris Merging
- Blog: Introduction to Polaris
Getting Hands-on
In this blog, we'll discuss many of the technology's concepts and walk through the steps for getting hands-on with the current incarnation of Polaris.
Polaris Conceptually
Polaris is a cutting-edge catalog implementation for Apache Iceberg that utilizes the open-source Apache Iceberg REST Catalog protocol to provide centralized and secure access to Iceberg tables across various REST-compatible query engines like Snowflake, Apache Spark, Apache Flink, and Dremio. It supports internal and external catalogs, enabling organizations to efficiently manage and organize their Iceberg tables. With features such as namespace creation for logical grouping and metadata management for tables, Polaris Catalog becomes the open option for the Iceberg catalog that forms the foundational layer of your Iceberg lakehouse. The catalog ensures atomic operations and robust storage configurations for S3, Azure, or Google Cloud Storage.
Security and access control are pivotal to Polaris Catalog, employing a role-based access control (RBAC) model to manage permissions across all registered tables consistently. The catalog generates service principals to encapsulate credentials for query engine connections and uses credential vending to secure query execution. Additionally, storage configurations establish a trust relationship between cloud storage providers and Polaris Catalog. This comprehensive approach ensures that organizations can effectively manage, secure, and optimize their Iceberg data infrastructure.
Setup
You'll need git and Docker installed on your computer to follow the steps in this guide. Once you are ready, do the following:
NOTE: Polaris was only just released on July 30th, 2024, two days before this blog was written, so you may run into bugs. Like any open-source project, you can help accelerate development by contributing or by filing issues for any bugs you run into at the Polaris GitHub repository.
- Go to the Polaris GitHub Repo and "fork" the repo to create your own copy of the repository in your GitHub account.
- Then clone the repo somewhere on your machine
git clone git@github.com:your-user-name/polaris.git
- Then open up the repository folder in your favorite text/code editor like Visual Studio Code.
- Create a new file called dremio-blog-compose.yml with the following content
services:
  polaris:
    build:
      context: .
      network: host
    container_name: polaris
    ports:
      - "8181:8181"
      - "8182"
    networks:
      polaris-quickstart:
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      - AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyyyyy
    command: # override the command to specify aws keys as dropwizard config
      - server
      - polaris-server.yml

  # Spark
  spark:
    platform: linux/x86_64
    image: alexmerced/spark35notebook:latest
    ports:
      - 8080:8080  # Master Web UI
      - 7077:7077  # Master Port
      - 8888:8888  # Notebook
    environment:
      - AWS_REGION=us-east-1
      - AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      - AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyyyyyyyyy
    container_name: spark
    networks:
      polaris-quickstart:

networks:
  polaris-quickstart:
Make sure to update the environment variables for both services with your AWS credentials. Next, we will run this file with Docker, which will start the Polaris and Spark services; keep an eye on the output, as you'll need some details from the output of both services.
docker compose -f dremio-blog-compose.yml up --build
The -f flag specifies the compose file to use, and the --build flag ensures that any images defined in the compose file that need to be built are built. Once the image is built and the containers are running, we have Polaris running in our local environment.
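If you want to script this setup, a handy trick is to wait for the Polaris API port to start accepting connections before firing off any requests. Below is a minimal, optional Python sketch using only the standard library; the localhost host, port 8181, and the timeout value are assumptions based on the compose file above.

import socket
import time

def wait_for_port(host: str, port: int, timeout_seconds: int = 120) -> None:
    """Poll a TCP port until it accepts connections or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            # Attempt a TCP connection; success means the service is listening
            with socket.create_connection((host, port), timeout=2):
                print(f"{host}:{port} is accepting connections")
                return
        except OSError:
            time.sleep(2)  # not up yet, try again shortly
    raise TimeoutError(f"{host}:{port} did not open within {timeout_seconds} seconds")

# 8181 is the Polaris API port mapped in the compose file above
wait_for_port("localhost", 8181)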
Keep an eye out in the output for the Polaris root principal credentials in a line that looks like this:
polaris-1 | realm: default-realm root principal credentials: fa44645a04410a0e:f1b82a42de2295da466682d3cfdbb0f1
Also keep an eye out for the URL of the Jupyter Notebook server for working with Spark, which should look like this:
http://127.0.0.1:8888/lab?token=ce51dff2516cb218408fe79c75ac3a8c959b2ee12db45015
Creating an ARN for S3 Access
The documentation details what you'll need to access S3, Azure and GCP but for this tutorial we will use S3.
Traditionally, before the REST catalog specification, storage details were handled by the client. When you connected to a catalog with Spark, Flink, Dremio, etc., you had to pass not only catalog credentials to access the catalog but also storage credentials to access the storage layer holding your data, which led to tedious and confusing configuration for users.
The REST catalog specification allows storage credentials to be handled by the server, which means the end user doesn't have to worry about storage credentials; they just have to concern themselves with connecting their preferred tools to the catalog, making it feel even more like a traditional database system where catalog and storage are coupled.
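To make that difference concrete, here is a rough, illustrative before-and-after sketch of client-side Spark configuration. The "old style" properties (a Hadoop-type catalog plus fs.s3a credentials) represent the pre-REST pattern and aren't taken from this tutorial; the actual REST-based configuration we'll use appears in full in the Spark section later on.

from pyspark import SparkConf

# Old pattern: the client must carry storage credentials alongside catalog config
old_style = (
    SparkConf()
        .set("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .set("spark.sql.catalog.lake.type", "hadoop")
        .set("spark.sql.catalog.lake.warehouse", "s3a://somebucket/warehouse")
        .set("spark.hadoop.fs.s3a.access.key", "xxxxxxxx")   # storage creds in every client
        .set("spark.hadoop.fs.s3a.secret.key", "yyyyyyyy")
)

# REST catalog pattern: only the catalog endpoint and catalog credentials;
# storage credentials are vended by the server
rest_style = (
    SparkConf()
        .set("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
        .set("spark.sql.catalog.polaris.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
        .set("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
        .set("spark.sql.catalog.polaris.credential", "clientid:clientsecret")
)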
For Polaris, you can have multiple catalogs created on the server, and for S3 those catalog settings need an ARN for a role that has access to the S3 bucket you plan to write to. The docker-compose file sets the AWS region to us-east-1, so to keep things simple, make sure your bucket is in that region. Here is some example JSON of what the access policy for that role could look like:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket" ], "Resource": [ "arn:aws:s3:::your-bucket-name", "arn:aws:s3:::your-bucket-name/*" ] } ] }
Once you have created your role, you should have an ARN that looks like this.
arn:aws:iam::################:role/polaris-storage
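Because the bucket's region has to line up with the AWS_REGION value in the compose file, it can be worth double-checking before moving on. Here is an optional sketch using boto3 (an extra dependency that is assumed, not part of this tutorial's setup); the bucket name is a placeholder, and note that S3 reports us-east-1 buckets with an empty LocationConstraint.

import boto3  # assumes boto3 is installed and your AWS credentials are configured locally

BUCKET = "your-bucket-name"      # hypothetical placeholder, use your bucket
EXPECTED_REGION = "us-east-1"    # should match AWS_REGION in the compose file

s3 = boto3.client("s3")
location = s3.get_bucket_location(Bucket=BUCKET)["LocationConstraint"]
bucket_region = location or "us-east-1"  # None/empty means us-east-1

if bucket_region != EXPECTED_REGION:
    print(f"Warning: bucket is in {bucket_region}, but Polaris is configured for {EXPECTED_REGION}")
else:
    print(f"Bucket region matches: {bucket_region}")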
Managing Our Polaris Server
The Python CLI in the Polaris repository is still being developed, so we will use raw API calls to the Polaris management API to do the following:
- Create a Catalog
- Create a Catalog Role for that Catalog
- Create a Principal (user)
- Create a Principal Role
- Assign the Catalog Role to the Principal Role
- Grant the Catalog Role the Ability to Manage Content in our Catalog
The benefit of this approach to RBAC (role-based access control) is that users and catalogs aren't connected directly but through intermediate roles:
- A principal role can hold a variety of catalog roles for different catalogs
- A catalog role can be assigned to many principal roles
- A principal can be assigned a variety of principal roles that give varying levels of access to different catalogs
Getting our Authorization Token
Using the root credentials, we can get our authorization token for our subsequent authorization headers with the following request:
curl -i -X POST \
  http://localhost:8181/api/catalog/v1/oauth/tokens \
  -d 'grant_type=client_credentials&client_id=3308616f33ef2cfe&client_secret=620fa1d5850199bc7628155693977bc1&scope=PRINCIPAL_ROLE:ALL'
Make sure to update the client_id and client_secret with the credentials you got earlier. It will return a token you'll want to copy somewhere for reference, which looks like this:
principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL
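If you'd rather drive these management steps from Python (for example, from the Jupyter environment we started earlier), here is a small sketch of the same token request using the requests library, which is assumed to be available. Swap in the client_id and client_secret from your own root principal credentials; the sketch assumes a standard OAuth-style JSON response containing an access_token field.

import requests

POLARIS_HOST = "http://localhost:8181"
CLIENT_ID = "3308616f33ef2cfe"                       # replace with your root client id
CLIENT_SECRET = "620fa1d5850199bc7628155693977bc1"   # replace with your root client secret

# Same request as the curl command above, just expressed with requests
response = requests.post(
    f"{POLARIS_HOST}/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
response.raise_for_status()
token = response.json()["access_token"]  # assumption: OAuth-style response body
print(token)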
Creating Our Catalog and Principal
### CREATING THE CATALOG
curl -i -X POST \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  http://localhost:8181/api/management/v1/catalogs \
  -d '{"name": "polariscatalog", "type": "INTERNAL", "properties": {
        "default-base-location": "s3://somebucket/somefolder/"
      }, "storageConfigInfo": {
        "roleArn": "arn:aws:iam::############:role/polaris-storage-role",
        "storageType": "S3",
        "allowedLocations": [
          "s3://somebucket/somefolder"
        ]
      }}'

### CREATING THE PRINCIPAL
curl -X POST "http://localhost:8181/api/management/v1/principals" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"name": "polarisuser", "type": "user"}'
Pay special attention to the output of creating the principal, as it will return an access key and secret for that user that you'll need later on.
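Continuing the optional Python approach, the same two calls can be made with requests, reusing the token variable from the sketch above; the bucket path and role ARN are placeholders you'd replace with your own values, and the full response from the principal creation is printed so you can copy the credentials it returns.

import requests

POLARIS_MGMT = "http://localhost:8181/api/management/v1"
headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}

# Create the catalog (same payload as the curl command above)
catalog_payload = {
    "name": "polariscatalog",
    "type": "INTERNAL",
    "properties": {"default-base-location": "s3://somebucket/somefolder/"},
    "storageConfigInfo": {
        "roleArn": "arn:aws:iam::############:role/polaris-storage-role",
        "storageType": "S3",
        "allowedLocations": ["s3://somebucket/somefolder"],
    },
}
print(requests.post(f"{POLARIS_MGMT}/catalogs", json=catalog_payload, headers=headers).status_code)

# Create the principal and print the whole response to capture its credentials
principal_resp = requests.post(
    f"{POLARIS_MGMT}/principals",
    json={"name": "polarisuser", "type": "user"},
    headers=headers,
)
print(principal_resp.json())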
Creating our Catalog Role and Principal Role
### CREATE PRINCIPAL ROLE
curl -X POST "http://localhost:8181/api/management/v1/principal-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"principalRole": {"name": "polarisuserrole"}}'

### ASSIGN PRINCIPAL ROLE TO PRINCIPAL
curl -X PUT "http://localhost:8181/api/management/v1/principals/polarisuser/principal-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"principalRole": {"name": "polarisuserrole"}}'

### CREATE A CATALOG ROLE FOR OUR CATALOG
curl -X POST "http://localhost:8181/api/management/v1/catalogs/polariscatalog/catalog-roles" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"catalogRole": {"name": "polariscatalogrole"}}'
So with this, we have:
- Assigned the role we created "polarisuserrole" to "polarisuser"
- Assigned the role "polariscatalogrole" to "polariscatalog"
Note that in these URLs we pass the names of the catalog and principal that we set when we created them.
Assign the Catalog Role to the Principal Role
curl -X PUT "http://localhost:8181/api/management/v1/principal-roles/polarisuserrole/catalog-roles/polariscatalog" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"catalogRole": {"name": "polariscatalogrole"}}'
So now we have everything connected:
- polarisuser gets access from polarisuserrole
- polarisuserrole gets access to polariscatalog via polariscatalogrole
Now we just have to grant the catalog role some privileges.
Grant Privileges to Catalog Role
curl -X PUT "http://localhost:8181/api/management/v1/catalogs/polariscatalog/catalog-roles/polariscatalogrole/grants" \
  -H "Authorization: Bearer principal:root;password:620fa1d5850199bc7628155693977bc1;realm:default-realm;role:ALL" \
  -H "Content-Type: application/json" \
  -d '{"grant": {"type": "catalog", "privilege": "CATALOG_MANAGE_CONTENT"}}'
Now we have a user/principal who can make changes to our catalog, so we can now open the Jupyter Notebook URL from earlier.
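Before moving over to Spark, you can optionally sanity-check what we just created. The sketch below assumes the management API also supports GET requests on the same catalogs, principals, and principal-roles endpoints we used above for POST; if listing behaves differently in your build of Polaris, rely on the curl commands above as the source of truth.

import requests

POLARIS_MGMT = "http://localhost:8181/api/management/v1"
headers = {"Authorization": f"Bearer {token}"}  # token from the earlier sketch

# Assumed listing endpoints mirroring the POST endpoints used above
for resource in ("catalogs", "principals", "principal-roles"):
    resp = requests.get(f"{POLARIS_MGMT}/{resource}", headers=headers)
    print(resource, resp.status_code, resp.json() if resp.ok else resp.text)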
Using Polaris in Spark
Now that Polaris has been configured, we can create a new Python notebook and run the following code:
import pyspark
from pyspark.sql import SparkSession
import os

## DEFINE SENSITIVE VARIABLES
POLARIS_URI = 'http://polaris:8181/api/catalog'
POLARIS_CATALOG_NAME = 'polariscatalog'
POLARIS_CREDENTIALS = 'a3b1100071704a25:3920a59d4e73f8c2dc1e89d00b4ee67f'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        # packages
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.4.0')
        # SQL Extensions
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        # Configuring Catalog
        .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
        .set('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation', 'true')
        .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
        .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
        .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
        .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
        .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

## Run a Query
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db").show()
spark.sql("CREATE TABLE polaris.db.names (name STRING) USING iceberg").show()
spark.sql("INSERT INTO polaris.db.names VALUES ('Alex Merced'), ('Andrew Madson')").show()
spark.sql("SELECT * FROM polaris.db.names").show()
Keep in mind that the POLARIS_CREDENTIALS variable should be set to the credentials of your new user, not the root principal, in the form "accesskey:secretkey". Reminder: Polaris has just been released, so there may be additional troubleshooting or privilege granting needed to run the particular SQL you want to run, but the above should serve as a good template for getting things configured in your Spark notebooks.
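As a quick follow-up once the queries above succeed, Iceberg's metadata tables are a handy way to confirm that commits are actually landing in the catalog. The snippet below assumes the same spark session and the polaris.db.names table from the notebook code above.

# Inspect the table's history and snapshots through Iceberg metadata tables
spark.sql("SELECT * FROM polaris.db.names.history").show(truncate=False)
spark.sql("SELECT snapshot_id, committed_at, operation FROM polaris.db.names.snapshots").show(truncate=False)

# Confirm the schema of the table we created
spark.sql("DESCRIBE TABLE polaris.db.names").show()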
As mentioned in the Datanami article linked at the top, some of the open-source Nessie catalog code may find its way into Polaris, so below are some exercises to get hands-on with Nessie to learn about what may be in store for the future of Polaris.
Here are Some Exercises for you to See Nessie’s Features at Work on Your Laptop
- Intro to Nessie, and Apache Iceberg on Your Laptop
- From SQLServer -> Apache Iceberg -> BI Dashboard
- From MongoDB -> Apache Iceberg -> BI Dashboard
- From Postgres -> Apache Iceberg -> BI Dashboard
- From MySQL -> Apache Iceberg -> BI Dashboard
- From Elasticsearch -> Apache Iceberg -> BI Dashboard
- From Kafka -> Apache Iceberg -> Dremio