Table of Contents
Managing Data as Code with Dremio Arctic – Easily ensure data quality in your data lakehouse
When it comes to data there are several challenges that may impact the quality of data you provide consumers that can result in complex and fragile pipelines and sometimes make the visibility of issues worse.
Luckily, the data lakehouse comes to the rescue. The combination of Dremio and Apache Iceberg allows you to simplify a lot of modern data challenges so you can focus on producing insights, not fixing pipelines. Today we’ll focus on the role of Dremio Arctic in ensuring data quality.
Dremio Arctic is a data catalog for your data lakehouse that can be accessed with any engine. It uniquely enables data teams to use isolation to ensure data quality, recover from mistakes easily, and audit the data from one place. It is powered by the open source Project Nessie, which enables Git-like semantics at the data catalog level.
First, we’ll discuss a few data quality use cases that Dremio Arctic can assist with, and then we’ll detail a hands-on exercise so you can try it out for yourself.
Data Quality Problems and Solutions
I’m ingesting data and not catching substantial quality issues before the data is queried.
Problem: You ingest the data, and upon first inspection, the results look great. Unfortunately, you hear from your data consumers (analysts or data scientists) that there is missing data resulting from transformations that didn’t play out correctly, and now you're in a high-pressure race to fix the problem.
Solution: With an Arctic catalog you can create a branch and have your ETL pipelines land the new data in the branch to isolate the data. This allows you to test the data, make changes, and then merge the branch back into the “main” branch that all your data consumers query against.
I ran a faulty update and now I’m in for the lengthy process of discovering and fixing impaired data.
Problem: You complete an upsert and realize there is a typo in your SQL statement that affects the entire job and now you have to figure out how to repair all the tainted records.
Solution: With an Arctic catalog every change is tracked as a commit and you can easily revert the state of your data lake so all you have to do is correct the typo and rerun your job.
Data consumers depend on the joining of several tables, and if I update them one by one, concurrent queries may work off partially updated data.
Problem: You have a fact table that depends on several dimension tables and they all need some updates. You update them back to back, but in the time between each update, queries come in and may join updated tables with yet-to-be-updated tables.
Solution: With an Arctic catalog you can update all the tables in a branch. When you’re done, you merge the branch, and all the tables become live on the main branch simultaneously for querying.
Hands-On Exercise Setup
The first step is to sign up for a free Dremio Cloud account. The process should only take a few minutes. Follow the docs if you have any questions. You can also follow a tour through the signup process and UI by watching these videos:
Once your Dremio account (i.e., organization) is set up, you should create a Dremio Sonar project and a Dremio Arctic catalog.
From the screen after sign-up, select a Dremio Sonar project.
After selecting Add a Sonar project, there are two quick steps:
- Enter a name for the project
- Configure your AWS account to enable Dremio to provision the necessary resources (don’t worry, it’s really easy)
(Note: The new project starts with a minimal 2XS engine, which automatically shuts off after being idle for a period of time. To create new engines or configure existing engine sizes and termination rules head into Engine Settings from Project Settings.)
From your Sonar project, to add an Arctic Catalog to work with, click “Add Source” in the bottom left.
Then select Arctic as your source, which gives you an option to select an existing Arctic catalog or to create a new one. Let’s create a new catalog.
Once the source is added, you can always switch between Sonar projects and Arctic catalogs by clicking on an icon in the menu on the left.
Setting Up Your Example Data
The following datasets were generated for this demonstration using Mockaroo. The datasets represent sales of individual items to customers including the customer name, product, price, and date of purchase. The first dataset represents our existing data, and the staging dataset will represent incoming data from a recent batch or streaming ETL job. Please download the following two datasets:
Next, navigate to your home space in the Sonar project. (You’ll see the home icon denoting the home space in your account.)
In the upper right corner, you’ll see a plus (+) icon that allows you to upload the files you just downloaded.
Once you select the CSV to upload make sure to select “Extract Field Names.” Then do the same for the second CSV.
Once both are added, you’ll see them listed as tables.
The purple dataset icons represent the actual datasets (in this case, made up by the files), whereas green dataset icons represent logical views.
Now that the CSV files are accessible from your Sonar project, let’s convert them to Iceberg tables in your Arctic catalog. To do so, head over to the SQL runner and run a simple CTAS statement.
CREATE TABLE arctic.exmpl.sales AS SELECT * FROM "@username"."sales-data"
If you inspect your data in Sonar you should see this dataset under your Arctic catalog.
Now let’s begin your exercise!
Isolating, Validating, and Reverting
The first benefit of Arctic is that you can isolate your ETL work. Since any engine can communicate with an Arctic catalog, a Spark Streaming, Flink, or Dremio Sonar job could land the data in your Arctic catalog in an ETL branch instead of landing it in the main branch, thereby preventing downstream consumers from querying the data before you’ve run quality checks.
To simulate this, create an ETL branch and then land your staging dataset in this branch.
CREATE BRANCH etl_10_25_2022 IN arctic; USE BRANCH etl_10_25_2022 IN arctic; CREATE TABLE arctic.exmpl.staging AS SELECT * FROM "@[email protected]"."staging-data";
Now let’s switch over to the Arctic project so you can see the result of these queries from the Arctic UI. Notice that when you look at the main branch you only see the sales table. The staging table is not there, nor is it exposed to downstream data consumers querying the main branch.
If you switch over to your ETL branch now, you’ll see the isolated sales and staging table so you can do an upsert with peace of mind.
Let’s head back to Sonar and run an upsert query within the ETL branch. Make sure to double-check that SQL runner’s context is set to the ETL branch (the “ref” property you see in the upper right of the SQL runner window or you can run the “USE BRANCH” statement again before the MERGE INTO statement).
MERGE INTO arctic.exmpl.sales t USING (SELECT * FROM arctic.exmpl.staging) s ON t.id = s.id WHEN MATCHED THEN UPDATE SET name = s.name, product = s.product, price = s.price, "date" = s."date" WHEN NOT MATCHED THEN INSERT (id, name, product, price, "date") VALUES (s.id, s.name, s.product, s.price, s."date")
Now let’s query the sales data from both branches to see that the changes were isolated to the ETL branch. Run a COUNT query to see the number of records in the sales table from the ETL branch (after it finishes running click on the second query summary to see the results).
USE BRANCH etl_10_25_2022 in arctic; SELECT COUNT(*) FROM arctic.exmpl.sales;
If you run the same query from the main branch the record count is still 500, indicating that your upsert has not affected any queries from your data consumers, and you can safely focus on checking the quality of the data on the branch.
Now that we’ve demonstrated how you can isolate your workloads, let’s take advantage of the isolation of your data; this can include verifying referential integrity, checking value bounds, looking for missing or incorrect data, and more.
In this case, you are concerned about missing price data that can cause issues in tracking your revenue accurately. So you will do the following:
- Check to see if any of the records are missing price data
- Move any incorrect data to another table so you can backfill and merge it in after doing so
- Remove the incorrect records from the table
Let’s run a query to see if any of the prices are missing in the merged sales table from the ETL branch.
SELECT * FROM arctic.exmpl.sales WHERE price = ''
It looks like you have 54 rows that present a problem.
(Note: You may have noticed that I queried the table for prices that equaled an empty string. This is because when CSV files are imported, all columns are treated as text. From Dremio Sonar’s UI you can adjust data types of columns by clicking the three dots next to a field name and selecting “Convert Data Type,” making Dremio very convenient for last-mile ETL work like joins, field type conversions, creating calculated columns, and so forth. That said, if you know SQL, you may find it easier to simply make the necessary changes with SQL.)
Let’s isolate the 54 records into their own table for rehabilitation.
CREATE TABLE arctic.exmpl.broken_records AS (SELECT * FROM arctic.exmpl.sales WHERE price = '');
Now let’s drop the records.
DELETE FROM arctic.exmpl.sales WHERE price = '71.60'
Only two records were deleted… yet you had 54 records to delete. You made a mistake and deleted records with “71.60” instead of records with an empty price (“”). Usually, you may be quite alarmed when something like this happens, but luckily you are using Arctic as your catalog and every change is a commit that you can revert to. You just need to browse the Arctic UI to get the commit number you’d like to revert to.
Simply switch back to Arctic and inspect the commits in your ETL branch.
As you can see the newest commit was a change to your sales table which was your erroneous query. You want to roll back the branch to the previous commit that created the broken_records table. Click on the commit and copy over the full commit ID. You can use Sonar to run a query to revert your branch.
Then you rollback from Sonar with an ALTER BRANCH ASSIGN COMMIT command.
ALTER BRANCH etl_xx_xx_2022 ASSIGN COMMIT "xxxxxxxxxxxxxx" IN arctic
If all goes well you should be able to query the sales table and see that it has 1000 records again.
SELECT * FROM arctic.exmpl.sales
Success! Luckily, thanks to Arctic’s “Data as Code” paradigm, you were able to revert your mistake as easily as a developer can revert in Git. Now you can finish up your ETL job and run the query you meant to run originally.
DELETE FROM arctic.exmpl.sales WHERE price = ''
The expected 54 rows were deleted this time. Now you can safely drop the staging table and merge back into the main branch. With the data merged, the updated sales table can be queried by your data consumers.
DROP TABLE arctic.exmpl.staging; MERGE BRANCH etl_xx_xx_2022 INTO main IN arctic;
Now let’s switch back to the main branch and run one last COUNT query to ensure the merge was successful.
USE BRANCH main IN arctic; SELECT COUNT(*) from arctic.exmpl.sales;
Success, you see your new data in main and none of the problematic records are in the table.
Connecting to Your Arctic Catalog with Spark
One of the best features of the Arctic catalog is that it works with any engine. To demonstrate this, let’s connect to our Arctic catalog using Apache Spark and query the data.
The first step is to generate a personal access token for authentication. Click to access your account settings where you can create a token. You should find this on the bottom left corner of the screen.
Create a new personal token and write it down somewhere safe so you can refer to it in the coming steps.
Next step, let’s start a Docker container running Spark with the following command:
docker run -it alexmerced/spark33playground
To successfully connect to Arctic you need to define a few environmental variables:
TOKEN: This is your personal access token from Dremio that will authorize Spark to read your Arctic catalog.
WAREHOUSE: If you plan on doing any writing to your catalog, Spark will want to know where it should be writing the files to. Make sure the URI includes the trailing slash.
AWS credentials: To read and write to your S3 account, you need to define the AWS_REGION, AWS_SECRET_ACCESS_KEY, and AWS_ACCESS_KEY_ID to provide Spark AWS credentials.
ARCTIC_URI: This is the endpoint for your Arctic catalog. On the UI of your Arctic catalog click on the gear to bring up the project settings and you’ll find the URI there.
Run the following commands from your Docker container:
export TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx export WAREHOUSE=s3a://xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx export ARCTIC_URI=https://nessie.dremio.cloud/v1/repositories/xxxxxxxxxxxxx export AWS_REGION=xxxxxxxxxxxx
The warehouse should be the bucket your Arctic catalog is configured to on Dremio Sonar. To double-check this from Sonar, right-click on the Arctic catalog and select “edit details” and navigate to the storage section where you’ll find the bucket under the property “AWS Root path.” You can also use any AWS S3 path your credentials have access to read and write.
Once all these variables are set the following command should open up Spark-SQL with your Arctic catalog under the namespace “arctic”:
spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178" \ --conf spark.sql.extensions="org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions" \ --conf spark.sql.catalog.arctic.uri=$ARCTIC_URI \ --conf spark.sql.catalog.arctic.ref=main \ --conf spark.sql.catalog.arctic.authentication.type=BEARER \ --conf spark.sql.catalog.arctic.authentication.token=$TOKEN \ --conf spark.sql.catalog.arctic.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \ --conf spark.sql.catalog.arctic.warehouse=$WAREHOUSE \ --conf spark.sql.catalog.arctic=org.apache.iceberg.spark.SparkCatalog \ --conf spark.sql.catalog.arctic.io-impl=org.apache.iceberg.aws.s3.S3FileIO
Here is a brief overview of the flags we are passing to configure Spark:
This defines any packages Spark should download which include Iceberg, Nessie, and the AWS SDK. If your environment uses versions of Spark earlier than 3.3 you may need to adjust this flag; read the appropriate sections of the Iceberg and Nessie documentation for details.
This adds the Iceberg and Nessie SQL extensions so you can use Iceberg/Nessie specific SQL from Spark-SQL.
--conf spark.sql.catalog.arctic.uri=$ARCTIC_URI \ --conf spark.sql.catalog.arctic.ref=main \ --conf spark.sql.catalog.arctic.authentication.type=BEARER \ --conf spark.sql.catalog.arctic.authentication.token=$TOKEN \ --conf spark.sql.catalog.arctic.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \ --conf spark.sql.catalog.arctic.warehouse=$WAREHOUSE \ --conf spark.sql.catalog.arctic=org.apache.iceberg.spark.SparkCatalog \ --conf spark.sql.catalog.arctic.io-impl=org.apache.iceberg.aws.s3.S3FileIO
These settings configure our catalog. The catalog doesn’t have to be called Arctic; you can change it to whatever name you’d like, but that’s the namespace the catalog will be accessible at during your Spark session. So here we pass all the info for authenticating Arctic: where to write files to when writing to our catalog, stating that it is a Nessie-type catalog, and setting our read/write implementation to that for AWS S3.
Once Spark-SQL opens up let’s run a quick query on our data:
SELECT * FROM arctic.exmpl.sales;
There it is, you can read and write to the data from Spark like you could from Dremio. You can also take advantage of the Nessie Spark-SQL extensions and create, use, and merge branches from Spark. For example you can list the branches that exist with the command.
LIST REFERENCES IN arctic;
(Note: If you run into any issues loading the Nessie SQL extension, this GitHub issue should assist.)
By following this exercise, you have experienced firsthand how Dremio Arctic can isolate workloads, enable data auditing, and revert mistakes to make your data quality workflows more effective and less stressful. Everything you just did can be done from other engines like Spark and Flink, allowing multiple users to work on multiple tables with multiple engines.