When reading documentation or following tutorials, a frequent source of confusion is the set of configuration properties used to set up the catalog during your Spark session, which can look like the following when using the Spark CLI:
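As a sketch, a minimal Iceberg setup launched from the Spark shell might look like this (the package version, catalog name, and bucket path are illustrative assumptions, not required values):

```shell
# Launch spark-sql with the Iceberg runtime and a catalog named
# "my_iceberg_catalog" backed by a warehouse path (illustrative values).
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.my_iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_iceberg_catalog.type=hadoop \
  --conf spark.sql.catalog.my_iceberg_catalog.warehouse=s3a://my-bucket/path/
```

Each of these settings is explained in the sections that follow.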
Wait, What Is a Catalog?
An Apache Iceberg catalog is a mechanism that lets tooling identify which Apache Iceberg tables are available; it provides the engine with the location of the metadata.json file that currently defines a given table. You can learn more by reading these articles about Apache Iceberg reads and Apache Iceberg writes.
An Apache Spark catalog is a mechanism in the Spark session that enables Spark to discover the tables available to work with. Our Iceberg configurations create a Spark catalog and link it to an existing Iceberg catalog.
The Primary Settings
Here are the typical settings you’ll use regardless of which catalog you choose.
This setting identifies any packages that need to be downloaded and used during the Spark session. Libraries you’ll often specify here include the Apache Iceberg Spark runtime and, if you’re using Nessie, the Nessie Spark SQL extensions.
This specifies any SQL extensions that should be present in the Spark session; this is what makes the special SQL syntax for Apache Iceberg or Project Nessie (branching and merging) available in the session.
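As a sketch, enabling both the Iceberg and the Nessie SQL extensions in the same session might look like the fragment below (combining the two is only needed if you use Nessie; the class names are the published extension classes):

```shell
# Enable Iceberg's SQL syntax plus Nessie's branch/merge syntax (comma-separated).
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
```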
The settings below configure your specific catalog, which lives under a namespace of your choosing. For this Spark session we’ll name the catalog “my_iceberg_catalog”, to make it clear where the catalog name you choose appears in each setting.
This specifies where data files for new tables should be warehoused. It does not affect existing tables, since their location is already recorded in their metadata. The value can be a local file system path or an object storage path, such as “s3a://my-bucket/path/” for S3.
This specifies which class to use for writing to the file system. It is mainly needed when using cloud object storage; a standard Hadoop file system implementation is used if it is not specified, so you can exclude this property when writing to HDFS or a local filesystem. Possible values include:
org.apache.iceberg.aws.s3.S3FileIO (If using an S3 compatible store)
org.apache.iceberg.gcp.gcs.GCSFileIO (If using Google Cloud storage)
org.apache.iceberg.dell.ecs.EcsFileIO (If using Dell ECS)
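For example, pointing the catalog at S3-compatible storage with S3FileIO might look like this (the catalog name and bucket path are assumptions carried over from the earlier example):

```shell
# Use the S3 implementation of FileIO for "my_iceberg_catalog"
--conf spark.sql.catalog.my_iceberg_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_iceberg_catalog.warehouse=s3a://my-bucket/path/
```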
Nessie-Specific Settings
Project Nessie is an open source transactional catalog that goes beyond a normal catalog by providing catalog-level snapshots and branching, which enables the following:
Isolation of ingestion
Catalog-level rollbacks
Catalog-level time travel
Branching as a way to make zero-copy clones for experimentation
Tagging at the catalog level for reproducibility
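Once a Nessie-backed catalog is configured and the Nessie SQL extensions are loaded, this workflow is available directly in SQL. A sketch of isolated ingestion on a branch (the branch name and catalog name are hypothetical, and the exact syntax can vary between Nessie versions):

```sql
-- Create an isolated branch off the current state of the catalog
CREATE BRANCH IF NOT EXISTS etl_test IN my_iceberg_catalog;

-- Point this session's reads and writes at the branch
USE REFERENCE etl_test IN my_iceberg_catalog;

-- ...run and validate your ingestion jobs on the branch...

-- Publish the validated changes back to main in one atomic step
MERGE BRANCH etl_test INTO main IN my_iceberg_catalog;
```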
If you haven’t tried out Project Nessie, here are a few articles on it and the cloud service, Dremio Arctic, which provides cloud-managed Nessie catalogs with extensive lakehouse management features:
If you are using ECS, MinIO, or another S3-compatible service, make sure to specify the endpoint in your settings. For S3 and compatible stores, make sure to define your credential environment variables (for example, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) prior to starting up Spark.
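Putting the pieces together, a Nessie-backed catalog writing to a local MinIO endpoint might be launched like this. Every value here is an illustrative assumption: the credential placeholders, package versions, Nessie URI, MinIO endpoint, and bucket path should all be replaced with your own.

```shell
# Credentials for S3FileIO, defined before starting Spark (placeholder values)
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=password
export AWS_REGION=us-east-1

spark-sql \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0" \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions \
  --conf spark.sql.catalog.my_iceberg_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_iceberg_catalog.catalog-impl=org.apache.iceberg.nessie.NessieCatalog \
  --conf spark.sql.catalog.my_iceberg_catalog.uri=http://localhost:19120/api/v1 \
  --conf spark.sql.catalog.my_iceberg_catalog.ref=main \
  --conf spark.sql.catalog.my_iceberg_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_iceberg_catalog.s3.endpoint=http://localhost:9000 \
  --conf spark.sql.catalog.my_iceberg_catalog.warehouse=s3a://my-bucket/path/
```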
For fuller, end-to-end examples of these settings in the shell and in PySpark, here are some other articles you can use for reference:
- Intro to Dremio, Nessie, and Apache Iceberg on Your Laptop
- 5 Use Cases for the Dremio Lakehouse
- Dremio Arctic is Now Your Data Lakehouse Catalog in Dremio Cloud