March 25, 2025
Securing Your Apache Iceberg Data Lakehouse

Head of Security and Compliance at Dremio

The rise of the data lakehouse architecture, exemplified by Apache Iceberg, offers flexibility and scalability for data management and analytics. However, the decentralized nature of storage and diverse access paths inherent in lakehouses present unique security challenges compared to traditional databases. Ensuring the confidentiality, integrity, and availability of Iceberg data requires a multi-layered approach, addressing security at the storage, infrastructure, and catalog levels.
The examples below use AWS controls; equivalent controls from any of the major cloud providers can be substituted.

Securing the underlying object storage, such as S3 buckets, is a fundamental aspect of protecting Iceberg data. This includes several key measures:
- Encrypting data at rest using S3 server-side encryption, with Amazon-managed keys (SSE-S3), customer-managed KMS keys (SSE-KMS), or customer-provided keys (SSE-C), depending on your risk appetite.
- Restricting access to S3 buckets through the use of IAM policies, bucket policies, and Access Control Lists (ACLs). This ensures that only authorized entities can interact with the raw data files.
- Controlling network access using VPC endpoints and security groups to limit the pathways through which data can be accessed.
- Enabling S3 bucket logging and monitoring to track access patterns and detect suspicious activities.
- Employing S3 Object Lock to prevent accidental or malicious deletion of data objects.

The importance of these measures is underscored by past S3 data leaks resulting from misconfigurations, which exposed vast amounts of sensitive customer data.
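As a concrete illustration, here is a minimal boto3 sketch that applies a few of these baseline controls to a bucket. The bucket name, KMS key alias, and logging target are placeholders you would replace with your own, and your organization's policies may require additional settings (Object Lock, bucket policies, and so on).

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-iceberg-warehouse"      # placeholder bucket name
KMS_KEY = "alias/iceberg-data-key"   # placeholder customer-managed KMS key
LOG_BUCKET = "my-access-log-bucket"  # placeholder bucket for access logs

# Default to SSE-KMS encryption for every object written to the bucket.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": KMS_KEY,
            },
            "BucketKeyEnabled": True,
        }]
    },
)

# Block all forms of public access to the raw data files.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable server access logging so access patterns can be audited.
s3.put_bucket_logging(
    Bucket=BUCKET,
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": LOG_BUCKET,
            "TargetPrefix": f"{BUCKET}/",
        }
    },
)
```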
Beyond storage, securing the cloud infrastructure layer where ETL workloads and business intelligence tools reside is critical. Compromised infrastructure can provide a path to sensitive data. Key security requirements at this layer include:
- Segmentation of environments and workloads.
- Careful configuration of security group settings to control network traffic.
- Workload isolation to limit the impact of a potential breach.
- Data source isolation and segmentation.
- VPC isolation to create private network spaces.
- AWS account-level isolation for strong organizational boundaries.

Past cloud infrastructure breaches highlight the risks of misconfigurations and vulnerabilities at this level.
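One way to implement the network-side controls is a gateway VPC endpoint whose policy only permits access to the lakehouse bucket, keeping Iceberg traffic on the AWS network rather than the public internet. The sketch below is illustrative; the VPC, route table, and bucket identifiers are placeholders.

```python
import json
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder identifiers -- substitute your own VPC, route table, and bucket.
VPC_ID = "vpc-0123456789abcdef0"
ROUTE_TABLE_ID = "rtb-0123456789abcdef0"
BUCKET_ARN = "arn:aws:s3:::my-iceberg-warehouse"

# Endpoint policy that only allows reads and writes against the lakehouse bucket.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [BUCKET_ARN, f"{BUCKET_ARN}/*"],
    }],
}

# A gateway endpoint routes S3 traffic privately within the VPC.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=[ROUTE_TABLE_ID],
    PolicyDocument=json.dumps(endpoint_policy),
)
```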
Catalog-level governance plays a crucial role in managing access to Iceberg tables and their metadata. Catalogs provide centralized metadata management, ensuring consistent data definitions. They also facilitate namespace organization, logically grouping tables. Implementing Role-Based Access Control (RBAC) within catalogs allows administrators to define user roles and permissions, ensuring only authorized users can access or modify specific datasets. Some advanced catalogs also support fine-grained permissions, enabling precise data governance.
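Each catalog exposes RBAC through its own grant model (Polaris, Nessie, and Dremio differ here), so the sketch below is a deliberately generic Python illustration of the idea rather than any product's API: roles map to permitted actions on specific namespaces, and every request is checked against that mapping before table metadata is served.

```python
# Illustrative-only RBAC check; real catalogs enforce this server-side
# with their own role and grant APIs.
ROLE_GRANTS = {
    "analyst":       {("sales", "read")},                      # read-only on the sales namespace
    "data_engineer": {("sales", "read"), ("sales", "write")},  # read/write on sales
    "admin":         {("*", "read"), ("*", "write")},          # everything
}

def is_authorized(role: str, namespace: str, action: str) -> bool:
    """Return True if the role may perform the action on the namespace."""
    grants = ROLE_GRANTS.get(role, set())
    return (namespace, action) in grants or ("*", action) in grants

# Example: an analyst can read sales tables but cannot write to them.
assert is_authorized("analyst", "sales", "read")
assert not is_authorized("analyst", "sales", "write")
```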
To further enhance security when clients interact with Iceberg data stored in object stores like S3 and GCS, Iceberg catalog credential vending offers a robust solution. This mechanism involves the catalog (such as Nessie or Polaris) mediating access to the underlying storage. When a client, like a query engine, attempts to access an Iceberg table, it communicates with the catalog to locate the table metadata and request short-lived credentials.
The catalog's credential vending component, aware of the storage resources needed, interacts with cloud access services like AWS STS or GCS equivalents to generate temporary, downscoped credentials with limited lifetimes and specific permissions for the requesting client and table. These generated credentials allow the client to directly access the permitted data locations in the object store using the permitted privileges.
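On AWS, this typically amounts to an sts:AssumeRole call with an inline session policy that scopes the temporary credentials down to the single table location being accessed. The sketch below shows that mechanism in isolation; the role ARN, table path, and 15-minute lifetime are placeholders, not any particular catalog's implementation.

```python
import json
import boto3

sts = boto3.client("sts")

# Placeholder values -- a real catalog derives these from the table metadata.
STORAGE_ROLE_ARN = "arn:aws:iam::123456789012:role/iceberg-storage-access"
TABLE_LOCATION_ARN = "arn:aws:s3:::my-iceberg-warehouse/warehouse/sales/orders"

# The session policy further restricts whatever the role itself allows:
# the vended credentials can only read this one table's prefix.
session_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [f"{TABLE_LOCATION_ARN}/*"],
    }],
}

response = sts.assume_role(
    RoleArn=STORAGE_ROLE_ARN,
    RoleSessionName="iceberg-credential-vending-demo",
    Policy=json.dumps(session_policy),
    DurationSeconds=900,  # short-lived: 15 minutes
)

# These temporary keys are what the catalog would hand back to the client.
creds = response["Credentials"]
print(creds["AccessKeyId"], creds["Expiration"])
```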
The catalog then passes these temporary credentials back to the Iceberg client, which uses them for secure data access. Once the credentials expire, the client must re-request them, ensuring continuous security and preventing the prolonged use of potentially compromised credentials. This approach provides a more secure and manageable way to access Iceberg data compared to sharing long-term credentials.
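On the client side, engines that speak the Iceberg REST catalog protocol can opt into this flow by requesting access delegation from the catalog. The PySpark sketch below follows the Iceberg REST and Spark configuration conventions, but the catalog name and URI are placeholders, so check your catalog's documentation for the exact properties it expects.

```python
from pyspark.sql import SparkSession

# Placeholder catalog name and endpoint -- substitute your REST catalog's
# URI (Polaris, Nessie, etc.) and its authentication settings.
spark = (
    SparkSession.builder
    .appName("iceberg-vended-credentials-demo")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://catalog.example.com/api/catalog")
    # Ask the catalog to vend short-lived, downscoped storage credentials
    # instead of relying on long-term keys configured on the engine.
    .config("spark.sql.catalog.lakehouse.header.X-Iceberg-Access-Delegation", "vended-credentials")
    .getOrCreate()
)

# The engine reads the table with the temporary credentials returned by the
# catalog, re-requesting them when they expire.
spark.sql("SELECT COUNT(*) FROM lakehouse.sales.orders").show()
```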
In conclusion, securing an Apache Iceberg lakehouse demands a holistic strategy that encompasses multiple layers of control. Robust measures at the object storage level, such as encryption and access restrictions, protect the raw data. Strengthening the cloud infrastructure layer through segmentation and isolation minimizes the risk of breaches spreading to sensitive data. Finally, catalog-level governance with features like RBAC and credential vending enables fine-grained access management to Iceberg tables and their underlying data, so that only authorized users and applications can reach the resources they need. Together, these layers fortify the overall security posture of the Iceberg data lakehouse.