56 minute read · September 16, 2024

Leveraging Apache Iceberg Metadata Tables in Dremio for Effective Data Lakehouse Auditing

Alex Merced · Head of DevRel, Dremio

Overview of Data Lakehouse Architecture

The Role of Apache Iceberg and Dremio in Modern Data Lakehouses

Importance of Table Monitoring and Auditing

How Metadata Tables Assist in Monitoring and Auditing

Setting Up the Environment

Prerequisites

Setting Up Dremio, MinIO, and Nessie Using Docker Compose

Verifying the Environment Setup

Creating an Apache Iceberg Table and Generating Snapshots

Connecting to Dremio

Creating an Iceberg Table in Dremio

Running Transactions to Generate Snapshots

Viewing Snapshots in Apache Iceberg

Querying the Apache Iceberg Metadata Tables in Dremio

Overview of Iceberg Metadata Tables

Querying Metadata Tables One by One

Practical Use Cases

Summary

Conclusion

Organizations are inundated with vast amounts of data generated from diverse sources. Managing, processing, and extracting meaningful insights from this data is a significant challenge. The Data Lakehouse architecture has become the next evolution in overcoming these challenges, combining the best features of data lakes and data warehouses to deliver a unified platform for both analytical and transactional workloads.

This blog explores the integration of Apache Iceberg and Dremio within the Data Lakehouse framework to enhance data auditing and monitoring capabilities. We'll delve into how querying Iceberg metadata tables in Dremio can provide invaluable insights for table auditing, ensuring data integrity and facilitating compliance.

Overview of Data Lakehouse Architecture

The Data Lakehouse is an evolution of traditional data architectures, aiming to bridge the gap between the scalability of data lakes and the performance of data warehouses. It allows organizations to store all their structured and unstructured data in a single repository, enabling efficient data processing and analytics without the need for multiple systems.

Key advantages of the Data Lakehouse include:

Unified Storage: Centralizes data storage for diverse data types.
Scalability: Handles large volumes of data with ease.
Flexibility: Supports various data processing frameworks and query engines.
Cost Efficiency: Reduces the need for data duplication and multiple infrastructures.

The Role of Apache Iceberg and Dremio in Modern Data Lakehouses

Apache Iceberg is a data lakehouse table format designed for large analytic datasets in a data lake. It addresses the challenges of managing petabyte-scale tables by providing:

Schema Evolution: Allows changes to the table schema without disrupting read or write operations.
Partition Evolution: Supports changing partition strategies over time.
Time Travel: Enables querying historical versions of data for auditing and rollback purposes.
ACID Transactions: Ensures data consistency during concurrent read and write operations.

Dremio is a high-performance data lakehouse query engine that simplifies and accelerates analytics on data lake storage. It offers:

Seamless Integration with Apache Iceberg: Native read and write support for Iceberg tables for efficient querying of data lakehouses.
Advanced Query Optimization: Enhances performance through Apache Arrow based process, reflection-based acceleration and a variety of other query optimization techniques.
Self-Service Data Access: Empowers users to explore and analyze data without heavy reliance on IT teams. Beyond Iceberg data lakehouses, Dremio also enables self-service access to databases, data warehouses and data lakes all from one unified access point.
Scalability and Flexibility: Adapts to growing data volumes and diverse analytics workloads.

By integrating Apache Iceberg with Dremio, organizations can leverage the strengths of both technologies to build a robust and efficient data lakehouse environment.

Importance of Table Monitoring and Auditing

Maintaining data integrity, security, and compliance is paramount. Monitoring and auditing tables are critical components in achieving these objectives. Here's why they matter:

Data Consistency and Integrity: Ensures that data remains accurate and reliable across the organization.
Regulatory Compliance: Meets legal and industry standards for data governance, such as GDPR or HIPAA.
Operational Transparency: Provides visibility into data changes, facilitating troubleshooting and accountability.
Security: Detects unauthorized access or alterations to sensitive data.

Apache Iceberg's metadata tables offer rich insights into the state and history of data tables. By querying these metadata tables in Dremio, organizations can:

Track Data Lineage: Understand how data evolves over time.
Analyze Snapshot Histories: Review past versions of data for auditing and recovery.
Monitor Data Operations: Observe patterns in data modifications, additions, and deletions.
Enhance Troubleshooting: Quickly identify and resolve data issues.

In the sections that follow, we'll demonstrate how to set up an environment with Dremio, MinIO, and Nessie, create an Apache Iceberg table, and utilize Dremio to query the metadata tables for effective auditing and monitoring.

How Metadata Tables Assist in Monitoring and Auditing

By leveraging Apache Iceberg's metadata tables within Dremio, organizations can gain deep visibility into their data operations using Dremio's SQL syntax and capabilities.

Snapshot Management

Time Travel Queries: Dremio allows you to perform time travel queries on Iceberg tables using the AT SNAPSHOT or AT TIMESTAMP clauses. This feature enables you to query the state of your data at a specific point in time, which is essential for auditing and compliance.
- Querying by Snapshot ID:
  
  SELECT * FROM my_catalog.my_table AT SNAPSHOT '5393090506354317772';
- Querying by Timestamp:
  
  SELECT * FROM my_catalog.my_table AT TIMESTAMP '2022-01-01 17:30:50.000';
Change Tracking: You can access detailed information about snapshots and table history by querying Iceberg metadata tables using Dremio's TABLE function.
- Viewing Table History:
  
  SELECT * FROM TABLE(table_history('my_catalog.my_table'));
- Listing Snapshots:
  
  SELECT * FROM TABLE(table_snapshot('my_catalog.my_table'));
Audit Trails: By analyzing the table_history and table_snapshot metadata tables, you can build comprehensive audit trails of all data modifications, which is crucial for governance and regulatory compliance.

Data Recovery and Rollback Capabilities

Data Recovery: In the event of erroneous data modifications, you can restore the table to a previous state by using Dremio's ROLLBACK TABLE command. This command allows you to roll back an Iceberg table to a specific snapshot or timestamp.
- Rolling Back to a Snapshot ID:
  
  ROLLBACK TABLE my_catalog.my_table TO SNAPSHOT '5393090506354317772';
- Rolling Back to a Timestamp:
  
  ROLLBACK TABLE my_catalog.my_table TO TIMESTAMP '2022-01-01 17:30:50.000';
- Example Workflow:
  1. Identify the Snapshot: Query the table_snapshot metadata table to find the desired snapshot ID or timestamp to which you want to roll back.
    
    SELECT snapshot_id, committed_at, operation FROM TABLE(table_snapshot('my_catalog.my_table')) ORDER BY committed_at DESC;
  2. Perform the Rollback: Use the ROLLBACK TABLE command with the identified snapshot ID or timestamp.
    
    ROLLBACK TABLE my_catalog.my_table TO SNAPSHOT '5393090506354317772';
  3. Verify the Rollback: Confirm that the table has been rolled back by querying the current data or checking the latest snapshot.
    
    SELECT * FROM my_catalog.my_table;
Rollback Operations: The ROLLBACK TABLE command simplifies the process of reverting to a previous state without manually overwriting data or creating new tables. It ensures that the rollback is executed atomically and efficiently.

Enhanced Data Governance

Anomaly Detection: Analyze metadata tables to detect irregularities such as unexpected data additions or deletions.
- Monitoring Data Modifications:
  
  SELECT h.made_current_at, s.operation, s.summary['added-records'] AS records_added FROM TABLE(table_history('my_catalog.my_table')) AS h JOIN TABLE(table_snapshot('my_catalog.my_table')) AS s ON h.snapshot_id = s.snapshot_id ORDER BY h.made_current_at DESC;
  
  This query helps you monitor the number of records added in each operation, aiding in anomaly detection.
Performance Optimization: Use the table_partitions and table_files metadata tables to understand data distribution and optimize query performance.
- Viewing Partition Statistics:
  
  SELECT * FROM TABLE(table_partitions('my_catalog.my_table'));
- Analyzing File Sizes and Record Counts:
  
  SELECT file_path, file_size_in_bytes, record_count FROM TABLE(table_files('my_catalog.my_table'));

Using Dremio's SELECT Syntax for Metadata Queries

Dremio provides a straightforward SQL interface to query Apache Iceberg metadata tables. The syntax leverages the TABLE function within a SELECT statement.

Syntax Overview:

SELECT * FROM TABLE(<iceberg_metadata>('table_name'));

Replace <iceberg_metadata> with the specific metadata function and 'table_name' with the name of your Iceberg table enclosed in single quotes.
Available Metadata Tables:
- table_history: Retrieves the commit history of the table.
  
  SELECT * FROM TABLE(table_history('my_catalog.my_table'));
- table_snapshot: Provides details about each snapshot.
  
  SELECT * FROM TABLE(table_snapshot('my_catalog.my_table'));
- table_files: Lists the data files associated with the table.
  
  SELECT * FROM TABLE(table_files('my_catalog.my_table'));
- table_manifests: Displays information about manifest files.
  
  SELECT * FROM TABLE(table_manifests('my_catalog.my_table'));
- table_partitions: Shows statistics related to table partitions.
  
  SELECT * FROM TABLE(table_partitions('my_catalog.my_table'));
Example Queries:
- Joining Snapshots with History:
  
  SELECT h.made_current_at, s.operation, h.snapshot_id, s.summary['spark.app.id'] AS app_id FROM TABLE(table_history('my_catalog.my_table')) AS h JOIN TABLE(table_snapshot('my_catalog.my_table')) AS s ON h.snapshot_id = s.snapshot_id ORDER BY h.made_current_at;
  
  This query helps track changes over time by combining history and snapshot details, including application IDs.
- Analyzing Data File Changes:
  
  SELECT s.committed_at, s.operation, f.file_path, f.record_count FROM TABLE(table_snapshot('my_catalog.my_table')) AS s JOIN TABLE(table_files('my_catalog.my_table')) AS f ON s.snapshot_id = f.snapshot_id WHERE s.operation = 'append' ORDER BY s.committed_at DESC;
  
  This query identifies data files added in each append operation.
- Detecting Anomalies or Unexpected Changes:
  
  SELECT h.made_current_at, s.operation, COUNT(f.file_path) AS file_count FROM TABLE(table_history('my_catalog.my_table')) AS h JOIN TABLE(table_snapshot('my_catalog.my_table')) AS s ON h.snapshot_id = s.snapshot_id LEFT JOIN TABLE(table_files('my_catalog.my_table')) AS f ON s.snapshot_id = f.snapshot_id GROUP BY h.made_current_at, s.operation ORDER BY h.made_current_at;
  
  This query helps spot irregular patterns in data modifications by showing the number of files affected in each operation.

By utilizing Dremio's SQL syntax to query Apache Iceberg's metadata tables and employing the ROLLBACK TABLE command for data recovery, organizations can effectively monitor and audit their data lakehouse environment. This approach enhances data governance, ensures data integrity, and supports compliance with regulatory standards, all while maintaining high performance and scalability.

Setting Up the Environment

Prerequisites

Before we begin, ensure you have the following prerequisites:

Docker Installed: Docker is required to run the services in containers. If you haven't installed Docker, you can download it from the official Docker website.
Basic Understanding of Docker Compose: Familiarity with Docker Compose will help you navigate and modify the docker-compose.yml file as needed.

Setting Up Dremio, MinIO, and Nessie Using Docker Compose

To simplify the setup process, we'll use Docker Compose to orchestrate our services. The adjusted docker-compose.yml file includes only the necessary services for this tutorial: Dremio, MinIO, and Nessie. We've removed Spark, MongoDB, and PostgreSQL, as they are not required for our purposes. Additionally, MinIO has been configured without seeded data, so there's no need for volume mapping.

Adjusted `docker-compose.yml` File

Below is the adjusted docker-compose.yml file:

version: '3.8'

services:
  dremio:
    image: dremio/dremio-oss:latest
    container_name: dremio
    ports:
      - 9047:9047      # Dremio UI Port
      - 31010:31010    # ODBC/JDBC Clients
      - 32010:32010    # Arrow Flight Clients
      - 45678:45678    # Internal Communication
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - dremio-network

  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
      - MINIO_REGION_NAME=us-east-1
    ports:
      - "9000:9000"    # MinIO API Port
      - "9001:9001"    # MinIO Console Port
    command: server /data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    networks:
      - dremio-network

  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    environment:
      - QUARKUS_PROFILE=prod
      - QUARKUS_HTTP_PORT=19120
    ports:
      - "19120:19120"  # Nessie API Port
    networks:
      - dremio-network

networks:
  dremio-network:

Explanation of Each Service

1. Dremio

Image: dremio/dremio-oss:latest
Purpose: Dremio serves as the query engine for our data lakehouse, allowing us to interact with Apache Iceberg tables and metadata.
Ports:
- 9047: Dremio UI access.
- 31010: ODBC/JDBC client connections.
- 32010: Arrow Flight client connections.
- 45678: Internal communication port.
Environment Variables:
- DREMIO_JAVA_SERVER_EXTRA_OPTS: Configures Dremio to use the containers file system for storing reflections.

2. MinIO

Image: minio/minio
Purpose: Acts as S3-compatible object storage, serving as the data lake storage layer.
Ports:
- 9000: MinIO API access.
- 9001: MinIO web console access.
Environment Variables:
- MINIO_ROOT_USER: Sets the admin username (default is admin).
- MINIO_ROOT_PASSWORD: Sets the admin password (default is password).
- MINIO_DOMAIN: Domain name for MinIO service.
- MINIO_REGION_NAME: Specifies the region (default is us-east-1).
Command:
- server /data --console-address ":9001": Starts MinIO server with the console accessible at port 9001.

3. Nessie

Image: projectnessie/nessie:latest
Purpose: Provides Git-like version control for data lakes, enabling branching and tagging for our Iceberg tables.
Ports:
- 19120: Nessie API access.
Environment Variables:
- QUARKUS_PROFILE: Sets the profile to prod for production settings.
- QUARKUS_HTTP_PORT: Configures the HTTP port for Nessie.

Verifying the Environment Setup

After setting up the docker-compose.yml file, follow these steps to launch and verify the environment.

Step 1: Launch the Services

Open a terminal in the directory containing your docker-compose.yml file and run:

docker-compose up -d

This command starts all the services in detached mode.

Step 2: Verify Dremio UI Access

Access the Dremio UI:
- Open a web browser and navigate to http://localhost:9047.
Initial Setup:
- If prompted, create an admin user by providing a username and password.
Login:
- Use your credentials to log in to the Dremio UI.
Confirmation:
- You should see the Dremio home page, indicating that the service is running correctly.

Step 3: Confirm MinIO Operations

Access the MinIO Console:
- Open a web browser and navigate to http://localhost:9001.
Login Credentials:
- Username: admin (as set in the docker-compose.yml file)
- Password: password
Verification:
- Upon successful login, you should see the MinIO dashboard.
Create a Bucket:
- Click on + (Create Bucket) and create a new bucket named datalake.
Confirmation:
- The bucket should appear in the list, confirming that MinIO is operational.

Step 4: Verify Nessie Service

Nessie API Access:
- Nessie doesn't have a built-in UI, but you can interact with it via API or CLI.

Step 5: Connect Dremio to MinIO and Nessie

Connecting Nessie (with Minio Storage) to Dremio

Log in to Dremio: Open your browser and go to http://localhost:9047. Log in with your Dremio credentials.
Navigate to Sources:
- On the Dremio Web UI, click on the Sources tab in the left-hand sidebar.
Add Nessie as a Source:
- Click + Source in the upper-right corner.
- Select Nessie from the list of available source types.
Configure Nessie Source:
- Name: name this source MinIO
- Nessie REST API URL: Enter http://nessie:19120/api/v2 (the API URL exposed by the Nessie container, based on the container_name defined in the docker-compose.yml file).
- Authentication: Choose None
Configure Nessie Storage:
- set the warehouse address to the name of the bucket in MinIO (datalake)
- access key and secret key are the same as the ones used to access MinIO defined in the docker-compose.yml file. (admin/password)
- set the following custom parameters:
  - fs.s3a.path.style.access to true
  - fs.s3a.endpoint to the endpoint of MinIO (minio:9000)
  - dremio.s3.compat to true
Click Save: Once the configuration is set, click Save. Dremio will now be connected to Nessie, and you will be able to read and write versioned data using Iceberg tables managed by Nessie.

Step 6: Test the Integration

Run a Sample Query

SQL Editor:
- In Dremio, click on New Query.
Sample Query:
- Run the following query to create an example table:
  CREATE TABLE MinIO.test1 (col VARCHAR);
Verify Results:
- Should be able to see the table in Dremio's UI and see the initial metadata files in MinIO's UI

With the environment set up and verified, you're now ready to create an Apache Iceberg table and generate snapshots, which we'll cover in the next section.

Creating an Apache Iceberg Table and Generating Snapshots

With our environment set up and verified, we're ready to create an Apache Iceberg table in Dremio, perform some transactions, and generate snapshots. This process will allow us to explore how Dremio interacts with Iceberg tables and how we can utilize metadata tables for auditing purposes.

Connecting to Dremio

You can interact with Dremio using the Dremio UI or the Dremio SQL editor. For this tutorial, we'll use the Dremio UI's SQL editor.

Accessing the Dremio SQL Editor

Open Dremio UI: Navigate to http://localhost:9047 in your web browser.
Log In: Enter your Dremio credentials if prompted.
Navigate to SQL Editor:
- Click on New Query at the top-right corner.
- This will open the SQL editor where you can execute SQL commands.

Creating an Iceberg Table in Dremio

We'll create an Iceberg table in our MinIO data source using Dremio's CREATE TABLE statement. Ensure that you've connected Dremio to MinIO as outlined in the previous section.

Defining the Table Schema

We'll create a simple table named sales_data with the following schema:

id: INTEGER
item: VARCHAR
quantity: INTEGER
price: DOUBLE
sale_date: DATE

SQL Statement to Create the Table

CREATE TABLE MinIO.sales_data (
  id INTEGER,
  item VARCHAR,
  quantity INTEGER,
  price DOUBLE,
  sale_date DATE
)
PARTITION BY (sale_date)
TBLPROPERTIES (
  'format-version' = '2'
);

Explanation:

CREATE TABLE MinIO.datalake.sales_data: Specifies the location and name of the table.
Column Definitions: Lists the columns and their data types.
PARTITION BY (sale_date): Partitions the table by the sale_date column for optimized querying.
TBLPROPERTIES: Sets table properties. Here, we're specifying that we're using Iceberg format version 2 (not necessary, but done here to illustrate setting table properties at table creation).

Executing the Create Statement

Paste the SQL Statement: Copy the SQL statement into the Dremio SQL editor.
Run the Query: Click on the Run button or press Ctrl+Enter.
Verify Creation:
- If successful, you should see a message indicating that the table was created.
- In the Datasets pane, navigate to MinIO > sales_data to confirm the table exists.

Running Transactions to Generate Snapshots

We'll now perform a series of transactions (INSERT, UPDATE, DELETE) to generate snapshots of the table. Each transaction will create a new snapshot in the Iceberg table's history.

Inserting Data into the Table

First Insert: Initial Data Load

INSERT INTO MinIO.sales_data (id, item, quantity, price, sale_date)
VALUES
  (1, 'Laptop', 5, 999.99, DATE '2022-01-15'),
  (2, 'Smartphone', 10, 499.99, DATE '2022-01-16'),
  (3, 'Tablet', 7, 299.99, DATE '2022-01-17');

Execution Steps:

Paste and Run: Copy the SQL statement into the SQL editor and execute it.
Verify Data: Run a SELECT query to ensure the data has been inserted.

SELECT * FROM MinIO.sales_data;

Second Insert: Adding More Records

INSERT INTO MinIO.sales_data (id, item, quantity, price, sale_date)
VALUES
  (4, 'Headphones', 15, 79.99, DATE '2022-01-18'),
  (5, 'Smartwatch', 12, 199.99, DATE '2022-01-19');

Execution Steps:

Run the Insert Statement.
Verify Data:

SELECT * FROM MinIO.sales_data;

Updating Records

Let's update the quantity of one of the items.

UPDATE MinIO.sales_data
SET quantity = 8
WHERE id = 2;

Execution Steps:

Run the Update Statement.
Verify Update:

SELECT * FROM MinIO.sales_data WHERE id = 2;

Deleting Records

We'll delete a record to generate another snapshot.

DELETE FROM MinIO.sales_data
WHERE id = 3;

Execution Steps:

Run the Delete Statement.
Verify Deletion:

SELECT * FROM MinIO.sales_data;

Each Transaction Creates a Snapshot

With each INSERT, UPDATE, or DELETE operation, Iceberg creates a new snapshot of the table. This allows us to time travel and audit changes over time.

Viewing Snapshots in Apache Iceberg

Now that we've generated multiple snapshots, we can query the metadata tables to view the snapshots and their details.

Understanding Snapshot IDs

Snapshot IDs are unique identifiers for each state of the table after a transaction. We can retrieve them by querying the table_snapshot metadata table.

Querying the Snapshots Table

SELECT *
FROM TABLE(table_snapshot('MinIO.sales_data'));

Result Interpretation:

committed_at: Timestamp when the snapshot was created.
snapshot_id: Unique ID of the snapshot.
parent_id: Snapshot ID of the parent snapshot.
operation: Type of operation (e.g., append, delete, replace).
summary: Additional details about the snapshot.

Using Time Travel Features

We can query the table as it existed at a specific snapshot or point in time.

Querying a Previous Snapshot

Identify a Snapshot ID: From the snapshots table, choose a snapshot ID corresponding to an earlier state.Example Snapshot ID: '1234567890123456789'
Time Travel Query:

SELECT * FROM MinIO.sales_data AT SNAPSHOT '1234567890123456789';

This query returns the data as it existed at the specified snapshot.

Querying by Timestamp

Choose a Timestamp: Use a timestamp from the committed_at column in the snapshots table.Example Timestamp: '2022-01-16 12:00:00'
Time Travel Query:

SELECT * FROM MinIO.sales_data AT TIMESTAMP '2022-01-16 12:00:00';

This query returns the table's state as of the specified timestamp.

Rolling Back the Table (Optional)

If you need to roll back the table to a previous state, use the ROLLBACK TABLE command.

ROLLBACK TABLE MinIO.sales_data
TO SNAPSHOT '1234567890123456789';

Note: Rolling back modifies the current state of the table to match the specified snapshot.

By performing these transactions and utilizing time travel queries, we've generated multiple snapshots of our Iceberg table. In the next section, we'll delve deeper into querying the Apache Iceberg metadata tables in Dremio to extract valuable insights for auditing and monitoring purposes.

Querying the Apache Iceberg Metadata Tables in Dremio

Now that we've created an Apache Iceberg table and generated multiple snapshots through various transactions, we can explore how to query the metadata tables in Dremio to gain insights into the table's history, structure, and data files. These metadata tables are instrumental in auditing, monitoring, and managing your data lakehouse effectively.

Overview of Iceberg Metadata Tables

Apache Iceberg provides several metadata tables that contain valuable information about the state and history of your data. Dremio allows you to query these metadata tables using standard SQL syntax, providing seamless access to this critical information.

The key Iceberg metadata tables are:

History Table: Contains the commit history of the table, showing when each snapshot was made.
Snapshots Table: Provides detailed information about each snapshot, including the operation performed.
Manifests Table: Lists manifest files, which contain metadata about data files.
Partitions Table: Shows information about the table's partitions.
Files Table: Contains metadata about individual data files in the table.

Querying Metadata Tables One by One

Let's dive into each metadata table, understand what information it provides, and see how to query it using Dremio.

1. History Table

The History table records the commit history of the Iceberg table, showing when snapshots were created and the relationships between them.

Querying the History Table

SELECT *
FROM TABLE(table_history('MinIO.sales_data'));

Interpreting the Results

The result set includes the following columns:

made_current_at: The timestamp when the snapshot became the current state of the table.
snapshot_id: The unique identifier of the snapshot.
parent_id: The snapshot ID of the parent snapshot (if any).
is_current_ancestor: Indicates whether the snapshot is an ancestor of the current table state.

Example Output:

made_current_at	snapshot_id	parent_id	is_current_ancestor
2022-01-15 12:00:00	1234567890123456789	NULL	true
2022-01-16 13:00:00	2234567890123456789	1234567890123456789	true
2022-01-17 14:00:00	3234567890123456789	2234567890123456789	true
2022-01-18 15:00:00	4234567890123456789	3234567890123456789	true
2022-01-19 16:00:00	5234567890123456789	4234567890123456789	true

Usage

By examining the history table, you can trace the evolution of your table over time, understand when changes were made, and identify the snapshots corresponding to significant events.

2. Snapshots Table

The Snapshots table provides detailed information about each snapshot, including the operation performed (e.g., append, delete), and a summary of the changes.

Querying the Snapshots Table

SELECT *
FROM TABLE(table_snapshot('MinIO.sales_data'));

Interpreting the Results

The result set includes the following columns:

committed_at: Timestamp when the snapshot was committed.
snapshot_id: Unique ID of the snapshot.
parent_id: Parent snapshot ID.
operation: The type of operation (e.g., append, overwrite, delete).
summary: A map containing additional information about the snapshot.

Example Output:

committed_at	snapshot_id	parent_id	operation	summary
2022-01-15 12:00:00	1234567890123456789	NULL	append	{added-records=3, total-records=3}
2022-01-16 13:00:00	2234567890123456789	1234567890123456789	append	{added-records=2, total-records=5}
2022-01-17 14:00:00	3234567890123456789	2234567890123456789	overwrite	{deleted-records=0, added-records=0}
2022-01-18 15:00:00	4234567890123456789	3234567890123456789	delete	{deleted-records=1, total-records=4}

Usage

The snapshots table is essential for:

Identifying the nature of each change to the table.
Understanding how data was added, modified, or removed.
Retrieving snapshot IDs for time travel queries or rollbacks.

3. Manifests Table

The Manifests table lists manifest files, which are files that contain metadata about data files (e.g., paths, partition data).

Querying the Manifests Table

SELECT *
FROM TABLE(table_manifests('MinIO.sales_data'));

Interpreting the Results

The result set includes columns like:

path: File path of the manifest file.
length: Size of the manifest file.
partition_spec_id: ID of the partition specification.
added_snapshot_id: Snapshot ID when the manifest was added.
added_data_files_count: Number of data files added.
existing_data_files_count: Number of existing data files.
deleted_data_files_count: Number of data files deleted.
partition_summaries: Summary of partitions included.

Usage

The manifests table helps you:

Examine how data files are organized.
Identify which snapshots introduced or removed data files.
Analyze partitioning information for optimization.

4. Partitions Table

The Partitions table provides information about the partitions in the table, including record counts and file counts per partition.

Querying the Partitions Table

SELECT *
FROM TABLE(table_partitions('MinIO.sales_data'));

Interpreting the Results

The result set includes:

partition: Partition key values.
spec_id: Partition specification ID.
record_count: Number of records in the partition.
file_count: Number of data files in the partition.

Example Output:

partition	record_count	file_count
{2022-01-15}	3	1
{2022-01-16}	2	1
{2022-01-17}	1	1
{2022-01-18}	1	1

Usage

Use the partitions table to:

Analyze data distribution across partitions.
Optimize query performance by understanding partition sizes.
Monitor partition-level data growth.

5. Files Table

The Files table contains detailed metadata about each data file in the table, such as file paths, sizes, and record counts.

Querying the Files Table

SELECT *
FROM TABLE(table_files('MinIO.sales_data'));

Interpreting the Results

Key columns include:

file_path: Path to the data file.
file_format: Format of the file (e.g., Parquet).
partition: Partition key values.
record_count: Number of records in the file.
file_size_in_bytes: Size of the file.
column_sizes: Sizes of individual columns.
value_counts, null_value_counts: Counts of values and nulls per column.

Usage

The files table allows you to:

Assess storage utilization.
Identify large or skewed files.
Plan maintenance tasks like compaction.

Handling Complex Data Types

Some metadata columns contain complex data types like maps or lists (e.g., summary, column_sizes). Dremio allows you to extract specific values using appropriate SQL functions.

Example: Accessing Snapshot Summary Details

SELECT
  committed_at,
  snapshot_id,
  operation,
  summary[0].key as key1,
  summary[0]."value" as val1
FROM TABLE(table_snapshot('MinIO.sales_data'));

Practical Use Cases

By querying the metadata tables, you can:

Audit Data Changes: Track how and when data was modified.
Monitor Data Growth: Observe the increase in data files and sizes over time.
Optimize Performance: Identify partitioning issues or large files that may affect query speed.
Ensure Compliance: Provide detailed records of data history for regulatory purposes.

Summary

Using the Apache Iceberg metadata tables in Dremio provide powerful tools for auditing and managing your data lakehouse. By understanding how to query and interpret these tables, you can gain valuable insights into your data's history, structure, and storage, enabling better decision-making and more efficient data operations.

Conclusion

In this blog, we've journeyed through the powerful capabilities unlocked by leveraging Apache Iceberg metadata tables within Dremio for effective data lakehouse auditing. Let's summarize the key benefits:

Enhanced Data Governance

Comprehensive Auditing: By accessing and querying metadata tables, organizations can maintain detailed audit trails of all data modifications, including inserts, updates, and deletes.
Data Lineage Tracking: Metadata tables provide insights into the evolution of data over time, allowing for accurate tracking of data lineage and ensuring transparency.
Regulatory Compliance: Detailed metadata supports adherence to regulatory requirements such as GDPR and HIPAA by providing necessary documentation and facilitating data retention policies.

Improved Operational Efficiency

Optimized Query Performance: Understanding data distribution and file sizes through metadata tables enables optimization strategies like partitioning and file compaction.
Anomaly Detection: Early identification of unusual patterns or unexpected changes helps in proactive troubleshooting and maintaining data integrity.
Simplified Data Recovery: The ability to perform time travel queries and rollbacks to previous snapshots reduces downtime and mitigates the impact of erroneous data modifications.

NOTE: You can shut off the docker compose environment with the

command docker compose down -v

Get started with the Dremio Platform today or Schedule a meeting to do a review of your data architecture!

Article Topics

Dremio Blog: Open Data Insights Product Insights from the Dremio Blog

Table of Contents

Overview of Data Lakehouse Architecture

The Role of Apache Iceberg and Dremio in Modern Data Lakehouses

Importance of Table Monitoring and Auditing

How Metadata Tables Assist in Monitoring and Auditing

Snapshot Management

Data Recovery and Rollback Capabilities

Enhanced Data Governance

Using Dremio's SELECT Syntax for Metadata Queries

Setting Up the Environment

Prerequisites

Setting Up Dremio, MinIO, and Nessie Using Docker Compose

Adjusted docker-compose.yml File

Explanation of Each Service

1. Dremio

2. MinIO

3. Nessie

Verifying the Environment Setup

Step 1: Launch the Services

Step 2: Verify Dremio UI Access

Step 3: Confirm MinIO Operations

Step 4: Verify Nessie Service

Step 5: Connect Dremio to MinIO and Nessie

Connecting Nessie (with Minio Storage) to Dremio

Step 6: Test the Integration

Run a Sample Query

Creating an Apache Iceberg Table and Generating Snapshots

Connecting to Dremio

Accessing the Dremio SQL Editor

Creating an Iceberg Table in Dremio

Defining the Table Schema

SQL Statement to Create the Table

Executing the Create Statement

Running Transactions to Generate Snapshots

Inserting Data into the Table

First Insert: Initial Data Load

Second Insert: Adding More Records

Updating Records

Deleting Records

Each Transaction Creates a Snapshot

Viewing Snapshots in Apache Iceberg

Understanding Snapshot IDs

Querying the Snapshots Table

Using Time Travel Features

Querying a Previous Snapshot

Querying by Timestamp

Rolling Back the Table (Optional)

Querying the Apache Iceberg Metadata Tables in Dremio

Overview of Iceberg Metadata Tables

Querying Metadata Tables One by One

1. History Table

Querying the History Table

Interpreting the Results

Usage

2. Snapshots Table

Querying the Snapshots Table

Interpreting the Results

Usage

3. Manifests Table

Querying the Manifests Table

Interpreting the Results

Usage

4. Partitions Table

Querying the Partitions Table

Interpreting the Results

Usage

5. Files Table

Querying the Files Table

Interpreting the Results

Usage

Handling Complex Data Types

Example: Accessing Snapshot Summary Details

Practical Use Cases

Summary

Conclusion

Enhanced Data Governance

Improved Operational Efficiency

Unlock the Full Potential of the: Power Your AI Initiatives with Trusted Data

Additional Resources

The Why and How of Using Apache Iceberg on Databricks

Adjusted `docker-compose.yml` File