Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg
Apache Iceberg and Dremio have emerged as significant players in the data lakehouse space, offering robust data management and analytics solutions. Apache Iceberg, an open source table format, provides a high-performance platform for large-scale data analytics, enabling better data management through hidden partitioning, schema evolution, and efficient data file management. Likewise, Dremio, a cutting-edge data lakehouse platform, revolutionizes how we approach big data by offering a powerful SQL query engine for working with your data lakehouse and creating a unified access layer across your other data sources.
Combining these two technologies can lead to a streamlined, efficient, and more manageable data platform. Dremio stands out for its ease of use and minimal configuration, especially compared to traditional methods like Apache Spark. This simplicity doesn't come at the cost of functionality; Dremio offers automated table optimization and cleanup, ensuring that your lakehouse tables are easy to create and populate while having their storage footprint and performance regularly optimized.
The SQL-first approach of Dremio aligns with the contemporary shift toward query-based data handling. It empowers users to interact with their data lakehouses using familiar SQL queries, making the technology accessible to a broader range of professionals, including those who may not have deep technical expertise in data engineering.
Additionally, Dremio's compatibility with various orchestration tools, such as dbt (data build tool), enhances its appeal. This integration allows for seamless workflow management, ensuring that data ingestion and processing are both scalable and maintainable.
In this article, we'll delve into leveraging Dremio for data ingestion into Apache Iceberg tables. We'll explore the use of commands like CREATE TABLE AS, INSERT INTO SELECT, MERGE INTO, and COPY INTO, showcasing the simplicity and efficiency of Dremio in managing big data tasks. Whether you're a data engineer, a business analyst, or a data scientist, understanding how to harness the power of Dremio in conjunction with Apache Iceberg will be a valuable addition to your data management toolkit.
Extracting Data: Connecting Data Sources to Dremio for Iceberg Table Ingestion
After you set up your free Dremio Cloud account, the first step in leveraging Dremio for data ingestion into Apache Iceberg tables is to connect with your diverse data sources. Dremio's versatility shines here, as it supports a wide range of data sources, enabling a seamless and integrated data extraction process.
Dremio Cloud supports a variety of data sources, ensuring that the nature or location of your data does not limit your data ingestion process:
Metastores: Dremio supports connecting to metastores like Dremio Arctic (the integrated git-like data catalog that is part of Dremio’s suite of data lakehouse management features) and AWS Glue. These metastores are crucial in metadata management and essential for efficient data handling and querying.
Data lakes: Dremio offers the ability to connect to data lakes on popular cloud platforms like AWS and Azure. Working with JSON, XLS, CSV, ORC, and Parquet files on your data lake simplifies the data ingestion process from these sources into Apache Iceberg tables.
Databases and data warehouses: Dremio supports connections to various databases and data warehouses, including Postgres, MySQL, Snowflake, and Redshift. This wide-ranging support is particularly advantageous for organizations that rely on multiple databases or data warehouses, as it provides a unified platform for data curation.
Dremio-to-Dremio connector: For data sources not directly supported by Dremio Cloud, Dremio offers the Dremio-to-Dremio connector. This feature allows users to connect to a Dremio software cluster, which in turn can access additional data sources. This includes sources like MongoDB, Hive, Hadoop, and GCP, among others.
Key Benefits of Dremio's Data Source Connectivity
Comprehensive coverage: With support for a broad spectrum of data sources, Dremio ensures that almost any data environment can be integrated into your data ingestion pipeline.
Simplified data access: The ability to connect to various data sources through a single platform simplifies the data extraction process, reducing the need for multiple tools or complex integrations.
Flexibility and scalability: The Dremio-to-Dremio connector offers a flexible solution for accessing data sources not directly supported by Dremio Cloud, ensuring scalability as your data sources evolve.
Unified data ingestion platform: By serving as a central hub for data extraction from numerous sources, Dremio streamlines the data ingestion process, making it more efficient and manageable.
CREATE TABLE
The CREATE TABLE statement in Dremio allows for creating Apache Iceberg tables. Let’s delve into how to use the CREATE TABLE AS statement effectively to move data from our other data sources into an Apache Iceberg table, along with best practices to consider.
Using the CREATE TABLE AS Statement
The CREATE TABLE AS statement allows you to create an Iceberg table by selecting data from any connected data source in Dremio. This process is straightforward and efficient, making it an ideal approach for data ingestion into Iceberg tables. Here’s a basic syntax to understand its structure:
CREATE TABLE [ IF NOT EXISTS ] <table_name> AS <select_statement>
This command creates a new table named <table_name> using the data returned by the <select_statement>.
Best Practices for Creating Iceberg Tables
Namespace management: Ensure that the namespace exists where you want to create the table.
Unique table names: The table name must be unique within its path. It’s essential to avoid naming conflicts, especially when working with multiple data sources.
Partitioning strategy: Use the PARTITION BY clause wisely. Partitioning helps optimize query performance, but be mindful of the number of distinct values in the partition column. Over-partitioning or partitioning contrary to query patterns can lead to performance issues. One possible strategy is to avoid partitioning the raw data and allow the reflection recommender to recommend reflections partitioned based on actual query patterns observed by your Dremio cluster.
Local sorting: Consider using LOCALSORT BY for range filter queries. Sorting Parquet file fragments by a specific column can significantly speed up these queries.
Data type considerations: When defining columns (<column_spec>), ensure that data types are appropriately assigned, especially when dealing with complex data structures.
Partition transformations: Utilize partition transformations like year, month, day, hour, bucket, and truncate to suit your data analysis needs. These transformations help in organizing data efficiently within Iceberg tables.
Example Scenarios
Creating a simple table:
CREATE TABLE arctic.sales AS SELECT * FROM postgres.sales;
This creates an Iceberg table named arctic.sales (a new Iceberg table tracked by our Arctic catalog) using data from postgres.sales (a table in a Postgres source).
Partitioned and sorted table:
CREATE TABLE arctic.voters.bronze.voter_data PARTITION BY (state) LOCALSORT BY (city) AS SELECT * FROM mysql.voter_data;
This example demonstrates creating a table with partitioning by state and local sorting by city, optimizing query performance for these fields.
Using partition transformations:
CREATE TABLE arctic.employees (id INT, first_name VARCHAR, last_name VARCHAR, hire_date DATE) PARTITION BY (month(hire_date));
In this case, the employees table is partitioned by the month of hire_date, which is a date column.
INSERT INTO SELECT
After creating Apache Iceberg tables in Dremio, the next crucial step is populating them with data. The INSERT INTO SELECT statement effectively achieves this, allowing for the seamless transfer of data from various source tables into your Iceberg tables. This section will explore how to use this command in Dremio.
Understanding the INSERT INTO SELECT Syntax
The INSERT INTO SELECT statement in Dremio enables you to insert data into a specified table from the result set of a SELECT query. Here's the basic syntax:
INSERT INTO <table_name> [ ( <column_name1>, <column_name2>, ... ) ] <select_statement>
In this command, <table_name> is the name of the destination table where data will be inserted. The <select_statement> is a query that produces the data to be inserted. Optionally, specify the column names into which the data will be inserted.
Best Practices and Considerations
Schema matching: Ensure that the source data schema matches that of the destination table.
Column order: If you specify column names, ensure they are in the same order as in the destination table.
Reference and branch specification: Optionally, you can specify a particular reference or branch of the table you are inserting into using the AT { REFERENCE | BRANCH } <reference_name> syntax. (Branching features are only available for tables in integrated Dremio catalogs on Dremio Cloud or in a Nessie catalog on Dremio Software.)
Example Scenarios
Insert data from a PostgreSQL table:
INSERT INTO arctic.sales_data SELECT * FROM postgres.sales_info;
This example inserts data into an Iceberg table arctic.sales_data from a PostgreSQL table postgres.sales_info.
Insert specific columns:
INSERT INTO arctic.customer_data (id, name, email) SELECT customer_id, customer_name, customer_email FROM mysql.customer_info;
Here, specific columns (id, name, email) are selected from a MySQL table mysql.customer_info and inserted into the corresponding columns of the Iceberg table arctic.customer_data.
Inserting into a specific branch:
INSERT INTO arctic.product_catalog AT BRANCH main SELECT * FROM postgres.product_list;
This command inserts data into the main branch of the arctic.product_catalog table from postgres.product_list.
MERGE INTO
The MERGE INTO statement in Dremio is an essential tool for performing insert and update operations on Apache Iceberg tables. This functionality is particularly useful for synchronizing data in a target table with changes from a source table. In this section, we'll explore the effective use of the MERGE INTO statement in Dremio and highlight best practices.
Understanding the MERGE INTO Syntax
The MERGE INTO statement enables conditional updates or inserts based on the presence of matching rows between a target and a source table. Here's the basic structure of the command:
MERGE INTO <target_table_name> USING { <source_table_name> | ( <select_statement> ) } ON <condition> [ WHEN MATCHED THEN <update_clause> ] [ WHEN NOT MATCHED THEN <insert_clause> ]
In this syntax, <target_table_name> is the Iceberg table you intend to update, while <source_table_name> or <select_statement> provides the source data. The <condition> specifies how rows from the source table are matched with the target table rows.
Best Practices for MERGE INTO
Schema compatibility: Ensure the source and target tables have compatible schemas, especially when using commands like UPDATE SET * or INSERT *.
Explicit column specification: To avoid errors and for clarity, explicitly specify the columns for updating or inserting, particularly when the source and target table schemas are not identical.
Reference and branch specification: You can specify a reference or branch for the target table using the AT { REFERENCE | BRANCH } syntax, similar to the INSERT INTO SELECT statement.
Optimized join conditions: Use efficient conditions in the ON clause to enhance the performance of the merge operation. This usually involves a primary key or a unique identifier.
Example Scenarios
Basic merge operation:
MERGE INTO arctic.target_table AS t USING postgres.source_table AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET description = s.description WHEN NOT MATCHED THEN INSERT (id, description) VALUES (s.id, s.description);
In this example, if a matching id is found in postgres.source_table, the description in arctic.target_table is updated. If there's no match, new rows are inserted.
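Merge with wildcard update and insert: a hedged sketch that assumes the source and target tables share identical column names and types, since that is what UPDATE SET * and INSERT * require.
MERGE INTO arctic.target_table AS t USING postgres.source_table AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *;
If the schemas differ, fall back to explicit column lists as in the previous example.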
COPY INTO
The COPY INTO command in Dremio provides a straightforward method for copying data from files and directories into existing Apache Iceberg tables. This capability is handy for integrating data stored in formats like CSV, JSON, or Parquet into your analytical workflows. Assuming an S3 source named "s3," let's explore how to use COPY INTO effectively and the best practices for optimal results.
Understanding the COPY INTO Syntax
The basic syntax of the COPY INTO command is as follows:
COPY INTO <table_name> FROM '@<storage_location_name>[/<path>[/<file_name>] ]' [ FILE_FORMAT { 'csv' | 'json' | 'parquet' } ] [ <format_options> ]
This command allows you to specify the target Iceberg table (<table_name>), the source file location in S3 (@s3/<path>/<file_name>), and various format options depending on the file type.
Best Practices for Using COPY INTO
File location clarity: Ensure the file path is correctly specified. If you're pointing to a directory, Dremio will load all files, including those in subdirectories.
File format specification: Specify the file format (CSV, JSON, Parquet) if the files don't have extensions. This helps Dremio correctly interpret the file contents.
Optimizing file selection: If you're dealing with many files, use the REGEX clause to load files based on naming patterns selectively. Avoid overly broad patterns that might include many files, impacting performance.
Handling CSV and JSON formats: Utilize the specific format options for CSV and JSON files, such as DATE_FORMAT, NULL_IF, and TRIM_SPACE, to ensure data is correctly interpreted and loaded (see the final example below).
Schema matching: Verify that there is at least one matching column between your data files and the target Iceberg table. Dremio will ignore non-matching columns in the data files and insert NULL values for non-matching columns in the target table.
Example Scenarios
Copying all files in a folder:
COPY INTO arctic.sales_data FROM '@s3/bucket-name/sales-data-folder/' FILE_FORMAT 'json';
This command copies all JSON files from the specified S3 folder into the arctic.sales_data Iceberg table.
Copying specific files:
COPY INTO arctic.customer_data FROM '@s3/bucket-name/customer-data-folder/' FILES ('customers.csv', 'additional_customers.csv') FILE_FORMAT 'csv' (FIELD_DELIMITER ',');
Here, specific CSV files are selected for loading, with a custom field delimiter.
Using regular expressions for file selection:
COPY INTO arctic.product_catalog FROM '@s3/bucket-name/product-data/' REGEX '.*2024.*\\.parquet' FILE_FORMAT 'parquet';
This example uses a REGEX pattern to load only Parquet files related to the year 2024 into the arctic.product_catalog table.
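Applying CSV format options:
COPY INTO arctic.sales_data FROM '@s3/bucket-name/daily-exports/' FILE_FORMAT 'csv' (FIELD_DELIMITER ',', DATE_FORMAT 'YYYY-MM-DD', NULL_IF ('NA', ''), TRIM_SPACE 'true');
This is a hedged sketch built from the format options listed in the best practices above; the file path is illustrative and the exact option value syntax may vary by Dremio version. It loads CSV files while parsing dates in the YYYY-MM-DD format, treating 'NA' and empty strings as NULL, and trimming leading and trailing whitespace from field values.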
Data Lakehouse Management
Dremio offers a suite of features for managing an Apache Iceberg lakehouse. These include the OPTIMIZE and VACUUM commands, automated table management, and integrated Dremio Iceberg catalogs, which provide enhanced management capabilities and a user-friendly interface.
OPTIMIZE and VACUUM Commands
The OPTIMIZE command in Dremio is used to rewrite data and manifest files to an optimal size. This command combines small files or splits large files to improve query performance and storage efficiency. It's particularly useful for tables that have undergone many small updates or inserts.
OPTIMIZE TABLE <table_name> [ REWRITE MANIFESTS ] [ REWRITE DATA [ USING BIN_PACK ] [ FOR PARTITIONS <predicate> ] [ ( { TARGET_FILE_SIZE_MB | MIN_FILE_SIZE_MB | MAX_FILE_SIZE_MB | MIN_INPUT_FILES } = <value> [, ... ] ) ] ]
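For instance, a hedged example of compacting a table (the table name and target file size are illustrative, following the syntax above) might look like:
OPTIMIZE TABLE arctic.sales_data REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB = 256);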
The VACUUM command cleans up old snapshots and associated files from Apache Iceberg tables. This command is crucial for managing the storage efficiently and ensuring that the data lakehouse does not retain unnecessary data.
VACUUM TABLE <table_name> EXPIRE SNAPSHOTS [older_than <value>] [retain_last <value>]
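As a hedged illustration (the table name, timestamp format, and retention count are assumptions), expiring snapshots older than a given date while keeping the five most recent could look like:
VACUUM TABLE arctic.sales_data EXPIRE SNAPSHOTS older_than '2024-01-01 00:00:00.000' retain_last 5;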
Best Practices:
Regularly use OPTIMIZE to maintain efficient table file sizes and improve query performance.
Use VACUUM to remove outdated snapshots and free up storage space, particularly in environments with frequent data updates.
Automated Table Management
Dremio's automated table management features work seamlessly with any Iceberg table in an Arctic Catalog. These features include:
Table optimization: Automatically compact small files into larger ones to optimize query performance.
Table cleanup: Delete expired snapshots and orphaned metadata files using the VACUUM CATALOG command.
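For example, cleaning up every Iceberg table in a catalog named arctic (the catalog name is illustrative) could be as simple as:
VACUUM CATALOG arctic;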
Best Practices:
Schedule regular optimization and cleanup tasks to maintain performance and manage storage efficiently.
Monitor and adjust the frequency of these tasks based on your data update and access patterns.
Dremio's Data Lakehouse Catalogs
Dremio's data lakehouse management capabilities include an integrated catalog offering Nessie-powered, git-like features, providing a sophisticated way to manage your data lakehouse. These catalogs facilitate:
Version control: Manage commits, branches, and tags, allowing for rollbacks, experimental changes, and more controlled updates to your data (see the sketch after this list).
Intuitive UI: An easy-to-use interface for managing data lake structures, providing visibility and control over the data.
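As a brief, hedged sketch of these git-like features (the branch and catalog names are illustrative, and exact syntax may vary by Dremio version), you might isolate an ETL experiment on its own branch and merge it back into main once validated:
CREATE BRANCH etl_experiment IN arctic;
MERGE BRANCH etl_experiment INTO main IN arctic;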
Best Practices:
Leverage Arctic Catalogs for complex data lake environments where version control is critical.
Utilize the intuitive UI to manage and monitor data changes effectively.
Conclusion
Apache Iceberg's open source table format provides a solid foundation for large-scale data analytics, offering advanced features like hidden partitioning, schema evolution, and efficient file management. In tandem, Dremio emerges as a cutting-edge platform, facilitating a seamless, SQL-first approach to interacting with data lakehouses. Its integration with Apache Iceberg represents a significant stride in making big data more accessible and manageable.
The journey through Dremio's capabilities, from data ingestion to sophisticated data lakehouse management, highlights its ease of use, flexibility, and comprehensive coverage. The ability to connect to a wide array of data sources, including metastores like Dremio Arctic and AWS Glue, databases, and data warehouses, underscores Dremio's role as a unified data access layer. The Dremio-to-Dremio connector further extends this capability, bridging gaps when necessary.
Dremio's prowess in simplifying data management tasks is exemplified through its intuitive handling of Apache Iceberg tables. Commands like CREATE TABLE AS, INSERT INTO SELECT, MERGE INTO, and COPY INTO not only streamline data operations but also embody the platform's commitment to efficiency and ease of use. These features, coupled with Dremio's automated table optimization and cleanup, ensure that managing a data lakehouse is no longer a daunting task but an accessible and scalable endeavor.
Moreover, Dremio's integrated catalog, with its Nessie-powered git-like features, brings unprecedented control and flexibility. This modern approach to data versioning and management empowers organizations to handle their data with the same finesse as their code, enabling more controlled and experimental changes to data.
In summary, combining Apache Iceberg and Dremio offers a revolutionary approach to data lakehouse management. By unifying data from diverse sources, simplifying data operations, and providing powerful tools for data management, Dremio stands out as a comprehensive solution for modern data needs. Whether you are a data engineer, business analyst, or data scientist, harnessing the combined power of Dremio and Apache Iceberg will undoubtedly be a valuable asset in your data management toolkit.