9 minute read · December 15, 2023

Loading Data Into Apache Iceberg Just Got Easier With Dremio 24.3 and Dremio Cloud

Manoj Raheja · Principal Product Manager, Dremio

Jeremiah Morrow · Product Marketing Director, Iceberg Lakehouse Management, Dremio

Additional features in Dremio 24.3 provide customers with a unified path for loading all of their data into Iceberg

Apache Iceberg is quickly gaining traction as the open table format of choice for enterprise organizations. Our survey of the State of the Data Lakehouse shows more companies planning to adopt Iceberg than any other table format over the next 3 years. Iceberg gives data teams high performance on large datasets, data warehouse functionality directly on the data lake, and dramatically simplifies data engineering.

Now, with the release of Dremio 24.3 and updates in Dremio Cloud, we’ve made it even easier to get your data into Iceberg tables and take advantage of its performance and functionality.

New in Dremio 24.3 and Dremio Cloud

Parquet file format support for COPY INTO

COPY INTO is among several options for loading data into Iceberg tables, and is especially good for bulk loads. Previously, Dremio supported COPY INTO for .json and .csv files. Now, Dremio also supports COPY INTO for Apache Parquet files. Many customers have already standardized on Parquet, an open file format that delivers superior performance and compression over other file formats. Now, they can build on those benefits by easily moving from Parquet to Iceberg.
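For example, a minimal COPY INTO statement for Parquet files looks like the following sketch, where sales.customers and the @s3_source storage location are placeholders for your own table and source:

COPY INTO sales.customers
FROM '@s3_source/landing/customers/'
FILE_FORMAT 'parquet'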

CSV File Format Enhancements for COPY INTO

In addition to support for Parquet, Dremio has added some quality of life enhancements for customers using COPY INTO to get CSV files into Iceberg. COPY INTO now supports additional formatting options for handling CSV file header rows and for skipping lines.
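As a sketch, the load below treats the first row of each CSV file as a header and skips one additional line. The option names shown (EXTRACT_HEADER, SKIP_LINES) are our reading of the feature described above, so check the Dremio documentation for the exact options available in your version:

COPY INTO sales.customers
FROM '@s3_source/landing/customers_csv/'
FILE_FORMAT 'csv'
(EXTRACT_HEADER 'true', SKIP_LINES 1)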

Better Error Handling with COPY INTO

Continue on Error is a new feature that ensures jobs loading multiple files into Iceberg won't fail just because one file fails to load. Previously, the entire job would fail, and it was especially difficult to identify the problem file. Now, the job will complete, and Dremio will clearly identify the issue, so users spend less time troubleshooting ingestion workloads.
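Assuming the feature is exposed as an ON_ERROR option on COPY INTO, a minimal sketch looks like this:

COPY INTO sales.customers
FROM '@s3_source/landing/customers_csv/'
FILE_FORMAT 'csv'
(ON_ERROR 'continue')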

Support for Git for Data

COPY INTO now supports ingesting your data into a specific reference or branch in the Nessie/Arctic catalog, enabling users to validate and process the data before merging the changes into the main branch, all without creating multiple copies.

Together, these capabilities give customers more ways to get their data into Iceberg, and experience the performance and TCO benefits of the open table format. Now, we'll share how you can get started with Iceberg using these features in Dremio.

Loading Data Into Iceberg in Dremio

Step 1: Create Your Table

Use the CREATE TABLE statement to create an Iceberg table. Optionally, you can add PARTITION BY to the end of your CREATE TABLE statement to add hidden partitioning that makes queries on your table faster. For example, a table of customer data could be partitioned by last name. Full syntax and examples for CREATE TABLE are available in the Dremio documentation.
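For instance, a hypothetical customers table partitioned by last name could be created like this (the nessie.sales path and the column names are illustrative):

CREATE TABLE nessie.sales.customers (
  customer_id INT,
  first_name  VARCHAR,
  last_name   VARCHAR,
  signup_date DATE
)
PARTITION BY (last_name)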

Step 2: Copy Data Into Iceberg

Run a COPY INTO query to copy one or more CSV, JSON, or Parquet files into your Iceberg table. Optionally, you can name the exact files to copy (up to 1,000) with the FILES clause, or match a path pattern, a file name pattern, or both with the REGEX clause. For more information on customizing your query to fit your use case, and for specific syntax examples, refer to the Dremio documentation.
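For instance, the two sketches below load a named list of files and a pattern-matched set of files, respectively (the storage location and file names are placeholders):

-- Copy a specific list of files
COPY INTO nessie.sales.customers
FROM '@s3_source/landing/customers/'
FILES ('customers_2023_01.parquet', 'customers_2023_02.parquet')
FILE_FORMAT 'parquet'

-- Copy every file whose name matches a pattern
COPY INTO nessie.sales.customers
FROM '@s3_source/landing/customers/'
REGEX 'customers_2023_.*\.parquet'
FILE_FORMAT 'parquet'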

Step 3: Optimize Your Tables

The OPTIMIZE TABLE command ensures that your table contains the optimal file sizes for performant queries. You can control the parameters by which the table is optimized, but the default target file size in Dremio is 256 MB, and simply running OPTIMIZE TABLE <table_name> will rewrite the files to that specification.
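For example, using the same illustrative table name as above (the explicit rewrite options are optional, and their exact names are worth confirming in the documentation for your version):

-- Rewrite small files toward the default 256 MB target
OPTIMIZE TABLE nessie.sales.customers

-- Or control the rewrite parameters explicitly
OPTIMIZE TABLE nessie.sales.customers
  REWRITE DATA USING BIN_PACK (TARGET_FILE_SIZE_MB = 256)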

Again, for all of the options available within the OPTIMIZE TABLE command, refer to the Dremio documentation.

Step 4: Clean Up Your Tables

The VACUUM CATALOG command cleans up all tables in your Project Nessie catalog, or in the Dremio lakehouse management service catalog available in Dremio Cloud, by expiring snapshots and deleting unused data files. This process improves storage utilization for your data lakehouse. If you are using Dremio Cloud and the lakehouse management service as your catalog, automatic table cleanup is now enabled by default and runs in the background.
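A sketch of the command, assuming your catalog source is named nessie:

-- Expire old snapshots and delete unreferenced data files for every table in the catalog
VACUUM CATALOG nessie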

Versioning Best Practices For Iceberg Tables With Nessie and Dremio Lakehouse Management

Iceberg includes a snapshot capability that enables users to recreate a view of their data lakehouse at a specific point in time using metadata pointers. Project Nessie is an open-source data lakehouse catalog that uses Iceberg snapshots to introduce Git-inspired data versioning.

In a data lakehouse environment where production users may be impacted by data changes, data teams can use Git for Data to make changes and optimize tables in isolation, and test tables before exposing them to end users. In Dremio Cloud, these versioning capabilities are available by default through our lakehouse catalog.

First, use CREATE BRANCH to instantly create a zero-copy clone of your catalog. In this example, we name the branch "ETL".
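A sketch of the statement, again assuming a catalog source named nessie:

-- Create a zero-copy ETL branch from the current state of the catalog
CREATE BRANCH "ETL" IN nessie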

Ensure that you set the context in the Dremio session to the ETL branch, and create and/or populate your Iceberg table. Alternatively, with this new update, if you don't want to set the context to the branch, you can run COPY INTO with an AT BRANCH "ETL" clause to load data into your ETL branch.
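Either approach can be sketched as follows; here we set the session context with a USE BRANCH statement rather than through the UI, and the catalog, table, and storage location names remain placeholders:

-- Option 1: switch the session context to the branch, then load
USE BRANCH "ETL" IN nessie;
COPY INTO nessie.sales.customers
FROM '@s3_source/landing/customers/'
FILE_FORMAT 'parquet';

-- Option 2: target the branch directly, without changing the session context
COPY INTO nessie.sales.customers AT BRANCH "ETL"
FROM '@s3_source/landing/customers/'
FILE_FORMAT 'parquet';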

Now, you can perform optimizations and test your data in the branch to ensure everything works properly.

Once you are satisfied with your changes, you can use MERGE BRANCH to merge your changes into your main branch. Your changes will be exposed to end users atomically, so they always see a consistent and accurate view of the data.
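Continuing the sketch from above:

-- Atomically publish the changes from the ETL branch into main
MERGE BRANCH "ETL" INTO main IN nessie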

Conclusion

By delivering high performance analytics on huge volumes of data, providing data warehouse functionality directly on the data lake, and simplifying data operations, Apache Iceberg effectively bridges the gap between the data lake and the data lakehouse. Now, with the new ingestion capabilities in Dremio 24.3 and Dremio Cloud, it’s never been easier to get your data into Iceberg tables and start taking advantage of the performance and capabilities.

The easiest way to get started with Iceberg is in Dremio Cloud, our managed service that automates Iceberg table optimization and provides a lakehouse management service with Git-inspired data versioning. Sign up for free here.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.