The Dremio Blog

Announcing DML Support for Apache Iceberg

Jason Hughes, Director of Product Management, Dremio
Shrirang Kamat, Product Management Director, Dremio

Today we’re excited to announce DML (Data Manipulation Language) support for Apache Iceberg! This allows you to mutate your data directly in your data lake via standard SQL DML commands like INSERT, UPDATE, MERGE INTO, and DELETE from Dremio. Together with the existing SELECT support for Iceberg tables, you can run any SQL statement on data in your data lakehouse with Dremio.

This capability is available now in Dremio Cloud and will soon be available in the upcoming 22.0 release of Dremio software.

The full power of SQL, on an open lakehouse

Once a capability only available in data warehouses and other databases, data mutations are now possible directly on the data lake. With the ability to do DML directly on the data lake, it’s now easier than ever to run your workloads on an open lakehouse without having to deal with expensive and proprietary cloud data warehouses. You no longer need to copy your data into a closed data warehousing system to have the full power of SQL at your fingertips.

This DML capability is enabled by Apache Iceberg, a high-performance table format that solves challenges with traditional tables in data lakes. It’s rapidly becoming an industry standard for managing data in data lakes, establishing the foundation of an open lakehouse. Iceberg brings the reliability and simplicity of database tables to the lakehouse, while also enabling multiple engines to work together on the same data in a safe and transactionally consistent manner.

DML on the lakehouse in action

To demonstrate DML capability in Dremio, we'll create two Iceberg tables, SF_incidents2016 and SF_incidents2016_stage, using the SF_incidents2016.json dataset included in the sample source in Dremio.

For this demonstration, let's say that some of the data for December 31, 2016 has changed since it was originally loaded into the production SF_incidents2016 table, and the new values for this day have been loaded into SF_incidents2016_stage. We want to merge these changes into the production SF_incidents2016 table.

You can follow along with this scenario by running these two CREATE TABLE AS SELECT statements:

CREATE TABLE S3.SF_incidents2016 AS
SELECT *
FROM Samples."samples.dremio.com"."SF_incidents2016.json";

CREATE TABLE S3.SF_incidents2016_stage AS
SELECT IncidntNum, Category, Descript, DayOfWeek, "Date", "Time", PdDistrict,
    CASE WHEN IncidntNum IN (161061373, 161061420, 161061464)
        THEN 'ARREST, BOOKED'
        ELSE Resolution
    END AS Resolution,
    Address, X, Y, Location, PdId
FROM Samples."samples.dremio.com"."SF_incidents2016.json"
WHERE "Date" = '2016-12-31';

Now that the tables are set up, let’s merge the updated values of the data for December 31, 2016 from the stage table into the production table.

The following MERGE INTO command can be run in Dremio to update the values of Resolution in SF_incidents2016 for rows where there’s a matching IncidntNum value in the stage table, and INSERT non-matching records into SF_incidents2016 from the stage table:

MERGE INTO  S3.SF_incidents2016 a
USING       S3.SF_incidents2016_stage b
ON a.IncidntNum = b.IncidntNum
    AND a.PdId = b.PdId
WHEN MATCHED THEN
    UPDATE SET Resolution = b.Resolution
WHEN NOT MATCHED THEN
    INSERT VALUES (b.IncidntNum, b.Category, b.Descript, b.DayOfWeek, b."Date", b."Time", b.PdDistrict, b.Resolution, b.Address, b.X, b.Y, b.Location, b.PdId);

We can verify that the updated values for these incidents are now in the production table by running the following statement:

SELECT *
FROM S3.SF_incidents2016
WHERE IncidntNum IN (161061373, 161061420, 161061464);

Next, let's say that the court decided to expunge someone's record, so we need to delete the record of that incident. We can run the following DELETE statement to remove that single record from the SF_incidents2016 table:

DELETE FROM S3.SF_incidents2016
WHERE IncidntNum = 161061549;
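
To confirm the deletion, you can run a quick check. This query is an illustrative addition to the walkthrough; it should return a count of 0 once the DELETE above has completed:

```sql
-- Illustrative check: counts any remaining rows for the deleted incident
SELECT COUNT(*) AS remaining
FROM S3.SF_incidents2016
WHERE IncidntNum = 161061549;
```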

Finally, let's say that going forward, the application writing this data will change the values it writes: when an incident is resolved by arrest and booking, it will write ARREST & BOOKED instead of ARREST, BOOKED. To ensure downstream reports don't see a mix of ARREST, BOOKED and ARREST & BOOKED, we need to update all of the previous records to the new format:

UPDATE S3.SF_incidents2016
SET Resolution = 'ARREST & BOOKED'
WHERE Resolution = 'ARREST, BOOKED';
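
As a quick sanity check (an illustrative query, not part of the original walkthrough), this should return no rows once the UPDATE above has run, since every record now uses the new format:

```sql
-- Illustrative check: should return zero rows after the UPDATE
SELECT *
FROM S3.SF_incidents2016
WHERE Resolution = 'ARREST, BOOKED';
```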

Run all your SQL workloads on an open lakehouse today

Dremio offers the ability to change data directly on the lakehouse, so you no longer need to copy your data into an expensive and proprietary data warehousing system to have the full power of SQL at your fingertips. 

You can read more about this feature in the Dremio documentation.

Try it out now on Dremio Cloud and please reach out with any feedback to [email protected]!
