The Dremio Blog

Announcing DML Support for Apache Iceberg

Jason Hughes, Director of Product Management, Dremio
Shrirang Kamat, Product Management Director, Dremio

Today we’re excited to announce DML (Data Manipulation Language) support for Apache Iceberg! This allows you to mutate your data directly in your data lake via standard SQL DML commands like INSERT, UPDATE, MERGE INTO, and DELETE from Dremio. Together with the existing SELECT support for Iceberg tables, you can run any SQL statement on data in your data lakehouse with Dremio.

This capability is available now in Dremio Cloud and will soon be available in the upcoming 22.0 release of Dremio software.

The full power of SQL, on an open lakehouse

Once a capability only available in data warehouses and other databases, data mutations are now possible directly on the data lake. With the ability to do DML directly on the data lake, it’s now easier than ever to run your workloads on an open lakehouse without having to deal with expensive and proprietary cloud data warehouses. You no longer need to copy your data into a closed data warehousing system to have the full power of SQL at your fingertips.

This DML capability is enabled by Apache Iceberg, a high-performance table format that solves challenges with traditional tables in data lakes. It’s rapidly becoming an industry standard for managing data in data lakes, establishing the foundation of an open lakehouse. Iceberg brings the reliability and simplicity of database tables to the lakehouse, while also enabling multiple engines to work together on the same data in a safe and transactionally consistent manner.

DML on the lakehouse in action

To demonstrate DML capability in Dremio, we'll create two Iceberg tables, SF_incidents2016 and SF_incidents2016_stage, using the SF_incidents2016.json dataset included in the sample source in Dremio.

For this demonstration, let's say that some of the data for December 31, 2016 has changed since it was originally loaded into the production SF_incidents2016 table, and the new values for this day have been loaded into SF_incidents2016_stage. We want to merge these changes into the production SF_incidents2016 table.

You can follow along with this scenario by running these two CREATE TABLE AS SELECT statements:

CREATE TABLE S3.SF_incidents2016 AS
SELECT *
FROM Samples."samples.dremio.com"."SF_incidents2016.json";

CREATE TABLE S3.SF_incidents2016_stage AS
SELECT IncidntNum, Category, Descript, DayOfWeek, "Date", "Time", PdDistrict,
    CASE WHEN IncidntNum IN (161061373, 161061420, 161061464)
        THEN 'ARREST, BOOKED'
        ELSE Resolution
    END AS Resolution,
    Address, X, Y, Location, PdId
FROM Samples."samples.dremio.com"."SF_incidents2016.json"
WHERE "Date" = '2016-12-31';

Now that the tables are set up, let’s merge the updated values of the data for December 31, 2016 from the stage table into the production table.

The following MERGE INTO command can be run in Dremio to update the values of Resolution in SF_incidents2016 for rows where there’s a matching IncidntNum value in the stage table, and INSERT non-matching records into SF_incidents2016 from the stage table:

MERGE INTO  S3.SF_incidents2016 a
USING       S3.SF_incidents2016_stage b
ON a.IncidntNum = b.IncidntNum
    AND a.PdId = b.PdId
WHEN MATCHED THEN
    UPDATE SET Resolution = b.Resolution
WHEN NOT MATCHED THEN
    INSERT VALUES (b.IncidntNum, b.Category, b.Descript, b.DayOfWeek, b."Date", b."Time", b.PdDistrict, b.Resolution, b.Address, b.X, b.Y, b.Location, b.PdId);

We can verify that the updated values for these incidents are now in the production table by running the following statement:

SELECT *
FROM S3.SF_incidents2016
WHERE IncidntNum IN (161061373, 161061420, 161061464);

Next, let's say that the court decided to expunge someone's record, so we need to delete the record of that incident. We can run the following DELETE statement to remove that single record from the SF_incidents2016 table:

DELETE FROM S3.SF_incidents2016
WHERE IncidntNum = 161061549;
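
To confirm the deletion, you can run a quick check. This query is an illustrative addition to the walkthrough; it should return a count of 0 once the DELETE above has completed:

```sql
-- Illustrative check: counts any remaining rows for the deleted incident
SELECT COUNT(*) AS remaining
FROM S3.SF_incidents2016
WHERE IncidntNum = 161061549;
```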

Finally, let's say that going forward, the application writing this data will change the values it writes: when an incident is resolved by arrest and booking, it will write ARREST & BOOKED instead of ARREST, BOOKED. To ensure downstream reports don't see a mix of ARREST, BOOKED and ARREST & BOOKED, we need to update all of the previous records to the new format:

UPDATE S3.SF_incidents2016
SET Resolution = 'ARREST & BOOKED'
WHERE Resolution = 'ARREST, BOOKED';
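
As a quick sanity check (an illustrative query, not part of the original walkthrough), this should return no rows once the UPDATE above has run, since every record now uses the new format:

```sql
-- Illustrative check: should return zero rows after the UPDATE
SELECT *
FROM S3.SF_incidents2016
WHERE Resolution = 'ARREST, BOOKED';
```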

Run all your SQL workloads on an open lakehouse today

Dremio offers the ability to change data directly on the lakehouse, so you no longer need to copy your data into an expensive and proprietary data warehousing system to have the full power of SQL at your fingertips. 

You can read more about this feature in the Dremio documentation.

Try it out now on Dremio Cloud and please reach out with any feedback to [email protected]!
