Get Started Free
No time limit - totally free - just the way you like it.Sign Up Now
Today we’re excited to announce DML (Data Manipulation Language) support for Apache Iceberg! This allows you to mutate your data directly in your data lake via standard SQL DML commands like
MERGE INTO, and
DELETE from Dremio. Together with the existing
SELECT support for Iceberg tables, you can run any SQL statement on data in your data lakehouse with Dremio.
This capability is available now in Dremio Cloud and will soon be available in the upcoming 22.0 release of Dremio software.
Once a capability only available in data warehouses and other databases, data mutations are now possible directly on the data lake. With the ability to do DML directly on the data lake, it’s now easier than ever to run your workloads on an open lakehouse without having to deal with expensive and proprietary cloud data warehouses. You no longer need to copy your data into a closed data warehousing system to have the full power of SQL at your fingertips.
This DML capability is enabled by Apache Iceberg, a high-performance table format that solves challenges with traditional tables in data lakes. It’s rapidly becoming an industry standard for managing data in data lakes, establishing the foundation of an open lakehouse. Iceberg brings the reliability and simplicity of database tables to the lakehouse, while also enabling multiple engines to work together on the same data in a safe and transactionally consistent manner.
To demonstrate DML capability in Dremio, we will create two Iceberg tables
SF_incidents2016_stage using the
SF_incidents2016.json dataset included in the sample source in Dremio.
For this demonstration, let’s say that some of the data for December 31, 2016 has changed since it was originally loaded into the production
SF_incident2016 table and the new values for this day have been loaded into
SF_incidents2016_stage. We want to merge the new changes into the production
You can follow along with this scenario by running these two
CREATE TABLE AS SELECT statements:
CREATE TABLE S3.SF_incidents2016 AS SELECT * FROM Samples."samples.dremio.com"."SF_incidents2016.json";
CREATE TABLE S3.SF_incidents2016_stage AS SELECT IncidntNum, Category, Descript, DayOfWeek, "Date", "Time", PdDistrict, CASE WHEN IncidntNum IN (161061373, 161061420, 161061464) THEN 'ARREST, BOOKED' ELSE Resolution END AS Resolution, Address, X, Y, Location, PdId FROM Samples."samples.dremio.com"."SF_incidents2016.json" WHERE "Date" = '2016-12-31';
Now that the tables are set up, let’s merge the updated values of the data for December 31, 2016 from the stage table into the production table.
MERGE INTO command can be run in Dremio to update the values of Resolution in
SF_incidents2016 for rows where there’s a matching
IncidntNum value in the stage table, and
INSERT non-matching records into
SF_incidents2016 from the stage table:
MERGE INTO S3.SF_incidents2016 a USING S3.SF_incidents2016_stage b ON a.IncidntNum = b.IncidntNum AND a.PdId = b.PdId WHEN MATCHED THEN UPDATE SET Resolution = b.Resolution WHEN NOT MATCHED THEN INSERT VALUES (b.IncidntNum, b.Category, b.Descript, b.DayOfWeek, b."Date", b."Time", b.pdDistrict, b.Resolution, b.Address, b.x, b.y, b.Location, b.PdId);
Now we can see that the values for the updated incidents are in the production table by running the following statement:
SELECT * FROM S3.SF_incidents2016 WHERE IncidntNum IN (161061373, 161061420, 161061464)
Now, let’s say that the court decided to expunge someone’s record, so we need to delete the record of the incident. We can run the following
DELETE statement to delete that single record in the
DELETE FROM S3.SF_incidents2016 WHERE IncidntNum = 161061549
Now, let’s say that moving forward the application writing this data will change the values it writes. Now, when an incident is resolved by arrest and booking, it will no longer write
ARREST, BOOKED, it will write
ARREST & BOOKED. To ensure all downstream reports don’t see both
ARREST, BOOKED and
ARREST & BOOKED, we need to update all the previous records to reflect this new column format.
UPDATE S3.SF_incidents2016 SET Resolution = 'ARREST & BOOKED' WHERE Resolution = 'ARREST, BOOKED';
Dremio offers the ability to change data directly on the lakehouse, so you no longer need to copy your data into an expensive and proprietary data warehousing system to have the full power of SQL at your fingertips.
You can read more about this feature in the documentation here.