CI/CD on the Lakehouse: Making Data Changes and Repair Safe and Easy

In Software development developers use Continuous Integration and Continuous Deployment as a technique to automate the integration of new code safely and quickly. A challenge in the data world is often integrating new data ingested from batches and streams with the same safety and speed. As the data world adopts more and more of these software best practices in a great trend we like to call “Data as Code” how can we begin automating our data integration in ways that are also safe and fast.

In this talk we’ll discuss:
– How to isolate your data ingestion
– How to Audit your ingested Data
– How to integrate your new data
– How to automate these steps

Topics Covered

Iceberg

Open Source

Sign up to watch all Subsurface 2023 sessions

Speaker

Alex Merced

Developer Advocate

Video Synopsis

What is CI/CD?
How Does Automating Code Integration Work?
Isolating Test Branches and Rolling Back Previous Data
Managing Data With Dremio Arctic
Using Apache Arrow Flight For Faster Connectivity
Data Table Demo Using Iceberg Tables
Fixing Data Mistakes With Dremio Arctic
Automating Data Queries Using Dremio

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Alex Merced:

My presentation today is CI/CD on the Lakehouse, making data changes and repairs safe and easy. As many of you know, I am Alex Merced, developer advocate here at Dremio. Basically, constantly talking about the data Lakehouse and all the cool things the data Lakehouse can be, whether it be on blogs, podcasts, instructional videos yeah. So we’ll move on and get to our feature presentation. But on today’s agenda, what I want to talk about today is, again, CI/CD and the Lakehouse. We’re going to talk about like, what is CI/CDs? When I say CI/CD, what does that mean? Okay, why does it matter? What is the big deal here? And then I want to talk about how that relates to the topic. That was a very big part of our today’s keynote, which was the data as code paradigm.

What is CI/CD?

So we’ll show some examples like branching and isolation, version rollback, multitable transactions, merging, publishing data, and then how to automate all that or just possible strategies to automate that. Okay? So let’s first talk about what CI/CD is. Okay? It stands for continuous integration and continuous deployment. So let’s just talk about what those words mean in parse the amounts when I say integration, okay, because we’re talking about the world of software development. This is like a typical software development practice. So we’re talking about integration, we’re talking about integrating new code into old code, but it’s not just adding the new code, it’s about doing it safely, doing it without bugs. Doing it without hiccups, okay? When we talk about deployment, it’s about publishing that code. So that application can be used to the world, but again, doing it safely, doing it without bugs, okay?

How Does Automating Code Integration Work?

So we get what integration and deployment is, but what is continuous? So continuous is just the extra step of automating those steps, okay? So for all the software developers, you might have experiences with tools like GIT or Jenkins. You might be using GitHub actions where you push code up and then it triggers a series of unit tests and end-to-end tests. And if all those tests pass, then it gives it the go ahead to then merge that code into the code base. And that allows you to be more productive and more safe with what you put out there. So why don’t we try to adopt the same practices with our data? Okay? And why does this matter? Okay? Because the quality of our code leads to the quality of our applications and the better our applications are, the better the end user experience, the better business value. Because you’re going to have more users, okay? Same thing with our data quality. The better our data quality is, the better the insights that we get from that data, which again, drives business value. So the bottom line is quality equals business value. So it’s a good investment to think about how we can improve quality?

So in the paradigm we’re talking about in the software development world, we think of this whole paradigm of pushing the code. Again, all those unit and end-to-end tests get automatically triggered. And then if all is successful, that code then gets merged into the code base. So that push test merge, how would that look like in the data world?

Okay, well, we’ll think about it this way. We write the data. So we want to write the data. We want that data to be audited but not published until that audit is successful. So write, audit, publish. And when we’re thinking about GIT like semantics, that would mean sort of to branch and ingest that data on a branch, audit the data on that branch and then merge that branch. Okay? So, I think of that song Foodie Toodie. So it’s like wap bam lam, nevermind. It’s like his thing, okay? Branch, audit, merge. Okay? So we want to move to this paradigm and again, why? Because of the data’s code paradigm. You need to allow for isolation. So that’s the whole idea of branching and being able to do that ingestion work on a branch isolated from what your consumers are enjoying.

Isolating Test Branches and Rolling Back Previous Data

Okay? So when Facebook is adding a new feature, okay, all those new features are being worked on, on a branch. I’m not accidentally seeing halfway written code when I’m checking my Facebook profile, okay? Same thing. It should be with your data, okay? A data analyst, a data scientist shouldn’t be seeing half updated code while they’re running their queries. Okay? So we want to isolate that work. We want version control, okay? We make mistakes, okay? And nothing is worse than spending the weekend backfilling data. Okay? So in that case, we can avoid that if I can just roll back my mistakes and just say, Hey, let me go back to a previous commit like it can do with my code. Okay? And then governance. Okay? Nothing that happens when you’re using GIT is sometimes only certain people are allowed to push to certain branches only that some people are allowed to authorize.

Okay? So being able to have that visibility into who’s doing what and being able to control who’s doing what can have a lot of benefits. Okay? So all of these features, if we can apply that to our data, well, we’ll make our lives a lot easier. Okay? So as I do this example, or I go through this example that we’ll be discussing today, basically let’s set the stage sort of like the stack or our data stack that we’re imagining in the scenario. Okay? So we’re imagining that we’re landing our data, okay? It’s because we have to have our data somewhere. So we’re imagining some sort of like object storage is our data storage, whether it be like S3, Azure, Google Cloud. And that data has to be stored in a particular file type. So imagine like Parquet because we get the benefit of all those rogue groups and the metadata for those rogue groups to more efficiently query that data and that nice column binary format.

Managing Data With Dremio Arctic

Okay? We’re going to need a data management tool, okay? That’s going to be the thing that enables this whole data as code. In the same way GIT allows us to manage our code, Dremio Arctic is going to allow us to manage our data in that same way. Particularly it’s going to allow us to manage the data that we have sitting in Iceberg tables. So Iceberg tables will be our table format. So basically the tool that our engines and all our tools use to recognize groups of our files as tables. Okay? But at the end of the day, great that I have this data that’s recognized as tables. I need an engine to query that data. So I’m going to talk basically the way these query the queries you’ll be seeing over the next few pages. They’re based on syntax as you would use them in Dremio Sonar, okay?

Using Apache Arrow Flight For Faster Connectivity

But the cool thing about Dremio Sonar and other tools that are based on Arrow is that you can use Apache Arrow Flight as a one way to connect it to get extra fast connectivity. Okay? There’s this video on YouTube where I compare ODBC to Arrow Flight and you can see at different numbers of rows and when you start getting some really high numbers of rows, you start seeing a really big difference in performance and being able to access that data quickly. So we’re going to assume that you’re connecting to your data using Apache Arrow flight, which actually the next talk we’ll be discussing or Apache Arrow in general. So look forward to that next talk. And then again, we’re going to assume that hey, all your analysts are consuming that data locally, maybe in an a Jupyter notebook using their favorite data frame library like Pandas or Polars, or maybe they’re basically taking that arrow buffer and turning it into a DuckDB relation or turning it into a BI dashboard with Tower BI or Tableau.

They can consume it in all the tools that they want, okay? But the idea is we want to manage that code so that way when they do their job, they don’t have to worry about the quality of the data. They know that the data that they have access to is the good stuff. Okay? So first thing, branching isolation. So just set the stage, imagine that this is a company that handles like virtual assistants. So basically we have a roster of virtual assistants who are willing to be a virtual assistant to somebody. And then we have a bunch of customers who hire these virtual assistants. And then we basically track sales basically as a virtual assessment for this customer. Okay? So then we have three tables. We have our sales table and then we have our two dimension tables, which are our assistance and our customers.

Data Table Demo Using Iceberg Tables

Okay? So in this, see here, the first couple queries, what I’m doing is I’m creating a branch. I’m going to say, you know what, I plan on ingesting some data. So I’m going to create a branch for ingesting that data and then I’m going to make sure I switch my context over to that branch so that way any query going forward is occurring on that branch. So you see that second line, you, that used branch line, okay? Then however you want to pull in your data, you can pull in your data. So in this case, what I’m doing is I’m just using a C statement to pull in some data from another set of files and then putting them over into a staging iceberg table. So I’ll do, that’s what that second statement is doing. And then what I’m going to do is I’m going to do an upsearch on my data.

I want to say, okay, here’s my sales table, here’s my staging data and I’m going to run a merge saying, okay, basically anytime the IDs match between these two, update the record. Anytime they don’t match, insert the record and all good. So now I theoretically ingested the data, but I haven’t had a chance to audit it yet. I haven’t had a chance to run any checks, but I don’t have to worry about that. Why? Because all of this occurred on the branch, okay? All my BI dashboards, everyone’s notebooks are unaffected because their data is querying the data that is on the main branch, not this staging branch. So I don’t have to worry about anyone giving me a call five minutes later being like, what happened? Okay? I can just proceed forward. Okay? So now I’m going to want to audit that data, okay?

So in this case, again, I’m switching over to my staging branch, okay? And then I’m doing a referential integrity check because basically each of those sales is referring to a customer in the customer’s table and referring to an assistant in the assistance table. And I want to make sure that there’s an actual record for each assistant and an actual record for each customer. So hopefully when I run this query, I should have the same count as the total number of records, as the number of records that pass the referential integrity test. That’s good. But if not, I have a problem so I can run that test. So I run that test and again, what if there is a problem or what if I pass that test, but then I make some other edits and I make a mistake? Well, I can just roll it back and rolling it back is as easy as this.

Fixing Data Mistakes With Dremio Arctic

I just found a commit number that I would like to roll back to and I just use this statement right here where I alter the branch and I say, Hey, for this branch, make this branch point to this commit. And it’s like all those mistakes never happened, okay? Really with just one SQL statement. And it’s not just for one table. Like if I made a mistake across a hundred tables, that one statement would fix all hundred tables because Dremio Arctic does it at the catalog level. It’s versioning the whole catalog, okay? That’s a big distinction over other solutions. There are other solutions that branch at the file level. There are solutions that branch at the table level, but then you have to go back to each table, okay? There’s a lot of other stuff going on. But when you have it at the catalog level, you just roll back and all your tables are back to where they were before.

Okay? So there’s to see the difference and the benefit, the value of that difference, that different point of version control. Okay? But another cool thing that happens is that you can end up having multi table transactions. Up till now, that was like a thing of the data warehouses. You do that whole begin transaction statement and then you run a bunch of other SQL statements and then you do an end transaction statement and everything happens. There’s one big commit and that’s nice, but how do you do that on a data lake? Well, here’s how, okay? So in this situation, again, I switched to my branch and here, I’m adding a new assistant and I’m adding a new sale, okay? Normally if I was to do this on a data lake without branching, I would first add the assistant or if I added a sale first, that might screw up the integrity and I might end up having some bad queries before I finish all my updates.

But because I do this all on the branch, I can wait till all my updates are done and at the end then I can just merge it back in and all those updates have occurred at the same time. So it’s one big atomic update across all these tables that I’ve updated. Okay? So I can do multi table transactions, okay? And then I can then publish those changes through a merge transaction, just merge that branch. And now basically both branches should be identical. No difference. They should have the same records because they now have all the same commits. Wonderful. Okay? So again, I can branch, I can audit my data and then I can merge in the same way we do that. Now, the missing piece is, hey, we’ve talked about integration, we’ve talked about deployment, but what about the continuous part?

Automating Data Queries Using Dremio

Okay? Because the fun part is the automating of it, right? Because we don’t have to do this manually every time. Well, there’s a couple different options and again, the options will continue to grow as this space expands. But basically all these queries are available in Dremio. So any tool that allows you to orchestrate Dremio, you can basically set up or orchestrate those pipelines to handle that. That’s one option, okay? You could also create custom web services to handle this. So like in this example here, we’re imagining a scenario where we have an ETL job. Okay? So imagine like you’re using Spark or Flink to consume a data stream. And then what we do is we’re assuming that we’re ingesting that data into some cloud like S3, okay? You have serverless functions that you can trigger when data hits your buckets.

Okay? So basically we’ll say, hey, a Lambda function gets triggered the minute that data gets ingested and sees those files in the bucket. And then that Lambda can then call a web service, in this case written a Python flask. So that way we could use Python libraries, okay? And that Python service, that Flask service can then send all those queries to Dremio and run through all those checks. And then again using something like connecting through either JDBC or connecting through Arrow Flight, whatever your connection is of choice. But it can then take those return data, say, Hey, did it pass this check? Did it pass this check? Did it pass this check? And once all the checks have been passed, send that final merge transaction. And then the cool thing is you can also then pull your business rules from somewhere else because you can make a table in Dremio that has all your business rules, like saying, okay, hey, this table, like the cost of a service should never be less than a hundred dollars, but never be more than $50,000 or something like that.

You can take those business rules and use those in your audits because you have all that data accessible in Dremio, okay? But because you’re using Python Flask, you could also bring that data down and then run it against other libraries, like great expectations for doing a lot of different types of data quality checks and stuff like that. So you can choose how you want to do it. So again, right now you have two choices. You can use orchestration tools, whatever your favorite orchestration tool is, or you can create a custom service. But the idea is you can then automate all this process and then yeah, you, how have CI/CD, you now have a process where you are continuously ingesting and you are continuously deploying your data with that whole branch audit merge process. Okay? And with that, that is the end of this presentation. My name is Alex Merced. You can always email me @[email protected]. Please follow me on Twitter @amdatalackhouse. I like to post a lot of videos and I have a lot of really cool videos even on this topic planned over the next few weeks. So you don’t want to miss those.