May 3, 2024

Using Dremio to Implement a DataOps Framework for Managing Data Products

DataOps is a framework that brings users closer to the data, streamlines data management and operations, accelerates time to insight, and reduces total cost of ownership. This session will explore using Dremio to implement a DataOps approach to managing data products, including: turning data assets into data products, adopting core DataOps concepts like version control, quality assurance, CI/CD, and observability, and simplifying data pipelines with Zero-ETL virtual data products.

Topics Covered

DataOps and ELT/ETL
Performance and Cost Optimization


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Dave Anderson:

Yep, my name is Dave Anderson. I’ve been here at Dremio for about three years now, so I’ve been around the block a few times here. I’m pretty familiar with the product, but I’m here today to talk to you about using Dremio to implement a DataOps and data product framework. I’ve got a couple of slides that I’m going to show, just because I think it’s a rule that you have to have at least a couple of slides to show in one of these events, but we’re going to get through those and then dive right into the product so you can understand the concepts by actually seeing them work live. So let me go ahead and just get started with this, and we’re going to jump back and forth a little bit between this and my demo environment.

Data Lakehouse

But yeah, we’re talking about the data lakehouse, DataOps, and data products. We’re going to jump in and kind of set the foundation for how we’re going to build these and what some of the key concepts are that are needed to make all this work. The first thing that we need to talk about is: what really is a data lakehouse? How is it different than just a data lake? And it’s actually very simple. A data lake, we all know: you’ve got storage, object storage somewhere out there, and then you’ve got some kind of compute layer that’s able to query that data, Dremio, Athena, Starburst, any of those derivatives, et cetera. There are a lot of different products that can interact with the data lake.

Now, a data lakehouse has really a lot of the same core components. You’ve got your storage, you’ve got your compute, same type of thing. The big difference is the implementation of a table format and a catalog to interact with that table format. And really what this is doing is making it so we can treat our data lake like a data warehouse. We can do DML operations, insert, update, delete, against data sets that are built out on a data lake, which is not something that you can do in a typical data lake. But you can with the implementation of a table format like Iceberg, which I’m sure you’ve heard a lot about this weekend. There are other table formats out there as well, but we’re going to focus primarily on Iceberg today. A table format and a catalog to interact with that table format really give you the ability to take advantage of that lakehouse functionality, as well as do a lot of really interesting things that we’re going to shift over to start talking about now, which is DataOps.
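
To make that concrete, here is a rough sketch of the kind of DML the lakehouse pattern enables, written as Dremio SQL against a hypothetical Iceberg table; the catalog, table, and column names are placeholders for illustration.

```sql
-- Append new rows to an Iceberg table in the lakehouse (hypothetical table and columns)
INSERT INTO demo_catalog.inventory (model_id, state, available)
VALUES (3, 'TX', 12);

-- Update rows in place, something a plain data lake can't do
UPDATE demo_catalog.inventory
SET available = available - 1
WHERE model_id = 3 AND state = 'TX';

-- Delete rows, just like a warehouse
DELETE FROM demo_catalog.inventory
WHERE available = 0;
```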

DataOps

So what is DataOps? It’s also called data as code, but DataOps is actually a broader classification. Data as code is really just a piece of it, but it’s an important piece, because it gives us the functionality we need to enable these types of behaviors. So DataOps gives us isolation of our data. We can take a particular data set and do things to it in an isolated fashion, so we don’t have to worry about changes to data sets happening in one place that are going to fan out to a lot of different locations as well. We can isolate our data. We can do all these things in a zero-copy fashion. The typical pattern that you would see in a normal data lake, or even in a data warehouse type of environment, is that for ETL jobs or changes that you need to make to data sets, you have to make entire copies of those data sets and then either do your transformations and load it back in, or make a copy, add additional features, and then point your applications to the new copy, et cetera. With a data as code and DataOps implementation, we can do this in a zero-copy fashion.

We can also do this multi-table. What I mean by multi-table is you can have an atomic action that occurs across multiple tables. You can log in and make changes to multiple different tables inside of your data set or inside of your catalog and commit those transactions in a single atomic action, so you now have the proper representation of all the changes that you’ve made at the exact same point in time. We can accomplish that in Dremio with Iceberg and Nessie. We’ll talk about that, and I’ll show you that as well in just a little bit. Rollbacks as well. You make a mistake, you need to go back to a previous point in time? No problem. This is very critical for any type of DataOps implementation, because chances are something’s going to break at some point in time and you’re going to need to fix it. Who wants to go do a restore? It’s much easier just to revert back to a previous view of that data and then try again until you get it right.

Then reproducibility. We can go back in time and see what data looked like at a previous point, and investigate who changed this data yesterday at 3 p.m., or what this data looked like before our batch process ran last night in the middle of the night. A very important concept for any type of DataOps implementation. And then of course validation. If we’re going to be making changes to our data, before we make them live, we want to be able to ensure that this is the right version of the data that we want people to see. These are the core concepts wrapped around DataOps, or data as code. I’m going to jump in and talk a little bit more about how this is actually possible with Dremio, Iceberg, and Nessie.
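
As a rough illustration of that reproducibility piece, Dremio can query a table as of a branch, tag, or commit in a Nessie-backed catalog; the catalog name, tag name, and commit hash below are made up for the example.

```sql
-- What the table looks like right now on main
SELECT * FROM demo_catalog.inventory AT BRANCH main;

-- What it looked like at a tag we set before last night's batch run
SELECT * FROM demo_catalog.inventory AT TAG pre_batch_load;

-- Or pinned to an exact commit hash from the catalog's history
SELECT * FROM demo_catalog.inventory AT COMMIT "a1b2c3d4e5f6";
```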

Git For Data

So Git for data is really what we’re kind of talking about: versioning of data, or branching of data. There are a lot of different terms, but they all basically mean the same thing. We want to be able to snapshot our data at specific points and be able to go back and forth between them. We’re versioning our data just like you would version code in GitHub for code development, whatever that may be. We want to treat our data the same way we’ve been treating our code for many, many years now. It opens up a whole lot of new avenues for us. So why? Why would you do this? Again, we already talked about the zero-copy piece. We can do experimentation on our prod data without touching our actual prod data, not even a copy of it, just a different representation of it. We can do experimentation without impacting our live data that our applications are running against. We can just make a new branch and then play around with that and see what comes out of it. If something’s good, maybe we do push that to production, but we can do this without having to copy an entire data set, and we can do this with a version of our data that is the same as what is currently in prod, without impacting it. Very, very important and very, very powerful.

We can also do our ETL jobs now without impacting our current production workloads. So you can make an ETL branch and push all your changes into that branch’s version of the data, validate it, make sure everything is good, and then once you’re sure that it’s right, you can just merge it back into your main branch. So you get an instant and atomic change to all of that data all at once, without having it run against your currently running production applications. So it’s very, very interesting what we can do there. We already talked about multi-table transactions. So we can do inserts, updates, deletes to a bunch of different tables at different times inside of a branch, and then once we get all the data in the right position, just flip it over. So we can do multi-table transactions. We can roll it back very easily. Again, we can reproduce it. We can do our consistency and quality checks before we send anything live. And most importantly, DataOps is not anything without automation. These are all things you can drive using toolkits that are available out there right now that we’re probably already using: dbt, a whole lot of CI/CD tools, anything that can communicate via REST or SQL. We can take these tools and automate a lot of these processes so we can get the human hands off of the keyboard and let the computers do all the thinking for us.

Versioning of Data

So what is actually happening underneath the hood when we are doing this versioning of data? Well, you’ve got a version of the data as it currently exists. Somebody did a commit: they created a table, inserted some data. This is our very first version of this data. And then as time goes on, people are going to insert more rows, maybe make a change to a table. These are all just different commits that have happened along the path of this data’s lifecycle. With data as code, or DataOps, we can take these commits and do things like apply a tag to a specific point. So if this table had one row, and we inserted a second row at commit two, and added a third row at commit three, we can apply a tag to any one of these commits to make it easy for us to reference in case we need to roll back to that previous commit. And I’m actually going to show you this in just a moment.
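
Here is roughly what applying and using a tag looks like in Dremio SQL against a Nessie-backed catalog; the catalog, table, and tag names are placeholders.

```sql
-- Tag the current head of main so this commit is easy to find later
CREATE TAG inventory_v3 AT BRANCH main IN demo_catalog;

-- Later, read the data exactly as it was when the tag was created
SELECT * FROM demo_catalog.inventory AT TAG inventory_v3;
```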

But the branch is where we can really start seeing some interesting things. So we’ve created a second branch of our production data. If these yellow lines are our production data, we’ve created a branch, the blue circles, where we can do experimentation, where we can do ETL. So this is all happening on the side. And again, this is not a copy of the data. These are just pointers, essentially snapshots and pointers that are maintained in the catalog, the Nessie catalog, for the Iceberg tables underneath that catalog and all the metadata that is contained within them. So we can look at this data at any point in time, because we understand where the data is from that particular snapshot that is maintained in the catalog. So on this branch, we can now go and do an ETL job or a batch load, whatever that may be, and then merge it back into our production workload in a single atomic action. Boom. And now our production applications are able to see the data at this point in time that we’ve built alongside, without impacting any of those applications during that process.

Live Demo

So let me just go now and show you how it’s done, because this is going to be a lot more powerful, I think, to really just see what’s happening here. And what’s actually very funny about this is I logged into this environment earlier today to make sure I had everything still set up, and I realized that I was a very, very bad developer and I did something very silly. I created one of the tables that is part of this demo in our main branch. And since this is a demo environment, this main branch gets rolled back to previous commits all the time. So I had spent some time building just a small, simple table. Nothing in here is going to be very complex. I wanted to make sure that even people like our executives or directors that may have not touched a keyboard in a long time are able to understand this concept, without it just being a big flex of showing off fancy SQL. So don’t laugh at my data model, because it is very, very, very simple and very bad, honestly. But it makes it so you can understand this whole process.

So, again, I created a table yesterday in this preparation space, just called locations. And I came in today to run through my script and I realized my locations table is completely gone. Somebody must have reverted this main branch back to a previous commit. So, bad developer Dave. However, when I did this, I also set a tag at a certain point to make it easier for me to roll back, so I could play with this, go back and forth, and easily get back to that previous commit whose state I was sure of. And I just called that tag da gold. So I made a tag on a commit of our main branch, and I can very easily now just go back to that tag, which is a commit from a previous point in time against our main branch. And now if I go in and look, what do you know, there’s my locations folder, or my locations table rather, that I created yesterday. So I saved myself by using DataOps types of concepts here and taking advantage of Dremio’s implementation of them, saving myself from some rework when somebody dropped my table. I just go back to the previous commit when it existed. So I fixed my whoopsie very, very easily.
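
In SQL terms, the recovery I just did looks something like the following; da gold is the tag from the demo, while the catalog and table paths are placeholders, and the exact statements are a reconstruction of what the UI does.

```sql
-- The table is gone on the current head of main, so this query fails
SELECT * FROM demo_catalog.locations AT BRANCH main;

-- But it is still there at the tagged commit
SELECT * FROM demo_catalog.locations AT TAG "da gold";

-- And if needed, a new branch can be cut from that tagged commit
CREATE BRANCH recovery AT TAG "da gold" IN demo_catalog;
```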

Now, one of the things we do have to talk about when we’re talking about DataOps, because it’s kind of a core concept, is the concept of a data product. A data product is really a data mesh type of philosophy. A data product is a curated view of a data set or a table, whatever it may be; it’s a curated view of data that is governed by a responsible party, typically called a data product owner. And really what we’re doing when we’re building a data product is ensuring that all of our end users and our applications have the same consistent view of it. A good example is customers. Typically, the representation of what a customer is in a company is going to have data in a few different places. You’re likely going to have some core definition of what a customer is in a relational database somewhere, SQL Server, Oracle, and then, if it’s a sales type of implementation, which would be normal with a customers data set, you’re going to have sales or transactions, and that may be out in a data lake or in a data warehouse. You know, you’ve got SQL Server for the main portion and maybe you’ve got your sales transactions out in Oracle.

So a data product is really just saying: OK, company, our definition of a customer is pulling from this particular table in SQL Server, these five columns, and then this table in Oracle, maybe these four columns, and we join them together on this column. And this is what we call a customer. And no changes are made to that representation other than by the governing body, the data product owner. So you have some sort of governance over that representation of the data. The data product is kind of critical to this whole concept, because if you don’t have that type of governance over your representations of data, well, sales may be basing a customer off of a SQL Server database and an Excel spreadsheet. Very common for sales guys. Your data scientists may be purely pulling data from a data lake. So now everybody’s got their own view of what a customer is, and applications may potentially be feeding from different representations of a customer. Now all your reports are out of whack, nothing lines up. So building a proper data product is important.
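
A hedged sketch of what that customer data product could look like as a Dremio view; the source names, table paths, and columns are all hypothetical.

```sql
-- One governed definition of "customer": the relational system of record joined
-- to the sales transactions, changed only by the data product owner
CREATE VIEW data_products.customer AS
SELECT
  c.customer_id,
  c.customer_name,
  c.region,
  s.order_id,
  s.order_total
FROM sqlserver_crm.dbo.customers AS c
JOIN oracle_sales.sales_schema.transactions AS s
  ON s.customer_id = c.customer_id;
```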

Building a Data Product Demo

So we’re actually going to do that first, before we jump into showing you branching and everything else. Let’s go ahead and just build a data product. So in this demo that I’m about to do, we actually have just some very simple data. We have models and prices. This has car information in it. And then we have locations, which is just locations and model IDs and number available. Again, I told you this was a very basic data model. Don’t judge me. So I want to build a data product that is just a simple inventory view across North America of where are all my cars? How many do I have at each location? What’s the model? And what’s the MSRP? So we can do that very, very simply in Dremio. So first we’re just going to go into models and prices. We’re just going to see what’s in here. 

So you can see, very simple table, again, easy to understand. We’ve got an ID for every type of model. We’ve got Model X, Model Y, Model 3, and Model S; as you can guess, these are Teslas. And then we’ve got the MSRP. We might need to update those MSRPs. So this is just a little dimension table. We’ve got some basic information about the four different models that are available for us out there. But this isn’t really an inventory view. Now, we could just have our end users join this table to the locations table every single time they want to get an inventory view. But we don’t really want all of our end users writing SQL, because as we all know, not everybody’s great at it. And they may join on different columns, or join to the wrong table. So we’re going to go ahead and create this inventory view for them and call it a data product. Now in Dremio, I can just click on this Join button here, and I can browse around and go into my catalog. And in my catalog, I know that I have my locations table out here that has location information. So I’m going to join the ID, the model ID, from my cars and models (or, I can’t remember the name of the table) to my locations information based on model ID. Let’s go ahead and apply that and see what our data looks like now.

So you can see now we’ve got our model number, and MSRP, and our ID, and then our model ID, and our state, and the amount available inside of this. Again, this is a very bad data model, but you can all at least follow along with what we’re trying to show here, right? So essentially, now we’ve got this inventory view that shows everything about every location, and every model, and every car that they have in there. So I can now save this off as a new view, and I’m going to call it Site Inventory. And I’m going to save it into a different catalog, actually. So we’re going to save it there. So now when I come into my catalog here, I have a Site Inventory object. So now my end applications, my Tableaus, my Power BIs, our ad hoc SQL that people are running, they can come in and just select star from Site Inventory and get that representation of the data without having to do any of the join logic.
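
The Site Inventory view I just saved through the UI could equally be created in SQL. Something like the following, with placeholder catalog paths and column names, since I’m reconstructing the exact table definitions.

```sql
-- The data product: one consistent inventory view that end users can SELECT * from
CREATE VIEW sales_space."Site Inventory" AS
SELECT
  m.model,
  m.msrp,
  l.state,
  l.available
FROM demo_catalog.models_and_prices AS m
JOIN demo_catalog.locations AS l
  ON l.model_id = m.id;

-- End users and BI tools never have to write the join themselves
SELECT * FROM sales_space."Site Inventory";
```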

And we can even go further. If we wanted to add additional business logic inside of here, we could do things like make it so only people in Texas can see the Texas inventory: where state equals TX. We just want to look at that, so we can see our Texas inventory. So Texas, you can see, doesn’t have a ton of available cars there, but they’ve got enough. But you get the idea. And if we permission these catalogs and these objects, we can ensure that nobody is making changes to this business object or this business model except for the data product owner. So it’s very important to be able to lock that down. This also means that any changes we make to the underlying data will automatically be reflected in this view, and your end users don’t have to change any of their code. But I know we are going to run out of time here. So let me just go in now and start showing you what we can do from a branching perspective.
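
For the Texas example and the lock-down, a rough sketch; the view path and role name are hypothetical, and the GRANT form is an assumption about how this environment’s access control is set up, so check the syntax for your Dremio edition.

```sql
-- A narrower data product: Texas-only inventory
CREATE VIEW sales_space.texas_inventory AS
SELECT * FROM sales_space."Site Inventory"
WHERE state = 'TX';

-- Consumers get read-only access; only the data product owner can change the definition
-- (role name is hypothetical)
GRANT SELECT ON VIEW sales_space.texas_inventory TO ROLE analysts;
```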

So let me get over to this and show you what we’ve got going on here. Okay. So let’s assume that we’ve got our locations and our car models and MSRP and all these different tables out there. But what if we need to make a change to this? We have an ETL project that’s going to load; we’ve got new inventory coming in, right? We don’t want to run this against our production data without having it go through some checks. And also, we don’t want to impact running applications by changing data underneath them. So let’s go ahead and create a new branch. I’m going to create a new branch, and I’m calling it car model Dave’s main, and I’m doing it from that tag that I created previously, so I’ll have that table in there that was missing in the regular main branch. Easy peasy. So we’ve got it. And now we can come take a look at both of our tables that are currently in main. So we can see, yep, there’s Model X, Y, 3, and S. And then there’s my site location information from that locations table that I created.
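
In SQL, cutting that branch from the tag and checking what’s on it looks roughly like this; the branch name approximates the one in the demo, and the catalog and table paths are placeholders.

```sql
-- New working branch cut from the tagged commit, so the recovered table is on it
CREATE BRANCH "car-model-daves-main" AT TAG "da gold" IN demo_catalog;

-- Confirm both tables are visible on the new branch
SELECT * FROM demo_catalog.models_and_prices AT BRANCH "car-model-daves-main";
SELECT * FROM demo_catalog.locations AT BRANCH "car-model-daves-main";
```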

Now we’re going to go ahead and create an ETL branch, a separate branch that I’m going to do some work in. And what I’m going to do is go ahead and just insert this new inventory. Looks like in Texas they got a couple of Cybertrucks, because Elon wants one. So we’re inserting data into this separate branch, and we can see, OK, we inserted one row into the car models table, or models and prices, and then we inserted one row into our locations. And now we can go back and double-check our main branch to ensure that there isn’t anything in currently running production. Those changes have not been committed to production at all yet. You can see: no Cybertruck in Texas. We can sort through there and confirm that. But we do see that data in our ETL branch. So now I’ve got a Cybertruck in my ETL branch, and you can see Texas has got two of model ID five, which is the Cybertruck.
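
The ETL-branch step, sketched as SQL; model ID five for the Cybertruck and the two-table pattern follow the demo, but the column order, the MSRP value, and the exact statements are a reconstruction.

```sql
-- Work happens on an isolated ETL branch, not on the branch our applications read
CREATE BRANCH etl_new_inventory AT BRANCH "car-model-daves-main" IN demo_catalog;

-- Add the new model and the new stock to the ETL branch only
INSERT INTO demo_catalog.models_and_prices AT BRANCH etl_new_inventory
VALUES (5, 'Cybertruck', 99990);   -- id, model, msrp (column order and price are placeholders)

INSERT INTO demo_catalog.locations AT BRANCH etl_new_inventory
VALUES ('TX', 5, 2);               -- state, model_id, available (column order assumed)

-- The main working branch is untouched: no Cybertruck here yet
SELECT * FROM demo_catalog.locations AT BRANCH "car-model-daves-main" WHERE model_id = 5;

-- But the ETL branch sees the new rows
SELECT * FROM demo_catalog.locations AT BRANCH etl_new_inventory WHERE model_id = 5;
```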

So let’s go ahead and merge that in. Now, in one single atomic transaction, I’ve taken those changes and pushed them into both tables at the exact same time. So now if we go back and look at main, we also have that data in our main branch as well. So now our main, which is our production data, is showing that we have a Cybertruck, and Texas has two models available until Elon shows up and drives one away. So if I screw up, though, I can go all the way back to the beginning. I could have done this at the ETL point, where, if you were implementing a true DataOps framework, you would have checked the validity of the data before you merged it into main. But I’m just going to go ahead and go all the way back to the beginning here. And now we can see that my tables are back to the way they were, without the Cybertruck in there. So now no one knows that there was ever a Cybertruck in Texas. And I’ll just go ahead and clean up my branches, and we’ll call it a day.
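
And the merge, the rollback, and the cleanup, again as an approximate SQL sketch of what the UI is doing; the branch and tag names are the demo’s, and the exact forms of ALTER BRANCH and DROP BRANCH may vary by Dremio version.

```sql
-- One atomic merge: both tables flip over on the target branch at the same instant
MERGE BRANCH etl_new_inventory INTO "car-model-daves-main" IN demo_catalog;

-- Screwed something up? Point the branch head back at the tagged commit
ALTER BRANCH "car-model-daves-main" ASSIGN TAG "da gold" IN demo_catalog;

-- Clean up the working branches when we're done (FORCE skips the expected-commit check)
DROP BRANCH etl_new_inventory FORCE IN demo_catalog;
DROP BRANCH "car-model-daves-main" FORCE IN demo_catalog;
```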