A Git-Like Experience for Data Lakes
While traditional data warehouse concepts like transactions, commits and rollbacks are necessary in a SQL context, they’re not sufficient for modern data platforms, which involve many users and data flows. A new open source project, Project Nessie, decouples transactions and makes distributed transactions a reality by building on the capabilities of table formats like Apache Iceberg and Delta Lake to provide Git-like semantics for data lakes. By using versioning concepts, users can work in an entirely new way, experimenting or preparing data without impacting the live view of the data, opening a whole world of possibilities for true DataOps on the data lake. At the same time, they can leverage classic data warehousing transactional concepts in their data lake including three levels of isolation across all tables and the classic START…COMMIT primitives.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Anita Pande:
Hello, everyone. My name is Anita Pande, and I'll be your moderator for this session. It's on a Git-like experience for data lakes, and I'd like to welcome Ryan Murray, who is our open-source engineer at Dremio and a key committer on Project Nessie and Apache Arrow. But just before we get started, [there are] a couple of housekeeping items I'd like to review. For live Q&A, please note that we'll do that ten minutes before the end of the session. You can activate your mic and your camera for the Q&A with the button on the [top right] corner of the screen, and I will put you in the question queue. For those of you who would prefer to submit your Q&A via chat, you can do that throughout the session, and we will line those up towards the end of the call. And finally, Ryan will also have a dedicated speaker Slack channel that I will post in the Q&A for those of you who'd like to follow up and engage with him after his presentation. And last but not least, we have an exciting expo hall, so I urge you to go and get demos and deep dives with those folks and also experience Dremio in a virtual sandbox that we have for a limited time, and also participate in our fun giveaways. With that, Ryan, take it away.
A Git-Like Experience for Data Lakes
Ryan Murray:
Thanks very much, Anita, and thanks a lot to everyone for coming. I'm really, really excited to be here at Subsurface and to finally talk about Nessie. So, to get going, I wanted to sort of set the stage by taking a little bit of a walk back in time, I suppose, to talk about the time before Git, when you were developing software. So, if you were around then, if you had to use Subversion and CVS and these kinds of tools, you know that it was not always easy to use version control. It was expensive to create branches. It was very expensive to merge. You were never really certain that all the changes you thought were on your main branch were actually there. And then Git came along, and that really changed everything for a lot of people. It suddenly made everything easy. Branches were free. It was easy to get code review. And subsequently, for a lot of people, developer productivity, code quality, everything improved. And then that revolution kind of went everywhere. We started seeing our configuration being stored in Git. And, eventually, we started deploying things from Git. So our entire application stack was stored in Git. And recently, we've seen infrastructure as code, where we have the definition of our network and our Linux servers and our databases, and everything is stored in Git. So Git really has taken the development world by storm, but there's one place that it actually hasn't gone, and that's data. And I'm hoping that by the end of the talk, you'll believe that Project Nessie is actually gonna be the thing that brings a Git-like experience to data.
Introduction
Ryan Murray:
So to give us, just to go through a quick outline, what I'm gonna do with you guys today is get our bases covered with the data platform. What do I mean by the data platform? What are they? What are the highs and lows of these things? [And] what is the current state and current challenges with a data platform? And then I'll go ahead and introduce Nessie, and this is gonna be my favorite part of the talk, I think. In the almost twenty years I've been doing this, this is one of the coolest ideas I've ever worked with. And we'll take a look at some use cases. Those will look surprisingly like our challenges, and then we'll wrap up with the current state and where we think Nessie is going.
Data Platforms: Then and Now
Ryan Murray:
So let's get started. Let's talk about data platforms. Now, when I say data platform, I mean the storage and analytics layer for our data. And why do we have data? It's to help us make better decisions. So, back in the day, that was just a single database, smallish datasets, and a handful of analysts. And then that's really evolved over the past forty, fifty years. We had data warehouses in the late nineties, and that really transformed how we deal with data. [And] that really gave us a strong platform for how to deal with data and how to analyze data. In the early 2000s, we started to see the rise of the data lake. That was with Hadoop, and S3, and these sorts of technologies where we could start just dumping data into the data lake and then analyzing it later. And that was a very different approach than the data warehouse. But we managed to find a lot of different architectures, you know, through the 2010s, combining these two with their strengths. So, you know, your lake is where the data lands, and the warehouse is where the analysis happens. In the past two years, we've kind of seen these two ideas start to merge together. So [there's] people talking about stuff like the lakehouse, where the data lake and the data warehouse—there's not as much differentiation between these two things. And I think that this slow convergence is driven by a number of things. Obviously, the exponential increase in data sources has a lot to do with it, as well as a drive to reduce complexity and cost. Cloud-based [solutions]—the cloud is a big driver for that as well. I think clouds are sort of the great unifier. What's the difference between a warehouse and a lake when you're staring at them on the Amazon console? Not much. And finally, an important one for me is this drive towards open formats and open technologies, where we're starting to take the best tools for the job rather than having a one-size-fits-all tool that never really takes care of everything we want it to.
Current State: Convergence of Lake and Warehouse
Ryan Murray:
So that's kind of the journey we've been on for the past thirty, forty years. But you'll notice there are two technologies in here that have been co-evolving. So, if we were to take a look at this timeline again, but sort of rotate it, what would we see? We'd see, again, these two parallel technologies, but they've actually been coming closer together as time progresses. And they've been learning from each other. So we see a lot of tools that existed in the data warehouse from the eighties have been moved over to the data lakes. We're seeing SQL interfaces and transactions and stuff. And then the data warehouses took from the data lake [when] we moved to the cloud. We started being able to import and export from blob storage, this kind of stuff. And we've really gotten to the stage where, as I said before, there's really not a whole lot of difference between these two technologies. And I think over the next few years, we'll actually see those differences completely disappear. And anyone who guesses what I think the technology that's going to unify these two concepts is going to be gets a gold star.
Challenges: Consistency
Ryan Murray:
So, what are some of the challenges we see in our modern data lake today? I think the first one, and probably the most important one, is consistency. If we have analysts looking at a dashboard or reading off a complicated dataset—lots of views, lots of tables, all kinds of stuff like that—how can we make changes to that dataset? Maybe it's our end-of-day batch job when all of our new data comes in. How can we present that to the user in a unified way so that the user never sees an inconsistent set of data? Typically, the way we do that is—there are a lot of different ways you can do that. You can do run IDs. You can do stuff like that. One of the easiest ways to show how it's done is [to] talk about a pointer view where we select from yesterday's data. Maybe that's with a WHERE clause. Maybe that's with a different table name. And then at some point, when ready, we move that pointer over to point to today's data. And that's a really straightforward way. It's a really simple way. But it, like a lot of its other cousins for managing those problems, can be relatively fragile. So if we were to look at this from a developer's point of view, how do developers do this using Git? Well, they would probably create a branch. They would do all of their changes on that branch, and then they would merge it back. They create a feature, they'd test a feature, and then they'd merge the feature back. And that would happen atomically. That's kind of where we want to get to with this consistency problem.
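To make that pointer-view pattern concrete, here is a minimal PySpark sketch. All the table and view names are invented for illustration, and the final swap is exactly the fragile step described above: nothing enforces that today's data is actually complete before the pointer moves.

```python
# Minimal sketch of the "pointer view" pattern (all table/view names are invented).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pointer-view-sketch").getOrCreate()

# Stand-ins for yesterday's and today's batch output.
spark.sql("CREATE TABLE IF NOT EXISTS sales_2021_01_27 AS SELECT 1 AS id, 10.0 AS amount")
spark.sql("CREATE TABLE IF NOT EXISTS sales_2021_01_28 AS SELECT 2 AS id, 20.0 AS amount")

# Analysts always query the stable view name; today it points at yesterday's table.
spark.sql("CREATE OR REPLACE VIEW sales AS SELECT * FROM sales_2021_01_27")

# ... the end-of-day ETL runs, verification happens (or doesn't) ...

# The fragile step: swap the pointer to today's table in one statement.
spark.sql("CREATE OR REPLACE VIEW sales AS SELECT * FROM sales_2021_01_28")
```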
Challenges: Verification
Ryan Murray:
Closely related to consistency is the idea of verification. And this is—how do we know when to move the pointer? How do we know when today's data is actually ready? And if today's data wasn't ready, and we moved the pointer too early, how do we move it back safely and then fix today's data? Right now, that's really challenging. We tend to do this on [an] ad hoc basis. Maybe we have a set of tools to run on today's data before merging it. Maybe we just merge it, move it, and wait to see if our users complain. And if we do have to revert back, how do we do that? Well, it's usually easiest to throw out today's data and completely rerun all the ETL jobs. It's too hard to pick apart these complex connections. And, again, looking at this from a developer's point of view, what's a developer gonna do? Well, they're gonna verify it by raising a pull request, and that pull request is going to be a chance for people to review the code, comment on the code, maybe run some automated tests, that kind of stuff. Then they can make any changes to it before merging. And if they find a problem after merging, what do they do? Well, they revert the change, and it's that easy.
Challenges: Reproducibility
Ryan Murray:
So, the final challenge is reproducibility, and this is a pretty big one. This has developed an entire ecosystem of products, tools, and open-source projects on its own. This is the idea of, how do I reproduce the decision I've made in the past? Whether that's a machine learning model, and we have the concept of MLOps to manage the life cycle of these machine learning models, or it could just be a simpler analysis. How do I go back and say, why did I make this decision? What data did I use to make this decision, and was this a good decision? And that's currently hard to do. We need to find a way to freeze all of our data. Typically, we just can't do that. There's unversioned data, and there are copies of data whose lifecycle we have to manage. And then the queries, the analysis, and the model are all stored in different places—if they're stored at all. And, again, what would a developer do who's using Git? Well, they’d probably just create a tag. You think of a machine learning model or a decision as a release. When you do a release in Git, you usually create a tag to go along with it. Then, when you want to look at that release, you go to v0.1, and that's the state of the world when you made that release.
So Far…
Ryan Murray:
So those are our challenges. Let's stop and just double-check what we've done so far. We've talked about Git and all the ways that Git has affected our daily lives as developers. We've [also] discussed our modern data lake, modern data architecture, and all the different challenges and concepts that go along with it. So, what happens when we want to put those two together? That's where we can finally start to talk about Nessie. After years of development by a number of really, really smart people, I'm excited to introduce to you Project Nessie.
Project Nessie
Ryan Murray:
So, what is Project Nessie? Well, it's a Git-like experience for your data lake. When I say Git-like, it's because it tries to take as much inspiration as it can from Git, but it actually isn't Git. The reasons behind that are reasonably straightforward but a bit technical for right now. I'd be happy to take questions or discuss that later in Slack. But it is Git-like, so there's the concept of branches and tags and commits and merges and all these other concepts that we're familiar with from Git. At the core of that, like in Git, is the idea of a commit. And what is a commit in Nessie? It's simply a transaction. Nessie doesn't care if that transaction involves a single table or multiple tables, and it can even involve many tables and the associated views and materialized views. From Nessie’s perspective, a transaction is a commit. It doesn't matter what is in that transaction. So how does it do that? Well, it does that by maintaining a key-value store of all the information we have about the data lake. So it behaves a lot like Git in that perspective, which is basically a key-value store on disk. But instead of being on disk, Nessie is stored in a cloud KV store, something like DynamoDB. The reason behind that is we wanted to make it cloud-native. We wanted to make it able to fit into our modern ecosystem. This KV store sits behind the Nessie server, and the Nessie server is really just a simple REST API that talks in JSON. We can put that in AWS Lambda or some Docker containers in Kubernetes or something like that. Then it's super easy to scale in and out horizontally. And since it's backed by a cloud KV store, we really don't even have to think too much about how that scales.
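To make the "a commit is just a transaction over a key-value store" idea concrete, here is a small conceptual sketch in Python. This is not Nessie's actual code or API, only a model of the bookkeeping: keys are table names, values are pointers to table metadata, and one transaction, however many tables it touches, becomes one commit.

```python
# Conceptual model only -- not Nessie's implementation or API.
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Commit:
    parent: Optional[Commit]
    # key: table name, value: pointer to that table's metadata
    # (e.g. an Iceberg metadata file location); no data files are stored here.
    tables: dict = field(default_factory=dict)

def commit(head: Commit, changes: dict) -> Commit:
    """One transaction = one commit, regardless of how many tables it touches."""
    return Commit(parent=head, tables={**head.tables, **changes})

# The head of 'main' describes the whole data lake as a map of pointers.
main = Commit(parent=None, tables={"sales": "s3://lake/sales/metadata/v1.json"})

# A multi-table transaction lands atomically as a single new commit.
main = commit(main, {
    "sales":   "s3://lake/sales/metadata/v2.json",
    "returns": "s3://lake/returns/metadata/v7.json",
})
```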
Next, Nessie supports all of our common data tools. Well, right now, it supports Hive and Spark with Iceberg and Delta Lake. But we hope that in the future, Nessie will be supported by a whole range of data tools. And the idea is you just get more value the more tools that use Nessie. It's just a good scale thing. And finally, I think what's cool about Nessie is there's no data copying. So we're doing version control on the data lake, but we're not actually copying much data, [if any]. Everything is stored in metadata on the Nessie server. So we avoid the situation where committing a table means copying the whole table. The table metadata is tracked by Nessie to represent the commit.
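For the Spark-plus-Iceberg path, the wiring looks roughly like the sketch below. The property names, server URL, and warehouse path are assumptions that may differ across Nessie and Iceberg releases, so treat it as a starting point rather than a reference configuration.

```python
# Hedged sketch: registering an Iceberg catalog backed by Nessie in PySpark.
# Property names, the server URL, and the warehouse path are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-iceberg-sketch")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")  # Nessie REST endpoint
    .config("spark.sql.catalog.nessie.ref", "main")                           # branch to read and write
    .config("spark.sql.catalog.nessie.warehouse", "s3://lake/warehouse")
    .getOrCreate()
)

# Reads and writes go through the catalog; Nessie only versions metadata pointers,
# so no data files are copied when a commit is made.
spark.sql("SELECT * FROM nessie.db.sales").show()
```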
Example Workflow: Consistency
Ryan Murray:
So now let's go back to the challenges we talked about earlier, and let's take a look at how that would work in Nessie. So here we have a commit on main, denoted by number one there. That dot is the commit, and it represents the head of the branch 'main,' or you can think of it as the entire state of our data lake as it stands—what all the data, views, everything else is on our data lake right now. So now we kick off our end-of-day ETL job. This is [a] complicated ETL job with, you know, a hierarchy of things that have to happen, with enrichment and derived data and creating new and different views, that kind of stuff. So what happens? Well, this ETL process starts by creating an ETL branch. And like Git, that's a free operation. It happens instantaneously. And then on that ETL branch, all of our ETL jobs run. You can have a Spark job in one region and another Spark job in another region. It doesn't really matter. Everyone is just committing onto the ETL branch, and they're able to do that whether it's a single-table transaction or a multi-table transaction, doesn't matter. So at the end of this ETL job, the third blue commit there represents the end of our ETL job. The ETL branch looks like we want our main branch to look. That's all of our data having been put onto our ETL branch. So what do we do? We simply merge it back into the main branch, and we're done.
Now, the important thing to remember here is that these merges are atomic operations. So, from the perspective of the user, they were querying 'head,' point one up here, they were querying that the whole time. They never knew that the ETL branch existed, and they never knew that stuff was happening on it. At some point, when that merge happens, all of the data from that ETL job is atomically promoted into the main branch. The next user to come along is going to see the complete ETL job run with no possibilities of problems with consistency. You can sort of think of this as our workflow that we discussed earlier, where a developer comes along, creates a feature branch, develops the feature, and commits it back into main when they're ready.
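Continuing the conceptual sketch from earlier (again, not the real API), the whole consistency workflow reduces to pointer bookkeeping: a branch is a named pointer to a commit, creating one is free, and promoting the ETL work is a single atomic move of the 'main' pointer.

```python
# Conceptual continuation of the earlier Commit sketch -- not Nessie's API.
branches = {"main": main}            # 'main' is the Commit built in the earlier sketch

# 1. Creating the ETL branch copies no data, just a pointer.
branches["etl"] = branches["main"]

# 2. ETL jobs (any engine, any region) commit to the ETL branch; readers of
#    'main' never see these intermediate states.
branches["etl"] = commit(branches["etl"], {"sales": "s3://lake/sales/metadata/v3.json"})
branches["etl"] = commit(branches["etl"], {"enriched_sales": "s3://lake/enriched/metadata/v1.json"})

# 3. Atomic promotion: the next reader of 'main' sees the full ETL run or none of it.
branches["main"] = branches["etl"]
```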
Example Workflow: Verification
Ryan Murray:
Now, if we look at the verification stage — it looks like my slides are screwed up — verification happens very similarly. In this case, instead of merging straight back, we're actually going to do some tests. We're gonna do basically a code review on our data. These can be automated tests, or they can be something very similar to the code review that happens on a pull request, whatever. We can imagine that happening on a staging branch, or we can imagine it happening on the ETL branch. But it's a very similar process to what a pull request would look like. And then that change eventually gets merged into main once we're satisfied that all the data we want to be there is actually there. What happens if it's not? Well, it's effectively Git, so we can revert some changes. We can add more changes onto the ETL branch, and once we're verified, we can merge back. There's a lot of flexibility in how we can verify and then fix [things] once they've proven to not be quite right.
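Here is a hedged sketch of what an automated "code review for data" step could look like before the merge. It assumes a second Iceberg catalog, called nessie_etl here, configured like the earlier one but pointed at the ETL branch; the row-count comparison is just a stand-in for whatever validation suite you actually run.

```python
# Hedged sketch: validate the ETL branch before promoting it to main.
# Assumes catalogs "nessie" (ref=main) and "nessie_etl" (ref=etl) are configured.
etl_rows  = spark.sql("SELECT count(*) AS n FROM nessie_etl.db.sales").first()["n"]
main_rows = spark.sql("SELECT count(*) AS n FROM nessie.db.sales").first()["n"]

if etl_rows < main_rows:
    # Keep fixing on the ETL branch; main stays untouched until the checks pass.
    raise ValueError("sales table shrank during ETL; holding the merge")

# Checks passed: merge the ETL branch into main (via the Nessie client or CLI),
# which promotes everything atomically.
```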
Example Workflow: Reproducibility
Ryan Murray:
So finally, what happens with the reproducibility problem? And, I gave you a big hint earlier: we use tags. So, 'training run A' here is a tag of main when the machine learning model was promoted to production. So we promote our machine learning model to production, we create a tag on Nessie, we do an analysis, and before we present that analysis, we create an associated tag. Then, we can always return to that tag, effectively check out that tag, and we have the entire state of the world — everything in Nessie as it stood when we made that decision. So then, if we come along and create some more branches, or we create some more commits, we mutate the data lake, or whatever else, we can always return to training run A. Or we can create training run B, which is where the machine learning specialist comes in and creates a new model, modifies the model, retrains on the data, and then deploys it to production. And that production deploy gets a tag as well. So now, we can always go back to either run, verify what happened in that run, and understand how we made the decisions we made. We can even do things like start to compare training run A to training run B, or apply training run B's model to training run A and see how these models have been progressing over time.
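In the same conceptual model, a tag is just a named pointer that never moves, which is why 'training run A' can freeze the entire state of the lake without copying anything. A sketch, not the real API:

```python
# Conceptual continuation -- tags as immutable named pointers to commits.
tags = {"training-run-a": branches["main"]}     # state of the lake when model A shipped

# The lake keeps moving after the tag is cut...
branches["main"] = commit(branches["main"], {"sales": "s3://lake/sales/metadata/v4.json"})
tags["training-run-b"] = branches["main"]       # state of the lake when model B shipped

# To reproduce a decision, read through the tag instead of the branch head.
sales_seen_by_model_a = tags["training-run-a"].tables["sales"]
sales_seen_by_model_b = tags["training-run-b"].tables["sales"]
```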
Current Status
Ryan Murray:
So, those are some of the use cases. These are some of the powerful things that we can do with Nessie. So, where is Nessie at today? As I said, we can deploy Nessie using a Docker image or an AWS Lambda. We have a variety of back-end storage [options] right now. That includes AWS DynamoDB, which we feel is the strongest database for us right now. We also have one for MongoDB, and we'll soon have one for a generic JDBC database. Along with that, we have clients for Python and Java, and a CLI client, as well as a relatively simplistic UI. Just don't criticize my UI development skills—that's not my forte. But we do have a nice Nessie client for Java and Python, as I said, and we also have a CLI. The CLI looks almost identical to Git. Along with that, we have full support in Iceberg as of the Iceberg release on Tuesday. We have some stuff with Delta Lake. So, we can use Delta Lake right now, but it requires a special build from us until we can get that merged into Delta's upstream. And finally, this is important for us here at Dremio, so you'll start to see this pop up in Dremio.
Future Work
Ryan Murray:
And where are we heading next? Well, we want to add more stuff to Iceberg; there are a lot more features to put into Iceberg, including things like multi-table transactions. We have a lot of work to do yet on Hive, and we hope to get full Delta support, as [well] as many other integrations. Our first focus there is going to be the query engines, so that Presto and Flink and everyone else can query Nessie tables. And finally, we want to look at table management services, and this is basically the care and feeding of your data lake. So compaction tools, ways to manage your Nessie history, garbage collection, all kinds of stuff like that.
Thank You
Ryan Murray:
And so with that, I want to wrap up. Here's a bunch of links. We're going to post those links into Slack before we move into that channel. And, yeah, this is how to get hold of Nessie, how to start interacting with the community. We're really keen on making this a real open-source, community-driven project. This is something we all feel very strongly about, so we're actively searching for developers and committers who are interested in collaborating with us. So, with that, thanks, everyone, for coming, and I hope you enjoyed it.
Live Q&A
Anita Pande:
Thank you, Ryan. Right on time. Very timely. Well done. So, actually, you mentioned a good point about Nessie and how independent it was, etcetera. So, let me just take a question that's related to that. We have a question from Arnaud: How tightly coupled to Dremio is Nessie, or is it totally independent?
Ryan Murray:
It's totally independent. I think, putting my Dremio hat on, it's important that Dremio has the best interface with Nessie. So, that's on Dremio to make sure we can deliver that. As far as Nessie actually goes, the project is a separate entity, and we want to treat it as such.
Anita Pande:
Thank you for that. And we have a question from Stefan as well here. [Stefan's question is:] Do we need a separate Nessie server in addition to a Spark cluster?
Ryan Murray:
Yes. So, it's a bit like your Hive metastore in the sense that you have to run a separate service. But, hopefully, it's a lot lighter and easier to run than a Hive server.
Anita Pande:
Got it. And does Nessie run on [Kubernetes]?
Ryan Murray:
Yes, you can get it as a Docker image. So, we don't have a Helm chart or anything like that, but it's relatively straightforward to set up.
Anita Pande:
And how many branches are possible, and how are merge conflicts resolved?
Ryan Murray:
Ah, merge conflicts—I was waiting for that one. Branches, you can have as many as you want. I think the only limitation is that as long as branches exist, you're never going to be able to delete the underlying data they reference. The branches are what hold the references to the underlying data, so you're really only limited by how much extra data you want to store. But you can keep your entire history if you want. In terms of merge conflicts, merges are handled very similarly to how Git would handle them. If Git comes across a merge conflict, it says, "Sorry, I can't do anything about that. Let me know when it's resolved." Nessie is going to do the same thing. I think the real key, especially with the table management services I mentioned, is doing what the IDEs did with Git: making a really clean, simple, contextual way to resolve conflicts. So that's table management services, and then there are some integrations that we're working on, really defining what merge conflict resolution looks like.
Anita Pande:
Cool. Very cool. Another question here. So, this helps with being able to look back at version one, but if part of the data is deleted in the future, say [due to] GDPR or [something] similar, how does that impact version one?
Ryan Murray:
So, the idea would be that the GDPR-related delete would happen through Nessie. Nessie would then resolve that for you. [This might mean] moving the version one commit to point at version one post-delete or something like that. But in the end, you only delete data if you delete it through Nessie, and then Nessie keeps track of what data is there and what data isn't there.
Anita Pande:
Got it. Cool. Another question on merge conflicts, which is clearly a very popular topic here—just one second. Are merge conflicts only on the metadata level, or do they also occur on the data level?
Ryan Murray:
Right now, a merge conflict is basically any change to the table. So even if you were to have two appends to the same table, which aren't really conflicting changes, Nessie would still call that a conflict. Over time, it will become more fine-grained. So we can say, automatically merge an append, and not worry too much about that. So it should become more fine-grained and more at the individual data row level as Nessie matures.
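In terms of the conceptual Commit sketch used earlier in this transcript, today's table-level rule could be modeled roughly as follows. This is an illustration rather than Nessie's actual algorithm: if both branches changed the same table since their common ancestor, the merge is refused and handed back to the user, just as Git does.

```python
# Illustration of a table-level conflict rule -- not Nessie's actual algorithm.
# Relies on the Commit class defined in the earlier conceptual sketch.
def changed_tables(tip: Commit, ancestor: Commit) -> set:
    """Tables whose metadata pointer differs between the ancestor and the branch tip."""
    return {name for name, ptr in tip.tables.items() if ancestor.tables.get(name) != ptr}

def merge(target: Commit, source: Commit, ancestor: Commit) -> Commit:
    conflicts = changed_tables(target, ancestor) & changed_tables(source, ancestor)
    if conflicts:
        # Like Git: refuse and let the user resolve, rather than guessing.
        raise RuntimeError(f"merge conflict on tables: {sorted(conflicts)}")
    return Commit(parent=target, tables={**target.tables, **source.tables})
```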
Anita Pande:
Makes sense. And is it possible to do data versioning, the proof, on all steps of transformations?
Ryan Murray:
I'm not sure exactly what you mean by that. I think the answer is yes. It might be easier to discuss that a bit more in Slack because I just need a bit more context.
Anita Pande:
Yeah, and Arnaud, I will Slack you Ryan's dedicated speaker channel in just a minute here. Good question. Let's see. Oh, you've got a comment and a recommendation that's kind of fun, from Joshua. He'd like you to consider calling the future architecture "The Lockhouse." I don't know how you feel about that, given we're not big on proprietary names and locks on things.
Ryan Murray:
Well, over here, I actually have a Nessie stuffed animal. So maybe that will come up for the next talk.
Anita Pande:
Very cool. Question from Alex here: How does Nessie compare with Lake FS?
Ryan Murray:
I think they’re actually relatively similar in a lot of ways, though there are some finer differences. Mainly, the primary difference is the way we version the data. But at the end of the day, we’re both trying to version data. I’d be happy to take more questions on that in Slack as well, because that’s a bigger talk.
Anita Pande:
Very cool. How about streaming, which is not batch-oriented like Git?
Ryan Murray:
Streaming––I think we're still developing exactly what streaming means. But if you think about [it], streaming already exists in Iceberg. You can stream into an Iceberg table with Flink. So in that way, writing to a table on the data lake with Iceberg or Flink is a commit. So you can do these streaming commits. It's not a problem at all to have a commit be a very small change, a few rows coming in from a streaming event. I have a ton of thoughts about all the interesting things you can do around streaming, including maintaining your offsets in Nessie or something like that. But yeah, I'd love to talk to you more about that question.
Anita Pande:
Very cool. I think we are at the end of our questions. But we do have––oh. Oh, no, we've got some more good questions here. Is the Delta Lake integration compatible with Databricks?
Ryan Murray:
Not yet. We've had a PR open on the Delta open source project for a couple months now, and we're trying to get that merged into the open source project, and I think that would pull it up into the Databricks project. But we need to get that one merged before we know for sure. But it's something we consider important right now.
Anita Pande:
Cool. Another question here is how much impact does Nessie have on performance measures, on DataOps? And then kind of an extension here is, how large can tables become, and how does parallelizing Spark commands work in conjunction with Nessie?
Ryan Murray:
So Nessie really only manages the metadata. So Nessie shouldn't have any impact at all on the performance of the data queries. So you can have any table size that Iceberg can handle, any amount of history that Iceberg can handle. At the end of the day, we're relying on Iceberg to tell us which data files you need to read for a particular commit. So that's relatively scalable. In terms of commit performance, we've specced out the performance goal for Nessie to be able to do a thousand commits per second. I think we're a little bit short. We're in the four or five hundred commits per second region. But that's kind of the performance we expect to have out of Nessie. And from that perspective, we can handle quite a bit of extra load from there. That's actually why we didn't use Git in the end. Git was five seconds per operation, and that's obviously way too slow.
Anita Pande:
Wow. That does give us a sense of proportionality there, wow. Another question here is: are all these snapshots on the main branch kept, so we can go back to any of them?
Ryan Murray:
That depends on your garbage collection setup. We just merged the initial garbage collection implementation into our master branch about a week ago. The idea is that eventually, we have to garbage collect, eventually, we have to remove commits. You can set it to keep everything for the past month, year, or any other period you choose. The garbage collection will only reap commits older than that. However, at some point, you do have to reap commits, so you can't keep an infinite amount of data. But, yes, it's configurable based on how much you want to store.
Anita Pande:
Interesting. Wow. Well, it looks like––let's give it a little bit here in case we have other questions. I see the Slack channel isn't working. Let me check on that. Oh, okay. Let's see. It looks like we have another question. I've seen the question from Arnaud above regarding DBT versus Nessie. I have the same question. Do you want to maybe elaborate on that again?
Ryan Murray:
Yeah. I'm guessing you mean––
Anita Pande:
Oh sorry, give me a second here. Let me elaborate, ah, it’s not dbt. I've just seen a demo of BDT with Dremio. How do you see the overlap of approaches?
Ryan Murray:
So I think we haven't reached out to the [BDT] community yet. I think that's something we want to do relatively soon. I don't see it as a competition thing. I think Nessie could fit in quite nicely with [BDT] or some other ETL-type data processing tool. I think there are some really interesting collaborations there. We just haven't reached out to the community yet. So I hope we'll do that soon.
Anita Pande:
Cool. And will it be more optimized if the main branch keeps Deltas between versions?
Ryan Murray:
I'd have to think about it. I don't think so. What's being stored in Nessie, thinking about it from the Iceberg perspective, is a pointer to a particular Iceberg snapshot. In Delta, it's a pointer to a particular Delta log entry. So there's not really a delta to store. We're basically saying, at time x, this is the Iceberg table that existed, or at time y, that's the Delta table that existed. The same [goes for] views or something like that. So what's the difference between storing a delta and a snapshot there? Not much. We're really just pointing to a path on your data lake. So, yeah, I think maybe I didn't understand that question. If you want to come on Slack, I'd be happy to go through that more.
Anita Pande:
Yeah. Sounds good. Hey, folks, that's all the time we have right now. I love how popular [and how] good the questions [have been] here.