Distributed Transactions on the Data Lake with Project Nessie

Session Abstract

While database concepts like transactions, commits and rollbacks are necessary for traditional data warehousing workloads, they’re not sufficient for modern data platforms and data-driven companies. Project Nessie is a new open source metastore that builds on table formats such as Apache Iceberg and Delta Lake to deliver multi-table, multi-engine transactions. In this talk we will discuss the transactional model of Nessie and how it can help improve the ETL workflow. We will introduce the recently released Nessie Airflow provider and its use in multi-stage and complex workflows as an example of the power of Nessie transactions. We will finish with a demo and a discussion on the production readiness of Project Nessie.

Video Transcript

Ryan Murray:    Welcome everyone. So first off thanks for coming. I hope it’s going to be a great conference. I think there’s already been some pretty cool announcement stuff, so hopefully we can keep up the pace. With that let’s jump right in. We have a ton to talk about today and I don’t want to waste too much time. So what I want to start out with today is what Nessie is and what motivated it. Some people may have already come across Nessie in some other talks or seen [00:00:30] stuff online. I just want to level set, make sure everyone’s really clear on what Nessie is. And then we’re going to get into a little bit more depth about what transactions are in Nessie and the two types of transactions in Nessie and how they can be used.

We’ll finish off with some examples of using Nessie from Airflow. And this is to use what we’re calling atomic pipeline transactions. We can see how those will work in a real Airflow use case. And then finally, we’ll talk about [00:01:00] the recent past and the future of Nessie, along with plans to GA Nessie. So let’s dive right in. So when we started thinking about Nessie, we were thinking about a lot of the problems… Not necessarily the problems, but a lot of things that were difficult in data lakes. So things that were hard for people to use or hard to make work smoothly. Just general challenges in the data lake world. And we [00:01:30] had experienced these problems ourselves. We talked to a lot of people about these problems. We found that there were basically two clusters of problems. And then looking closer at those two clusters, they were actually the same problem just using different language.

So when you look at it, there is a set of problems that had the database administrator language and a set of problems that was coming more from software engineers. So I’m going to keep on saying the same thing today from the DBA perspective [00:02:00] and from the software engineer perspective. And that’s to really highlight how this is trying to solve several problems at once for several people. So when we looked at the problems that DBAs were having with data lakes, it really came down to ACID transactions. So you can do ACID transactions on single tables. There’s some support for Hive transactions obviously. There’s Iceberg and Delta Lake and stuff, but it’s really hard to do large scale, database wide transactions. [00:02:30] And then when you look at it from the software engineer, the software engineer has all this fun stuff with DevOps that they want to apply to data, to do DataOps. And then there’s CI/CD pipelines. And how can we start treating data like we treat all of our other assets in software engineering? As well as, closely related, reproducibility. And that comes down to machine learning models and decision reproducibility and stuff.

So when we started looking at these two problems, we really saw that they’re actually kind of the same problem. [00:03:00] So first off from the database perspective, we started out in an untamed wild world. When [inaudible 00:03:10] first came around 15 years ago, we really didn’t know how to do anything that databases had been doing for 40 years. So over the course of the past 15 years, we’ve added tables through technologies like Hive. And then we started adding more sophisticated SQL interfaces via Dremio and Trino and these sorts of things. And then most recently we’ve [00:03:30] added table formats. These table formats are things like Iceberg and Delta Lake. And these are really important abstractions that both provide us transactions and raise the level at which data engineers are thinking about problems on the data lake. So for example previously, when doing things on the data lake, we had to think about where does this parquet file live? How big is it? Does it have the same schema as other parquet files? Is it in the right directory, so it’ll get [00:04:00] picked up by partitioning? All this kind of stuff.

And now when we’re talking about table formats, we’re actually starting to think at a level above that. So now, when we’re adding data to the data lake, we can do an insert into an Iceberg table. And then we’re writing data to a table. And that means it’s a lot more intuitive about what’s going on, and Iceberg can start taking care of things like optimal file sizes, making sure it has the same file schema, lives in the right partitions, all that kind of stuff. [00:04:30] It really sort of raises the level at which we can think about problems.

Dave:    Hey Ryan, real quick. I think your slides are not being shared. I think in the upper right hand corner, you may have a share button or something.

Ryan Murray:    I see the YouTube video still. I can see my screen-sharing though. Let me try again. Can you see that?

Dave:    Yeah, let [00:05:00] me see if I… I don’t know why it’s not showing up there. I don’t know if [Dean 00:05:10] can help. I’m sorry.

Ryan Murray:    Yeah, Dean. I think the YouTube video is still being shown.

Dave:    Yeah. Okay. Let me go check with him. I don’t know why it’s not doing that.

Dean:    I’m with you right now. Ryan you’ll have to hit share audio and video, and then within there you’ll see [00:05:30] a space to where you can share your desktop.

Dave:    He had tested it earlier, so I’m surprised it’s not working now.

Ryan Murray:    So everything is shared. And I can see the screen share on my desktop.

Dean:    Right. Wondering why this video has taken over. See if I can… There you go.

Ryan Murray:    [00:06:00] Yes.

Dean:    Gotcha. Okay. Sorry about that.

Dave:    Thank you.

Ryan Murray:    Okay. Sorry about that. So far you haven’t missed much in terms of slides. So if everything’s okay, I think we’ll just continue from here. We were discussing how we can sort of leverage the idea of table formats to really make thinking about data and data lakes a little [00:06:30] bit simpler through abstractions. So when you look at what’s still lacking in the data lake, we can do transactions on single tables using Iceberg, but we can’t do multi table transactions. And that’s really what our database administrators are looking for. So that gave us the idea of trying to go up another level of abstraction. So now that we have this really clear idea of what a table is in the data lake, let’s try and create a collection of tables. So now we’re [00:07:00] thinking about a metastore or a catalog or something like that. So that was the main insight we got from database administrators. They’re looking for effectively a metastore or a database, in the same sense that they have a database on traditional databases.

So now what can we learn from software engineering? So as I said, some of the problems we were seeing all kind of relate back to [00:07:30] being able to version control stuff. So when we look at the rise of technologies like analytics engineering, what it usually comes down to is treating our analytics, our transforms and those sorts of things, as versionable objects. And being able to deploy those and do testing on them and stuff like that. So again, some sort of version control. And when we look at data quality and observability, we can start to think of data quality as unit testing and integration testing for our data. And that’s [00:08:00] most easily done in a sort of CI/CD environment, which involves Git or at least version control.

And again, DataOps. DataOps is bringing all the lessons from DevOps into data. And one of the key components of DevOps as we know is Git. So all of these things require Git. Reproducibility, being able to save a snapshot of your data, you can think of that as a tag. So again, version control. And then finally atomic pipeline transactions. And the idea behind this is being able [00:08:30] to extend a transaction past the database, to be able to do much broader transactions. And as we’ll see, this also involves some sort of version control. So when we’re looking at the problems faced by software engineers and faced by database administrators, we see that database administrators want a catalog. And the software engineers want sort of Git capabilities. So can [00:09:00] we create a catalog that gives Git capabilities? And that’s effectively what Project Nessie is. So you can think of it as a catalog or a metastore with a Git-like experience for the data lake.

I say Git-like for a very clear reason, which we’ll get to in a few minutes. But what it means is we can bring concepts like branches and tags, commits and merges to the data lake. This metastore is also built to be cloud native. So it’s easy to scale out, it has very high performance, and it can be deployed [00:09:30] in a variety of ways on the cloud. And the idea here is to really take advantage of the new technologies and innovations happening in the cloud space and really set this up to succeed in the future. And of course this isn’t going to be very useful if it doesn’t support a lot of common data tools. So currently it supports Iceberg tables, Delta Lake tables and SQL views. We’re not really dealing with tables that don’t have a table format on them, because we really want to encourage people to think about their [00:10:00] data like a database. Like a set of tables being managed by a catalog, rather than trying to manage things at the file level.

And aside from being able to support the table formats and views and materialized views and stuff, we want to integrate with all major data tools. So that’s things like Spark, Dremio, Trino. Automation tools like Airflow, data lineage, data catalogs: everything. The more things that Nessie is able to talk [00:10:30] to, the more capabilities we can start imagining, in the same sort of organic way that DevOps grew all sorts of interesting applications: we think the same thing can happen using Nessie as the base. And of course this Nessie catalog and metastore brings along with it multi table transactions of two varieties.
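For readers who want to try this, here is a minimal sketch of wiring a Spark session to a Nessie server through Iceberg’s Nessie catalog. The package versions, server URI, warehouse path, and table name below are placeholder assumptions, so check the Nessie and Iceberg documentation for the coordinates that match your setup.

```python
# Minimal sketch: a Spark session that uses Nessie as an Iceberg catalog.
# Version strings, the Nessie URI, warehouse path, and table name are
# illustrative placeholders, not exact coordinates to copy verbatim.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("nessie-demo")
    # Iceberg runtime plus the Nessie Spark SQL extensions (versions illustrative)
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark3-runtime:0.11.1,"
            "org.projectnessie:nessie-spark-extensions:0.9.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions")
    # Register a catalog named "nessie" that talks to the Nessie server
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")  # branch this session reads and writes
    .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# From here, tables in the "nessie" catalog behave like ordinary Iceberg tables
# (the sales.orders table is a made-up example):
spark.sql("SELECT * FROM nessie.sales.orders").show()
```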

So I wanted to go back and discuss what we mean when we keep on saying Git-like. And I think this is best [00:11:00] captured in the obligatory xkcd comic. We don’t really want to be Git. And the reason is captured here. Git can be exceptionally complicated and exceptionally hard to understand for a lot of use cases. And what we want to do is bring some of the spirit and some of the capabilities of Git to the data lake, without overwhelming people with having to cherry-pick something out of the reflog into a headless branch or something like that. [00:11:30] So let’s try and take what Git did right, and apply it to the data lake without bringing in a distributed graph theory model.

So with that, we can define a few core things, core terms inside of Project Nessie. And the first one is a commit. And a commit we can think of as completely analogous to a transaction in a database. This is something that changes, updates, modifies one or more tables or views. So it [00:12:00] can be appending to an Iceberg table, updating a view. It can be changing several tables at once. Whatever, something like that. The next thing we want to take from Git is the idea of a branch. And a branch is effectively an ordered list of commits. Unlike Git, this list of commits is always linear. We don’t have any forks or merge commits or any branches in our history. And these things are basically just human-readable names that are connected to a particular hash [00:12:30] in the history of a Nessie database.

Now, one thing important to mention here is that by default, everything happens on the main branch. And the idea here is that Nessie should be… You should only need to use the parts of Nessie that you actually need. So for example, if you’re an analyst, you don’t really care that your database is a Nessie-enabled database. You don’t want to have to think about branches and commits and stuff. You just want to use the database like a database. So that’s why, by default, everything happens [00:13:00] on the main branch. So an analyst, an end user, a consumer, someone from Tableau doesn’t really need to think about all the complexity that Git might bring along. Similarly, if you don’t need branches, if you don’t need these atomic pipeline transactions, you don’t need to use branches and everything can happen on the main branch. And a commit on the main branch is still a transaction in the traditional database sense.

So with that, a tag is basically just an immutable branch. Where you can [00:13:30] add commits to a branch, you can’t add them to a tag. And this is really useful for reproducibility. So we can pin the state of the world at a particular time when an ML model was trained or when our end of year data was analyzed, something like that. And the last thing I wanted to take out of Git was the idea of a merge. So this isn’t the Git merge with… As I said, there’s no merge commits and there’s no forking or anything. This is effectively taking everything from a branch and putting it onto another branch. [00:14:00] So we’re able to make any changes on an ETL branch visible to consumers on main. And that’s done atomically, as if it was in a single transaction.
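As a rough illustration of those concepts, the statements below use the Nessie Spark SQL extensions (loaded in the session sketched earlier) to create a tag, create a branch, and merge the branch back. The exact DDL syntax has shifted between Nessie releases, so treat these statements as illustrative rather than canonical.

```python
# Illustrative tag/branch/merge statements via the Nessie Spark SQL extensions;
# syntax has changed across Nessie versions, so check the docs for your release.

# Pin the current state of the catalog so a model training run or a year-end
# report can always be reproduced against exactly this data.
spark.sql("CREATE TAG year_end_2021 IN nessie")

# Create an isolated branch for ETL work...
spark.sql("CREATE BRANCH etl IN nessie")

# ...and later publish everything committed on it to main in one atomic step.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```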

So that introduces a couple of the potential transaction types that exist in Nessie. So the first one is what we call a single system transaction. And you can think of this as a pure vanilla database transaction. And we can even use the same semantics for it. So this can involve things like isolation levels, [00:14:30] and you can do all the same kinds of things you would do in a single system transaction on a normal database, committing and rolling back and all that stuff.

I think the more interesting transaction that I wanted to talk about today is what we’re calling atomic pipeline transactions or multi-system transactions. And the idea here is that we use the facilities of branches to create what, to the user on the main branch, looks like a single transaction. [00:15:00] So here we create a branch, we call it ETL. And then we’re able to have many different applications, several Spark jobs. They don’t even have to be in the same data center, they can be whatever, all contributing to this transaction. In this example, we have two Spark jobs and two Dremio jobs happening on this branch. And these can happen in parallel or not, and they don’t need to share any context about what’s happening on it. All it is, is performing transactions on a branch. And then once [00:15:30] all these transactions are done, we can do a merge. So it has the same semantics as a database transaction, but it has a much different effect, in that many people can contribute to this transaction.
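A rough sketch of that flow, again assuming the Spark session and SQL extensions from the earlier sketch; the two inserts stand in for independent Spark or Dremio jobs, and the table names are made up for illustration.

```python
# Atomic pipeline transaction sketch: independent jobs commit to the "etl"
# branch without sharing any context; one merge later publishes all of it
# to main at once. Table names are hypothetical.

# Job 1 (could be its own Spark application)
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("INSERT INTO nessie.sales.orders SELECT * FROM staging_orders")

# Job 2 (could be a completely separate application, or another engine)
spark.sql("USE REFERENCE etl IN nessie")
spark.sql("INSERT INTO nessie.sales.customers SELECT * FROM staging_customers")

# Readers on main see nothing until this point, then both changes appear together.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
```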

So first let’s take a little bit of a deeper look into traditional transactions. So here we can see, this has very familiar semantics and it works just like we would expect a database transaction to work. We’re able to introduce the concept of isolation levels. Currently we’re working on read committed, repeatable read, [00:16:00] serializable. And what this means is we can really treat the data lake like a database at this point. We can do commits and we can do transactions. We can commit them and roll them back. And what that will look like is a single commit in Nessie. It’s just that there will be multiple tables or views involved in the same commit.

Currently, this stuff is being worked on. So there’s a pull request in Iceberg to be able to support this, and we’re actively working on it in Dremio. But I expect to start seeing [00:16:30] this stuff very, very soon in productionized systems. The other type of transaction is the pipeline transaction, and this is ready to be used today. And it’s in fact being used today. And the idea behind this is to act just like a feature branch would act in software engineering. So we create a branch, we perform commits on those branches, and then we can do a code review and we can do [00:17:00] integration testing and stuff. And once everyone’s happy we can merge it back. So that’s what’s happening here. In step one here, we have a commit on the main branch and then the data engineer comes along and creates a fork… a branch, excuse me. And then on that branch, they perform a series of commits. While the data engineers are working on the ETL branch, anyone who’s querying the main branch will see commit one. And they don’t know that all this stuff is happening behind the scenes.

So then at [00:17:30] the end of the commit set, that’s in step two, the data engineers can take a look at the data manually. They can review it to make sure everything worked. They can perform data quality analysis on it, which can be automated or not. And at the end of that, we can say, “Fine, it’s ready to go. And it’s merged into main.” So that on main, you’ll see all of these commits, but they’ve all appeared atomically. So that the next time a user comes along and starts querying from the main branch, they’ll see all the new data that’s been added. And they’ll see it as if [00:18:00] it appeared in a single atomic transaction. So it’s not a transaction in the traditional sense, but to the user on main, it behaves very much like a transaction. And it’s really, really powerful because these commits, like I said, they don’t need to be done inside the same context. Anyone can contribute to what’s happening on the ETL branch. As long as it’s treated like a branch would be in software, where it’s code reviewed and data quality checked and all that kind of stuff.

So with that, we wanted to demonstrate [00:18:30] the power of these atomic pipeline transactions. So we went ahead and we built a provider for Airflow. For those of you who aren’t familiar with Airflow, it’s sort of an orchestration tool used by data engineers. And it’s very, very useful in doing large and complex ETL pipelines. So the idea behind Airflow is, you define a DAG in Airflow. A DAG is a directed acyclic graph, and this graph’s nodes are going to be operators. And each operator’s going to perform [00:19:00] some function on the data, whether that’s a transform or an add or delete or whatever.

So what we did was we built this Nessie provider, which provides operators and a connection for Nessie, and it allows us to turn an Airflow DAG into an atomic pipeline transaction. So that a run of a DAG is going to look like a single transaction in Nessie. And this allows us to really start thinking about doing continuous integration on data. So to give a bit more [00:19:30] concrete example, here is the Airflow UI showing a particular DAG. And this DAG is an example DAG from the repository on the previous slide. And we’ll walk through it. First, we create the ETL branch, and this is going to tell Nessie to create a branch. And then we’re going to perform a series of transactions on this branch. And these can be done from Spark jobs, they can be done directly from the Python DAG, remotely, whatever. [00:20:00] Doesn’t really matter where these are happening. The important thing is they’re all happening on the ETL branch that we created in step one.

So all these things are going to happen. And once the materialization is finished building, we’re going to run a data quality check. And this data quality check could be as simple as counting the number of rows that were added. Or it could be an AI model that’s investigating the statistical properties of the data that was added. Or it could be a whole [00:20:30] series of checks, checking all manner of things. And at the end of the day, if these checks pass, we can go ahead and merge into main and delete the branch. So in that way, this looks very much like a continuous deployment, continuous integration pipeline. Where the key element here is being able to do these data quality checks as if they are unit tests or integration tests, and then the passing or failing of that determines whether or not we merge the branch.
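The DAG described here can be sketched with stock Airflow operators, as below. The Nessie Airflow provider ships its own operators and connection for the branch, merge, and delete steps, so the placeholder callables are only stand-ins for those calls; a ShortCircuitOperator plays the role of the data quality gate that decides whether the merge ever happens.

```python
# Sketch of the DAG described above, using only stock Airflow operators so the
# structure is clear. The Nessie provider supplies real operators for the
# branch/merge/delete steps; the placeholder callables below are stand-ins.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def create_etl_branch(**context):
    # Create the "etl" branch via the Nessie provider / REST API (call omitted).
    ...


def transform_orders(**context):
    # Submit a Spark or Dremio job that writes to the "etl" branch, not to main.
    ...


def data_quality_check(**context):
    # Return True only if the new data on "etl" passes your checks, e.g. a simple
    # row count or a full statistical validation; False short-circuits the merge.
    return True


def merge_and_delete_branch(**context):
    # Merge "etl" into main, then delete the branch, via the Nessie provider (calls omitted).
    ...


with DAG("nessie_etl_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    create_branch = PythonOperator(task_id="create_etl_branch", python_callable=create_etl_branch)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    quality_gate = ShortCircuitOperator(task_id="data_quality_check", python_callable=data_quality_check)
    publish = PythonOperator(task_id="merge_and_delete_branch", python_callable=merge_and_delete_branch)

    create_branch >> transform >> quality_gate >> publish
```

If the quality check returns False, the ShortCircuitOperator skips the downstream publish task, so nothing from the ETL branch ever reaches main.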

And then just to illustrate how this looks, I’ve used the Spark SQL extensions [00:21:00] to show the log of this particular pipeline run. So we can see the commit message for each of these operations. And you can even see how they’re not necessarily in order, depending on when Airflow scheduled them. As well as that, we can get a pile of metadata about the Airflow actions. So we can see the run ID and the task ID for each one of these commits. Which means that if something goes wrong, we know the hash for each particular task ID and we can actually go back and revert [00:21:30] those commits, and then rerun parts of the pipeline. So integrating this commit information into the Airflow pipeline is really, really powerful. If there are problems with the Airflow pipeline, you can find them, and you can then revert and re-run very simply from inside of the Nessie context.
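For reference, inspecting that log from Spark looks roughly like this, assuming the Nessie SQL extensions are loaded; the statement syntax is illustrative and may differ slightly between Nessie versions.

```python
# Show the commit log of the "etl" branch, including commit hashes and the
# message/metadata the Airflow provider attached (run ID, task ID), which is
# what makes targeted reverts and re-runs possible.
spark.sql("SHOW LOG etl IN nessie").show(truncate=False)
```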

As well, this information is exceptionally useful when coupled with data lineage and data quality checks. There’s all kinds of possibilities [00:22:00] to start imagining how Nessie and data lineage and data quality might work together. So this is sort of a beta release of this provider. We’d love to get some feedback on how it’s working for you and what features we can add to it, to make it more useful. At the very least, we want to make sure this is better integrated into Airflow’s retry mechanism. And we want to do this with a lot of other orchestration tools. And finally, it’s really exciting to start thinking about, as I said, all the things we can do with lineage [00:22:30] and data quality. So that’s something we want to do next as well.

And with that, I want to talk about the GA of Nessie. So we want to GA Nessie in autumn. This would mean we have API and storage compatibility guarantees, first-class support for our table formats and database backends, and our authentication and authorization flow. This means that this can be used in production as of September. And then for future work we have a lot of stuff going on, with a lot [00:23:00] of tools to integrate and a lot of people to work with, to really see what kind of interesting applications we can build with Nessie. And with that, thank you very much for listening. Here’s some links if you want to contact me on Twitter or email, and check out the project. We’re looking for both partners to work with the software and developers to help us build. Thanks very much.

Dave:    Great. Thanks Ryan. Okay, let’s go ahead and open it up for Q and A. Folks if you have any questions, use the button in the upper right side to [00:23:30] share your audio and video. You’ll automatically be put into a queue. And if for some reason you have trouble with that, you can ask your questions in chat: I’ll try to get it there. Have a look and see what we got here. Let’s see here. All right. Well, let’s see. Good stuff. Good stuff. Good stuff. And let’s see if I can actually see any of the questions [00:24:00] in the queue. I don’t see that actually. I apologize if you guys are asking questions in there, I don’t see that for some reason. Okay.

Ryan Murray:    It’s either a really good sign or a really bad sign there’s no questions.

Dave:    Yeah. [00:24:30] There’s one question in there, I’m just looking for… Okay, well here I’ll just look in the Q and A section. Okay. It says, “Will Nessie be available in the community version?”

Ryan Murray:    The community version of Dremio? I’m not actually sure. I think I’d have to check with product. If you ping me the question in Slack, I’ll get someone to answer that for you.

Dave:    And there’s another question from [Ben 00:24:55], “Can you talk about merge conflicts and how they are resolved?”

Ryan Murray:    Of course, [00:25:00] that’s always my favorite question. So merge conflicts right now are at the table level. So if you try and change a table in two different branches, you’re going to get a conflict. Currently, Nessie behaves very much like Git, in that it just says, “You have a conflict, sort it out.” Which means in this case you’re going to have to go down to the underlying table library and sort it out. So if you’ve appended data in two branches to an Iceberg table, the fix is relatively easy and you can use Iceberg to [00:25:30] resolve the conflict. If you’ve changed the schema in two different branches, or the partitioning scheme in two different branches, you’re going to have a bit more of a problem working that out, with or without Iceberg tools. So it’s similar to Git. It’s really the responsibility of the underlying tools to try and sort out those conflicts. But I think it’s our responsibility [00:26:00] as Nessie developers to provide the tools and the documentation for that. And I think that’s something that’ll be part of GA or soon after.

Dave:    Okay, now these questions are coming in fast and we only have about three minutes. So let’s see if we can get them all: “How would you compare Nessie to lakeFS, and what are the differences?”

Ryan Murray:    I think the main difference between Nessie and lakeFS is the level at which we’ve put our abstractions. So lakeFS is looking at things at a very granular [00:26:30] file level. And we’re looking at things from a high level, sort of this catalog level. So we’re at opposite ends of the spectrum in terms of abstractions, but we do a lot of the same things. And I think what it really comes down to is your specific application; I think lakeFS will be easier for some use cases and Nessie for others. Obviously I prefer the catalog level abstraction, but it comes down to personal taste I think.

Dave:    I [00:27:00] think this is a related question, “What is the granularity of versioning on the datasets controlled and versioned by Nessie?”

Ryan Murray:    So it’s at the Iceberg manifest level or metadata level, I suppose, or the Delta commit level. So if you append a pile of data to an Iceberg table, all you’re really going to version is the fact that Iceberg table A moved from snapshot B to snapshot [00:27:30] C. You won’t know necessarily which files were added, or which were deleted, or which rows changed or anything like that. Of course you can back that out from the Iceberg metadata, but in reality, Nessie is only tracking the fact that Iceberg has a new snapshot.

Dave:    Okay. Rapid fire, “Is it on the roadmap to integrate with DBT?”

Ryan Murray:    DBT, I’m really, really interested in integrating with DBT and I would love to have someone from DBT reach out to me if [00:28:00] they have any ideas and want to collaborate. I think there’s a lot of opportunities there.

Dave:    Yeah. I think we’ll figure that out. Let’s see, “Does Nessie implement the Spark catalog API? Is it Hive metastore compatible?”

Ryan Murray:    It’s not Hive metastore compatible. I’d love to see if anyone has any opinions on that. It’s been a really tricky one for us. Ping me on the Slack channel. And for the Spark catalog, it basically uses Iceberg’s or Delta’s libraries and both of those implement a catalog. So yes.

Dave:    Okay. I don’t know how [00:28:30] much time left. We have seconds. Can you diff two branches?

Ryan Murray:    Yes.

Dave:    Okay. All right [inaudible 00:28:37]. Okay, so folks, that’s all the time we have for questions. If we didn’t get to your question, you’ll have an opportunity to ask Ryan in his Slack channel. So go to the SubSurface Slack workspace. But before you leave, we’d appreciate it if you would fill out the super short Slido session survey in the upper right hand side. And the next sessions are coming up in just five [00:29:00] minutes, in the meantime you can go check out the Expo Hall. Thanks to everyone and I really hope you enjoy the rest of the show. And nice job Ryan. Thank you.

Ryan Murray:    Thanks a lot for coming everyone.

Dave:    Okay. I think that’s it. All right Ryan, I think we’re gone and we’re done. We had at one point [00:29:30] over 200 people live.

Ryan Murray:    Oh, that’s cool.

Dave:    Yeah-

Ryan Murray:    You know how that compares to the other?

Dave:    Well, you’re the first round, so… Oh, I didn’t look at the other sessions. So I don’t know. But 200 is a good number. I don’t know how it compares, but it feels good.

Ryan Murray:    I’m happy with that. I’m always going to think I can do better, but I’m certainly happy with 200.

Dave:    Yeah. Oh, and [Lawrence 00:29:56] says, “We can still hear you.”

Ryan Murray:    All right [00:30:00] then.

Dave:    I know they’ll cut this part off for the video, but hi everyone.

Ryan Murray:    Thanks [Andrea 00:30:08]. All right, I’m going to stop embarrassing myself and get off this before I say something I’ll regret.

Dave:    Yeah me too. All right.

Ryan Murray:    Thanks a lot for the intro, [Dave 00:30:19].

Dave:    All right, have a good night.