

Subsurface LIVE Winter 2021

A Git-Like Experience for Data Lakes

Session Abstract

While traditional data warehouse concepts like transactions, commits and rollbacks are necessary in a SQL context, they're not sufficient for modern data platforms, which involve many users and data flows. A new open source project, Project Nessie, decouples transactions and makes distributed transactions a reality by building on the capabilities of table formats like Apache Iceberg and Delta Lake to provide Git-like semantics for data lakes. By using versioning concepts, users can work in an entirely new way, experimenting or preparing data without impacting the live view of the data, opening a whole world of possibilities for true DataOps on the data lake. At the same time, they can leverage classic data warehousing transactional concepts in their data lake including three levels of isolation across all tables and the classic START…COMMIT primitives.

Presented By

Ryan Murray, Open Source Engineer, Dremio, Project Nessie & Apache Arrow Committer

Ryan Murray is an open source engineer at Dremio in the office of the CTO. He previously worked in the financial services industry, in roles ranging from bond trader to data engineering lead. Ryan holds a PhD in theoretical physics and is an active open source contributor who dislikes it when data isn't accessible in an organisation. He is passionate about making customers successful and self-sufficient, and still dreams of one day winning the Stanley Cup.


Webinar Transcript

Anita Pandey:

Hello everyone. My name is Anita Pandey, and I'll be your moderator for this session on a Git-like experience for data lakes. I'd like to welcome Ryan Murray, who's an open source engineer at Dremio and a key committer on Project Nessie and Apache Arrow. Before we get started, there are a couple of housekeeping [00:00:30] items I'd like to review. For live Q&A, please note that we'll do that 10 minutes before the end of the session. You can activate your mic and camera for the Q&A with the button in the top right corner of the screen, and I will put you in the question queue.

For those of you who prefer to submit your Q&A via chat, you can do that throughout the session and we will line those up towards the end of the [00:01:00] call. And finally, Ryan will also have a dedicated speaker Slack channel that I will post in the Q&A for those of you who'd like to follow up and engage with him after his presentation. And last but not least, we have an exciting expo hall, so I urge you to go get demos and deep dives with those folks, experience Dremio in a virtual sandbox that we have for a limited time, [00:01:30] and participate in our fun giveaways. With that, Ryan, take it away.

Ryan Murray:

Thanks very much, Anita. And thanks a lot everyone for coming. I'm really excited to be here at Subsurface and to finally talk about Nessie. To get going, I wanted to set the stage by taking a little bit of a walk back in time, to the time before Git, when you were developing software. If you were around then, if you had to use Subversion and CVS [00:02:00] and these kinds of tools, you know that it was not always easy to use version control. It was expensive to create branches, it was very expensive to merge, and you were never really certain that all the changes you thought were on your main branch were actually there. And then Git came along, and that really changed everything for a lot of people.

It suddenly made everything easy. Branches were free, it was easy to get code reviewed, and subsequently developer productivity, [00:02:30] code quality, everything improved. And then that revolution kind of went everywhere. We started seeing our configuration being stored in Git. And eventually we started deploying things from Git, so our entire application stack was stored in Git.

Recently, we've seen infrastructure as code, where the definition of our network and our Linux servers and our databases and everything is stored in Git. So Git really has [00:03:00] taken the development world by storm. But there's one place it actually hasn't gone, and that's data. And I'm hoping that by the end of the talk you'll believe that Project Nessie is going to be the thing we can say has taken a Git-like experience to data.

Just to go through a quick outline, what I'm going to do with you today is establish our basis with the data platform. What do I mean by the data platform? What are they, what are the highs and lows with these [00:03:30] things, and what are the current state and current challenges of a data platform? Then I'll introduce Nessie; this is going to be my favorite part of the talk. In the almost 20 years I've been doing this, this is one of the coolest ideas I've ever worked with.

And then we'll take a look at some use cases, which will look surprisingly like our challenges. And then we'll wrap up with the current state and where we think Nessie's going.

So let's get started. Let's talk about data platforms. [00:04:00] Now, when I say data platform, I mean the storage and analytics layer for our data. And why do we have data? It's to help us make better decisions. Back in the day, that was just a single database, small-ish data sets and a handful of analysts. And that's really evolved over the past 40 or 50 years. We had data warehouses in the late 90s, and that really transformed how we deal with data and gave us a strong platform for dealing with and analyzing [00:04:30] data.

And in the early 2000s, we started to see the rise of the data lake, with Hadoop and S3 and these sorts of technologies, where we could start dumping data into the data lake and analyzing it later. That was a very different approach from the data warehouse. But through the 2010s we managed to find a lot of different architectures combining the two and their strengths: the lake is where the data lands, the warehouse is where [00:05:00] the analysis happens.

And in the past two years, we've seen these two ideas start to merge together. People are talking about things like the lakehouse, where there's not as much differentiation between the data lake and the data warehouse. I think this slow convergence is driven by a number of things. Obviously the exponential increase in data sources has a lot to do with it, as does the drive to reduce complexity and cost. [00:05:30] The cloud is a big driver for that as well. I think clouds are sort of the great unifier. What's the difference between a warehouse and a lake when you're staring at them on the Amazon console? Not much.

Finally, and an important one for me, is this drive towards open formats and open technologies, where we're starting to take the best tools for the job rather than having a one-size-fits-all tool that never really takes care of everything we want it to. That's kind of the journey we've been on for the past 30 or 40 years. [00:06:00] But you'll notice there are two technologies in here that have been co-evolving. So if we were to take a look at this timeline again, but just rotate it, what would we see?

We'd see again these two parallel technologies, but they've actually been coming closer together as time progresses, and they've been learning from each other. A lot of tools that existed in the data warehouse from the 80s have been moved over to the data lakes; we're seeing SQL interfaces and transactions and so on. And then the [00:06:30] data warehouse took from the data lake: we moved to the cloud, we started being able to import and export from blob storage, that kind of stuff.

We have really gotten to the stage where, as I said before, there's not a whole lot of difference between these two technologies. And I think over the next few years we'll actually see those differences completely disappear. And anyone who guesses which technology I think is going to unify these two concepts gets a gold star. So what are some of the challenges [00:07:00] we see in our modern data lake today? I think the first one, and probably the most important, is consistency. If we have analysts looking at a dashboard or reading off a complicated data set, lots of views, lots of tables, all kinds of stuff like that, how can we make changes to that data set? Maybe it's our end-of-day batch jobs where all of our new data comes in. How can we present that to the user in a unified way so that the user never sees an inconsistent set of data?

Typically, there are a lot of different ways you can do that. You can use run IDs, [00:07:30] you can do stuff like that. And one of the easiest ways to show how it's done is to talk about a pointer view, where we select from yesterday's data, maybe with a WHERE clause, maybe with a different table name, and then at some point, when we're ready, we move that pointer over to point at today's data. That's a really straightforward, really simple way. But like a lot of its cousins for managing these problems, it can be relatively fragile.
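To make that pointer-view workaround concrete, here is a minimal PySpark sketch of it; the table and view names (sales_20210127, sales_20210128, sales_current) are hypothetical, and a persistent view like this assumes a metastore-backed Spark session.

    # Minimal sketch of the pre-Nessie "pointer view" workaround.
    # Table and view names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pointer-view-demo").enableHiveSupport().getOrCreate()

    # Analysts always query the stable view, never the dated tables directly.
    spark.sql("CREATE OR REPLACE VIEW sales_current AS SELECT * FROM sales_20210127")

    # ... end-of-day batch jobs load and enrich sales_20210128 ...

    # When today's data is judged ready, swap the pointer in one statement.
    # If today's data turns out to be bad, the only recourse is to point the
    # view back at yesterday's table and rerun the ETL, which is what makes
    # this pattern fragile.
    spark.sql("CREATE OR REPLACE VIEW sales_current AS SELECT * FROM sales_20210128")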

So if we were to look at this from a developer's point of view, how would a developer do [00:08:00] this using Git? Well, they would probably create a branch, do all of their changes on that branch, and then merge it back. They'd create a feature, test the feature, and then merge the feature back. And that would happen atomically. That's where we want to get to with this consistency problem. Closely related to consistency is the idea of verification. How do we know when to move the pointer? How do we know when today's data is actually ready? And if today's data wasn't ready and we moved the pointer too early, how do we [00:08:30] move it back safely and then fix today's data?

Right now, that's really challenging. We tend to do this on an ad hoc basis. Maybe we have a set of tools to run on today's data before merging it. Maybe we just merge it, move it, and wait and see if our users complain. And if we do have to revert back, how do I do that? Well, it's usually easiest to throw away today's data and completely rerun all the ETL jobs; it's too hard to pick apart these complex connections. And again, looking at this from a [00:09:00] developer's point of view, what's a developer going to do? Well, they're going to verify it by raising a pull request. That pull request is going to be a chance for people to review the code, comment on the code, maybe run some automated tests, that kind of stuff, and then they can make any changes before merging. And if they find a problem after merging, what do they do? Well, they revert the change, and it's that easy.

So the final challenge is reproducibility. And this is a pretty big one; it has spawned an entire ecosystem of products and tools and open [00:09:30] source projects on its own. This is the idea of: how do I reproduce a decision I've made in the past? Whether that's a machine learning model, where we have the concept of MLOps to manage the life cycle of these machine learning models, or just a simpler analysis. How do I go back and say, why did I make this decision? What data did I use to make this decision, and was it a good decision? [00:10:00] That's currently hard to do. We need to find a way to freeze all of our data, and typically we just can't do that. There's versioned data, there are copies of data whose life cycle we have to manage. And then the queries and the analysis and the model are all stored in different places, if they're stored at all.

And again, what would a developer who's using Git do? Well, they'd probably just create a tag. Think of a machine learning model or a decision as a release: when you do a release in Git, you usually create a tag to go along with it. And then [00:10:30] when you want to look at the release, you go to v0.1, and that's the state of the world when you made that release.

So those are our challenges. Let's stop and double-check what we've done so far. We've talked about Git and all the ways that Git has affected our daily lives as developers. And we've discussed our modern data lake, our modern data architecture, and all the different challenges and concepts that go along with it. So what happens when we want to put [00:11:00] those two together? That's where we can finally start to talk about Nessie. So with that, after over a year of development from a number of really smart people, I'm excited to introduce to you Project Nessie.

So what is Project Nessie? Well, it's a Git-like experience for your data lake. And when I say Git-like, it's because it tries to take as much inspiration as it can from Git, but it actually isn't Git. The reasons behind that are reasonably straightforward, [00:11:30] but a bit technical for right now, so I'd be happy to take questions or discuss that later in Slack. But it is Git-like, so there's the concept of branches and tags and commits and merges and all these other concepts that we're familiar with from Git. And at the core of that, like in Git, is the idea of a commit. And what is a commit in Nessie? It's simply a transaction.

And Nessie doesn’t care if that transaction involves a single table or if it involves multiple tables. And it can even involve [00:12:00] many tables and the associated views and materialized views. From Nessie’s perspective, a transaction is a commit. It doesn’t matter what is in that transaction.

So how does it do that? Well, it does that by maintaining a key-value store of all of the information that we have about the data lake. It behaves a lot like Git in that respect, which is basically a key-value store on disk. But instead of being on disk, Nessie's store is in a cloud [00:12:30] KV store, something like DynamoDB.

And the reason behind that is we wanted to make it cloud native, to make it able to fit into our modern ecosystem. This KV store sits behind the Nessie server, and the Nessie server is really just a simple REST API that talks JSON. So we can put that in an AWS Lambda, or in Docker containers on Kubernetes, or something like that, and it's then super easy to scale in and out horizontally. [00:13:00] And since it's backed by a cloud KV store, you really don't even have to think too much about how that scales.
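Because the server is just JSON over REST, any HTTP client can talk to it. Here is a minimal sketch, assuming a local server on the default port and a v1 /trees endpoint for listing references; the exact path and response shape are assumptions and may differ between releases.

    # Hedged sketch: list the branches and tags a local Nessie server knows about.
    # The port, API path, and response fields below are assumptions about the v1 REST API.
    import requests

    resp = requests.get("http://localhost:19120/api/v1/trees")
    resp.raise_for_status()
    for ref in resp.json().get("references", []):
        print(ref.get("name"), ref.get("hash"))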

Next, Nessie supports all of our common data tools. Right now it supports Hive and Spark with Iceberg and Delta Lake, but we hope that in the future Nessie will be supported by a whole range of data tools. The idea is that you get more value the more tools use Nessie; it's just a good scaling thing.
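As a rough illustration of the Spark-plus-Iceberg integration, here is a hedged PySpark session configuration pointing an Iceberg catalog at a Nessie server; the package coordinates, property names (catalog-impl, uri, ref, warehouse) and paths are assumptions that vary by Iceberg and Nessie release, so check the project docs for your versions.

    # Hedged sketch of wiring Spark + Iceberg to a Nessie server.
    # Versions, property names and paths below are assumptions, not a recipe.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("nessie-demo")
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark3-runtime:0.11.0")
        .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
        .config("spark.sql.catalog.nessie.ref", "main")  # branch this session reads and writes
        .config("spark.sql.catalog.nessie.warehouse", "s3a://my-bucket/warehouse")
        .getOrCreate()
    )

    # Tables written through the "nessie" catalog become commits on the "main" branch.
    spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.sales")
    spark.sql("CREATE TABLE IF NOT EXISTS nessie.sales.transactions (id BIGINT, amount DOUBLE) USING iceberg")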

And finally, I [00:13:30] think what's cool about Nessie is there's no data copying. We're doing version control on the data lake, but we're not actually copying much data, or any data. Everything is stored in metadata on the Nessie server. So we avoid the situation where, if I want to commit a table, I have to copy the whole table; the table metadata is tracked by Nessie to represent the commit.

So now let's go back to the challenges we talked about earlier, [00:14:00] and let's take a look at how that would work in Nessie. Here we have a commit on main, denoted by number one there; that dot is the commit, and it represents the head of the main branch. You can think of it as the entire state of our data lake as it stands: all the data, the views, everything else that's on our data lake right now.

So now we kick off our end-of-day ETL job. This is a complicated ETL job with a hierarchy of things that have to happen, with enrichment [00:14:30] and derived data and creating different new views and that kind of stuff. So what happens? This ETL process starts by creating an ETL branch, and like in Git, that's a free operation that happens instantaneously. Then on that ETL branch, all of our ETL jobs run. You can have a Spark job in this region, another Spark job in that region, it doesn't really matter. Everyone is just committing onto the ETL branch. And they're able to do that whether it's a single-table transaction [00:15:00] or a multi-table transaction, it doesn't matter.

So at the end of this ETL job, the third blue commit there represents the end of our ETL run. The ETL branch now looks the way we want our main branch to look; that's all of our data having been put onto our ETL branch. So what do we do? We simply merge it back into the main branch and we're done. The important thing to remember here is that these merges are atomic operations. So from the [00:15:30] perspective of the user who had commit number one up here, they were querying that the whole time; they never knew that the ETL branch existed and they never knew that stuff was happening on it. At some point, when that merge happens, all of the data from that ETL job is atomically promoted into the main branch, and the next user to come along is going to see the complete ETL job run, with no possibility of [00:16:00] consistency problems.
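Here is a hedged sketch of that branch-and-merge flow using the Nessie Python client; the pynessie call names and signatures are assumptions (treat it as pseudocode against the client), and in practice the ETL jobs themselves would be Spark sessions whose catalog ref property points at the etl branch.

    # Hedged sketch of the ETL-branch workflow; pynessie method names and
    # argument order are assumptions and may differ between client versions.
    import pynessie

    client = pynessie.init()  # assumed helper that reads the Nessie server config

    # 1. Branching is a metadata-only operation: instant, no data copied.
    client.create_branch("etl", "main")  # assumed signature: new branch from main

    # 2. The end-of-day Spark/Hive jobs run with their Nessie catalog "ref"
    #    set to "etl"; analysts keep querying "main" and see none of it.

    # 3. When the etl branch looks the way main should look, merge it back.
    #    The merge is atomic: the next query on main sees the whole ETL run.
    client.merge("etl", "main")  # assumed signature: merge etl onto main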

You can sort of think of this as the workflow we discussed earlier, where a developer comes along, creates a feature branch, develops the feature and commits it back into main when they're ready. Now, if we look at the verification stage (and it looks like my slides are a bit screwed up), verification happens very similarly. In this case, instead of merging straight back, we're actually going to run some [00:16:30] tests. We're going to do, basically, code review on our data. These can be automated tests; they can be something very similar to the code review that happens on a pull request, whatever.

We can imagine that happening on a staging branch, we could imagine it happening on the ETL branch, but it's a very similar process to what a pull request should look like. The data change eventually gets merged into main once we're satisfied that all the data we want to be there is there. And what happens if it's not there? Well, it's effectively Git, so we can revert some changes, we can add more changes [00:17:00] onto the ETL branch, and then once we've verified, we can merge back. There's a lot of flexibility in how we can verify and then fix things once they've proven to not be quite all right.
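As an illustration of that "code review for data" step, here is a hedged sketch of a pre-merge validation check; it assumes a Spark catalog (here called etl_cat) configured like the earlier example but with its Nessie ref set to the etl branch, and the table and column names are hypothetical.

    # Hedged sketch of a pre-merge validation step on the ETL branch.
    # "etl_cat" is assumed to be a Spark catalog whose Nessie ref is "etl";
    # the table and column names are hypothetical. "spark" is the active session.
    bad_rows = spark.sql(
        "SELECT COUNT(*) FROM etl_cat.sales.transactions WHERE amount IS NULL"
    ).collect()[0][0]

    if bad_rows:
        raise RuntimeError("Validation failed on the etl branch; fix or revert before merging to main")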

So finally, what happens with the reproducibility problem? I gave you a big hint earlier: we use tags. So TrainingRunA here is a tag of main from when the machine learning model was promoted [00:17:30] to production. We promote our machine learning model to production, and we create a tag in Nessie. We do an analysis, and before we present that analysis, we create an associated tag. Then we can always return to that tag, effectively check out that tag, and we have the entire state of the world, everything in Nessie, as it stood when we made that decision.

So then if we come along and create some more branches or some more commits, we [00:18:00] mutate the data lake, whatever else, we can always return to TrainingRunA. Or we can create TrainingRunB, which is when the machine learning specialist comes in and creates a new model, modifies the model, retrains on the data, and then deploys it to production. That production deploy gets a tag as well. So now we can always go back to either run to verify what happened and how we decided to make the decisions we made. And we can even do things like comparing TrainingRunA to TrainingRunB, or [00:18:30] applying TrainingRunB's model to TrainingRunA's data, and see how these models have been progressing over time.
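And here is a hedged sketch of that tag-based reproducibility flow; create_tag is an assumed pynessie method, and runA stands for a Spark catalog configured like the earlier one but with its Nessie ref set to the tag, so the table names are again illustrative.

    # Hedged sketch of tag-based reproducibility; create_tag is an assumed
    # pynessie method, and "runA" is a Spark catalog whose Nessie ref is the
    # tag "TrainingRunA". Table names are illustrative.
    import pynessie

    client = pynessie.init()
    client.create_tag("TrainingRunA", "main")  # freeze the state of main at promotion time

    # Later: read the training inputs exactly as they were at that tag.
    features = spark.table("runA.ml.training_features")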

So those are some of the use cases; these are some of the powerful things we can do with Nessie. So where is Nessie at today? As I said, we can deploy Nessie using a Docker image or an AWS Lambda. We have a variety of backend storage layers right now. That includes AWS DynamoDB, which we feel is the strongest database for us right now; we also have one for [00:19:00] MongoDB, and we'll soon have one for a generic JDBC database. Along with that, we have clients for Python and Java, and a CLI client, as well as a relatively simplistic UI. Just don't criticize my UI skills, that's not my forte.

But we do have a nice Nessie client for Java and Python, as I said, and we also have a CLI. The CLI looks almost identical to Git. [00:19:30] Along with that, we have full support in Iceberg as of the Iceberg release on Tuesday. We have some stuff with Delta Lake, so you can use Delta Lake right now, but it requires a special build from us until we can get that merged into Delta's upstream. And finally, this is important for us here at Dremio, so you'll start to see this pop up in Dremio.

Where are we heading next? Well, we want to add more to Iceberg; there are a lot more features to put into Iceberg, including things like multi-table transactions. [00:20:00] We have a lot of work to do yet on Hive, and we hope to get full Delta support, as well as many other integrations. Our first focus there is going to be the query engines, so that Presto and Flink and everyone else can query a Nessie table.

And finally, we want to look at table management services. And this is basically the care and feeding of your data lake. So compaction tools and ways to check out your Nessie history, obviously garbage collection, all kinds of stuff like that. [00:20:30] And so with that, I want to wrap up. Here’s a bunch of links. We’re going to post those links into Slack before we move into that channel. And yeah, this is how to get a hold of Nessie, how to start interacting with the community.

We’re really keen on making this a real open source community driven project. This is something we all feel very strongly about. So we’re actively searching for developers and committers who are interested in collaborating with us. So with that, thanks everyone for coming and hope you enjoyed it.

Anita Pandey:

[00:21:00] Thank you, Ryan. Right on time. Very timely. Well done. You actually mentioned a good point about Nessie and how independent it is, et cetera, so let me take a question related to that. We have a question from Arnaud: how tightly coupled to Dremio is Nessie, or is it totally independent?

Ryan Murray:

It's totally independent. Putting my Dremio hat on, I think it's important that Dremio has [00:21:30] the best interface with Nessie, so that's on Dremio to make sure we can deliver. As far as Nessie actually goes, Nessie the project is a separate entity, and we want to treat it as such.

Anita Pandey:

Thank you for that. A question from Stefan here is, do we need a separate Nessie server in addition to a Spark cluster?

Ryan Murray:

Yes. It's a bit like your Hive metastore in the sense that you have to run a separate service. But hopefully it's a lot [00:22:00] lighter and easier to run than a Hive server.

Anita Pandey:

Got it. And does Nessie run on Kube?

Ryan Murray:

Yes. You can get it as a Docker image. So we don’t have a helm chart or anything like that, but it’s relatively straightforward to set up.

Anita Pandey:

And how many branches are possible and how are merge conflicts resolved?

Ryan Murray:

Ah, merge conflicts. I was waiting for that one. Branches, you can have as many as you want. I think the only limitation is that if you have branches, you're never going to be able [00:22:30] to delete the underlying data; the branches are what hold the references to the underlying data. So you're really only limited by how much extra data you want to store, but you can keep your entire history if you want.

In terms of merge conflicts, merges are handled very similarly to how Git would handle them. If Git comes across a merge conflict, it says, "Sorry, I can't do anything about that. Let me know when it's resolved." And Nessie's going to do the same thing. I think the real key, especially with the table management services I mentioned, is [00:23:00] doing what the IDEs did with Git, and that's making a really clean, simple, contextual way to resolve conflicts. So that's table management services, and some of the integrations we're working on, really defining what merge conflict resolution looks like.

Anita Pandey:

Cool. Very cool. Another question here. So this helps with being able to look back at version one, but if part of the data is deleted in the future, say GDPR or some [00:23:30] similar context, how does that impact that version one?

Ryan Murray:

So the idea would be that the GDPR-related delete would happen through Nessie, and then Nessie would resolve that for you. Maybe that means moving the version one commit to point at version one post-delete, or something like that. But in the end, you only delete data if you delete it through Nessie, and then Nessie keeps track of what data is there and what data isn't there.

Anita Pandey:

Got it. Cool. [00:24:00] Another question on merge conflicts, clearly a very popular topic here, just one second here: are merge conflicts only on the metadata level, or on the data level as well?

Ryan Murray:

Right now a merge conflict is basically any change to the table. So even if you were to have two appends to the same table, which aren't conflicting changes, Nessie would call that a conflict. Over time it will become more fine-grained, so we can [00:24:30] say, automatically merge an append and not worry too much about that. It should become more fine-grained, and more at the individual data row level, as Nessie matures.

Anita Pandey:

Mm-hmm (affirmative). Makes sense. And is it possible to do data versioning, the proof on all steps of transformations?

Ryan Murray:

I’m not sure exactly what you mean by that. I think the answer is yes. Maybe it’d be easier to discuss that a bit more in Slack because I just need a bit more context.

Anita Pandey:

[00:25:00] Yeah. Arnaud, I will Slack you Ryan's dedicated speaker channel. Just a minute here. Good question. Let's see. Oh, you've got a comment and a recommendation. That's kind of fun. So from Laurent, he'd like you to… Is it Laurent? Oh, Joshua, my bad. So Joshua recommends that you call the future architecture the Lochhouse. [00:25:30] I don't know how you feel about that, given we're not big on proprietary and locks on things.

Ryan Murray:

Well over here I actually have a Nessie stuffed animal, so maybe that’ll come up for the next talk.

Anita Pandey:

Very cool. Question from Alex here, how does Nessie compare with lakeFS?

Ryan Murray:

I think they’re actually relatively similar in a lot of ways and [00:26:00] there are some finer differences. The primary difference is the way that we version the data. But at the end of the day, as it were, we’re both trying to version data. I’d be happy to take more questions on that in Slack as well, because that’s a bigger talk.

Anita Pandey:

Very cool. How about streaming? That is not batch like Git.

Ryan Murray:

Streaming, I think we're still developing exactly what streaming means. But if [00:26:30] you think about it, streaming already exists in Iceberg, so you can stream into an Iceberg table with Flink. In that way, writing a table to the data lake with Iceberg or Flink is a commit, so you can do streaming commits. It's not a problem at all to have a commit be a very small change, a few rows coming in from a streaming event. I have a ton of thoughts about all the interesting things you can do around streaming, including maintaining your offsets in Nessie or something like [00:27:00] that. But yeah, I'd love to talk to you more about that question.

Anita Pandey:

Very cool. I think we are at the end of our questions, but we do have… oh no, we've got some more good questions here. Is the Delta Lake integration compatible with Databricks?

Ryan Murray:

Not yet. We've had a PR open on the Delta open-source project for a couple of months now, and we're trying to get that merged into the open source [00:27:30] project. I think that would pull it up into the Databricks project, but we need to get that one merged before we know for sure. It's something we consider important right now.

Anita Pandey:

Cool. Another question here is how much impact does Nessie have on performance measures on data ops? And then kind of an extension here is, how large can tables become and how does parallelizing Spark commands [00:28:00] work in conjunction with Nessie?

Ryan Murray:

So Nessie really only manages the metadata, so Nessie shouldn't have any impact at all on the performance of the data queries. You can have any table size that Iceberg can handle, any amount of history that Iceberg can handle. At the end of the day, we're relying on Iceberg to tell us which data we need to read for a particular commit, so that's relatively scalable. In terms of commit performance, [00:28:30] the goal for Nessie is to be able to do 1,000 commits per second. I think we're a little bit short of that; we're in the maybe 400 or 500 commits per second region.

But that’s kind of the performance we expect to have out of Nessie. And from that perspective we can handle quite a bit of extra load from there. That’s actually why we didn’t use Git. In the end Git was five seconds per operation, and that’s obviously [00:29:00] way too slow.

Anita Pandey:

Wow. That does give us a sense of proportionality there. Another question here is, are all of these snapshots on the main branch kept so we can go back to any of them?

Ryan Murray:

That depends on your garbage collection setup. We just merged the initial garbage collection implementation into our master branch about a week ago, and the idea there is that eventually we have to garbage collect; eventually we have to remove commits. So you can set [00:29:30] "keep everything for the past month" or "keep everything for the past year" or whatever else, and then the garbage collection will only reap commits that are older than that. At some point you do have to reap commits, or you're just going to keep an infinite amount of data. But yeah, it's configurable how much you want to be able to store.

Anita Pandey:

Interesting. Wow. Well, let’s give it a little bit here in case we have other questions. I see the Slack channel isn’t working. Let me check on that. [00:30:00] Oh, okay. Let’s see. It looks like we have another question. I’ve seen the question from Arnaud above regarding DBT versus Nessie. I have the same question. Do you want to maybe elaborate on that?

Ryan Murray:

Yeah, I guess you mean-

Anita Pandey:

Oh, sorry. Give me a second here, let me elaborate. I've seen it: "I've just seen a demo of [00:30:30] DBT with Dremio. How do you see the overlap of approaches?"

Ryan Murray:

So we haven't reached out to the DBT community yet; I think that's something we want to do relatively soon. I don't see it as a versus, competition thing. I think Nessie could really fit in quite nicely with DBT or some other ETL type of data processing tool. I think there are some really [00:31:00] interesting collaborations there. We just haven't reached out to that community yet, so I hope we'll do that soon.

Anita Pandey:

Cool. And will it be more optimized if the main branch keeps deltas between versions?

Ryan Murray:

I'd [00:31:30] have to think about it. I don't think so. What's being stored in Nessie, thinking about it from Iceberg's perspective, is a pointer to a particular Iceberg snapshot; if you think about Delta, it's a pointer to a particular Delta log entry. So there's not really a delta being stored so much as a pointer. We're basically saying, at time X this is the Iceberg table that existed, or at time Y that's the Delta table that existed.

Same with views [00:32:00] or something like that. So what's the difference between storing a delta and a snapshot there? Not much, really; we're just pointing to a path on your data lake. I think maybe I didn't understand that question, so if you want to come on Slack, I'd be happy to go through that more.

Anita Pandey:

Yeah. Sounds good. Hey folks, that's all the time we have right now. I love how popular this was, and such good questions here.

Ryan Murray:

[00:32:30] Thank you so much everyone.

Anita Pandey:

Sorry about that?

Ryan Murray:

I said thank you very much to everyone for coming and for asking questions.

Anita Pandey:

Yes. Thank you and great questions.

Again, I will send you the Slack channel; I'll just troubleshoot that. But I wanted to remind you, before you all depart, that we do have the expo hall open. So please do go there to experience Dremio in a virtual sandbox, it's really quick and fun, and also to do a deep dive with some of our [00:33:00] folks in the booths and to get some fun giveaways. So with that, have a great rest of your day. And thank you so much, Ryan, once again, for this fascinating new breakthrough.

Ryan Murray:

Thanks for having me. Take care everyone.

Anita Pandey:

Bye now.