March 2, 2023

12:20 pm - 1:00 pm PST

The State of Apache Iceberg

In this panel on Apache Iceberg, Iceberg developers will discuss recent developments in Iceberg, exciting new use cases enabled by those developments, and the future of the project. We will also discuss how developers can benefit from open table formats, and from scalability and cost savings through Apache Iceberg, as well as the state of the community and adoption. We’ll also review where lakehouses do, and don’t, fit in.

Topics Covered

Iceberg
Keynotes


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.


Tony Baer:

Good afternoon, good morning. Thank you for taking time out of your day to join us. This is a session in which we’re looking at the state of Apache Iceberg, and we’re looking at the Lake House. We have quite an interesting panel of guests here. I will have to apologize to you in advance: we are having some technical difficulties today, where at least we are not able to see each other, so the conversation will probably be a little more disjointed than we would like it to be. But we’ve got a pretty good diversity of opinions here, and I think you’ll still find the discussion pretty compelling. So why are we having this discussion? That’s a question I’m going to throw out to the panel, but first I’m going to throw out a few opening thoughts.

What Is the Data Lake House?

Some say the Lake House, the Data Lake House, which is a term I think we’ll probably debate here, is about open choice through an open table format. My take, though, and I’ve recently concluded some fairly in-depth research on the market landscape, is that the real driver has been the need to get better confidence in the data that we have in our Data Lakes. And that really starts with Iceberg. We’re not talking about turning this into a transaction database, but the idea is: the data that we are using from the Data Lake, is it current? Is it consistent? Is it not corrupted? Until now, if you wanted that level of assurance, you needed to write it into the application, which was not necessarily the type of thing that data scientists or data engineers want to do.

They don’t want to be running their own transaction engines. And by the way, that same concern has animated the discussion about data mesh; if you were on Subsurface earlier today, Jean opened the day and talked about that topic, which, from what I’ve seen on LinkedIn, has been pretty much the hottest topic in data management over this past year. But the thing is, I’ve also seen rising interest in the Data Lake House. So one of the questions that I think really needs answering is: why are we having this discussion now? My take is that over the past year, we’ve really started to see commercial ecosystems solidify around this. Now, we’re still fairly early on with regard to market awareness.

I’ve spoken with lots of practitioners, and there’s a lot of lack of clarity about what Lake Houses are, or whether that’s really the appropriate term; we’ll talk about that. So demand is still pretty latent, but definitely, where there’s smoke, there’s fire. What we are not going to do in this session is a competitive bake-off or smackdown between the different open source Data Lake table formats. We are going to focus specifically on Iceberg. This is not necessarily meant to be an endorsement of Iceberg, but we do have the creator, or the co-creator, of Iceberg here, so I think you will be getting some endorsement from some quarters. So without further ado, I’d like to introduce our panel, and we’re going to go one by one. As I said, unfortunately our video is not working, so we cannot see each other, but you can hopefully see us. Where I’d first like to start is with Russell Spitzer. He is a software engineer for Apple Cloud Services. Russell, can you introduce yourself and give us a little bit about your background? And we promise not to ask you anything about the next model of iPhone.

How Did Russell Spitzer Get Started In Big Data?

Russell Spitzer:

Alright, well, like I said, I’m Russell Spitzer. I got into the big data world starting at DataStax, which was a Cassandra company. I spent a lot of time with Spark and Cassandra and integrations between the two. Then I joined Apple, where I now pretty much exclusively work on the integration of Apache Iceberg and a lot of other technologies. So moving from one open source technology to another, with slightly lower latency or higher…

Tony Baer:

Okay, thanks Russell. Next on our panel is a person we can blame for all of this, who is actually having a real red-letter day. So I’d like to introduce Ryan Blue. You are a co-creator of Apache Iceberg and the CEO and co-founder of Tabular, and evidently you’re having quite a red-letter day today. I will later ask you how you came to develop Iceberg, but right now I’d like to know a little bit more about your background, and tell us a little bit about what’s happening with your company today, the red-letter day for you.

Ryan Blue’s Big Data Background At Netflix

Ryan Blue:

Thank you. Well, I guess I’ve been in the data space for a long time now, way back to my government days. Then I moved to Cloudera, and then Netflix, where we started Apache Iceberg. After doing the Iceberg rollout and production deployment at Netflix, we thought other people could really use a lot of help with this. So we created Tabular, and Tabular, as of yesterday actually, is open to the public. You can sign up at Tabular.io. So we’re really excited to have our launch happening and to be this far along.

Tony Baer:

Congratulations, I’m sure the paparazzi are lining up at your door as we speak. And last but definitely not least, and this speaks to what I really want to talk about, the ecosystem that’s forming around Iceberg, is Robert Stupp. He is an open source engineer at Dremio, a company you may have heard of, and also the tech lead for Nessie. Robert, I will ask you shortly about Nessie, but right now just tell us more about your background.

Robert Stupp and Working For Dremio

Robert Stupp:

Yeah. So currently I’m working, as I said, for Dremio. It’s basically all things open source Nessie, which is of course at the core of Dremio Arctic. So my role is mostly about being a tech lead and, of course, committing a lot of code. But it’s a lot about the user and developer experience, and how the various query engines, the integrations on Nessie, and especially Apache Iceberg play together. And of course, thinking about, planning, and implementing the next things on our roadmap.

Tony Baer:

On to creating Iceberg. Give us the background: what really drove this, and what were your intent and your design goals in doing this?

Ryan Blue:

Thank you. Really quickly, I think we may have skipped Jordan.

Tony Baer:

Oh, I’m sorry about that. Gosh, how did I do that? Thanks.

Jordan, my apologies to you. Jordan, this is what happens when you don’t have video. We have Jordan Tigani, who’s the founder and CEO of MotherDuck, which I’m sure is all about wetlands preservation, but I have a feeling that’s not what you do. I think this has a lot more to do with DuckDB. So can you tell us a little bit more about your background?

How Did Jordan Tigani Get Into Data Through Google?

Jordan Tigani:

Sure. I was one of the engineers that helped create Google BigQuery. I worked on that for a long time, and I worked on the storage side of things for a while. So I’m very interested in and excited about the things that are now in Iceberg and in Tabular. There was a bit of a war over whether the Data Warehouse or the Data Lake should reign supreme, and it seems like we’re working on some architectures to give you the best of both worlds. I’m now the founder and CEO of a company called MotherDuck, which is doing serverless analytics based on DuckDB.

Tony Baer:

Okay. Sounds good. And my apologies, Jordan; you and I have actually crossed paths quite a bit during your previous life over at Google. Okay, getting back to our regularly scheduled program. Ryan, give us a little background on the origins of Apache Iceberg. What drove this, what was your approach, and give us a little more color and background about what we’re talking about here.

The Origins of Apache Iceberg

Ryan Blue:

We created Iceberg to address the three main challenges that we had at Netflix at the time. Usability was very, very poor, because you couldn’t count on your tables having a certain behavior, either when writing to those tables or when trying to maintain them over time, evolving schema, things like that. We also had constant performance challenges with the structure of the old tables, and we felt that very keenly because we were on S3 rather than HDFS and everything took 10 times as long in the cloud, which is why I think we got to this on the early side. And then just general correctness issues: the old formats, and some of the new formats, just have correctness issues; they’re just not as reliable.

And people really don’t like having operations that only kind of complete, or only kind of give certain guarantees. So we really wanted to fix all those problems. But also a goal from the start was to do this in a way that other people could adopt and build on and really invest in. And that’s what excites me the most about being here at this conference: there are so many different people talking about Iceberg and so many different vendors supporting it these days. I really think that we’re finally delivering on that goal of being a project that the broad community feels like they can invest in. It’s technically there, and on the community side as well it’s welcoming and open to anyone who wants to contribute.

Tony Baer:

I think that’s what made this area very interesting to me, and why I dove into the research over the past six months or so: there really has been a sharp pickup in activity, and in many cases it has been very broad-based. I would like to segue to Russell, as you actually are using Apache Iceberg. Ryan spoke about the key requirements that he was trying to deal with: usability, the fact that table performance was pretty flaky, and not always being sure of having the right data. I want to ask you, why did you decide to use Iceberg, and what were the goals and benefits that you were looking for?

What Are the Benefits of Using Apache Iceberg?

Russell Spitzer:

Hmm. So obviously I can’t go into too many specifics on this, but as a general user coming to look at Iceberg, it solves a lot of problems that you just can’t figure out when you have file-based tables without a lot of ancillary machinery around them. Being able to have full ACID transactions is really important, especially if you want to start handling cases where you want to atomically change data that’s already in your table. So when we see use cases for things like GDPR, where you may want to change only 10 rows in a single file, you need to be able to atomically swap out that file. You can’t rely on guarantees that the file system is always going to correctly do something, or that the network is okay, or that any particular transaction goes through. So it’s really captured a lot of interest from users. I think because of that, it makes a whole new set of things possible at a scale of data where they used not to be possible.
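Note: as a rough illustration of the atomic row-level changes Russell describes, here is a minimal PySpark sketch. The catalog name “demo” and the table names are hypothetical, and the session is assumed to already be configured with an Iceberg catalog.

```python
# A minimal sketch of an atomic, GDPR-style row-level delete on an Iceberg
# table from PySpark. Catalog and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-delete").getOrCreate()

# Iceberg rewrites only the affected files and publishes the change as one
# atomic snapshot swap, so concurrent readers never see a half-applied delete.
spark.sql("DELETE FROM demo.db.events WHERE user_id = 'user-to-forget'")

# Every commit becomes a snapshot in the table's history.
spark.sql(
    "SELECT committed_at, snapshot_id, operation FROM demo.db.events.snapshots"
).show()
```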

Tony Baer:

Yeah. I mean, put it this way, would you consider this to be the best of both worlds between Data Lake and Data Warehouse?

Russell Spitzer:

I mean, those terms mean a lot of things to a lot of different people, but I really think it’s getting much closer to the idea that you can scale up the SQL that developers are already familiar with onto a system that will scale pretty much independently, is based on open software, and can completely separate compute and storage, which I think are all key factors in why it’s so attractive.

What Does Lake House Really Mean?

Tony Baer:

Right. So, Jordan, I want to turn to you, because you’re approaching this from a very different perspective. What does the Lake House mean to you? Do you see value, or do you not see value? Give us a sense of where you’re coming from.

Jordan Tigani:

Sure. I think one thing that’s disappointing about the proliferation of Data Lakes has been the standardization on file access: if we’re sharing data, and what we have to share with each other is a file, that gives so little flexibility and freedom for the actual implementer of the query engine, and of the storage system as well, to be able to innovate and to give you high performance. What I like about the Lake House is that it starts to raise the level of abstraction above the level of a file to really a table interface.

And so I think where the world needs to be is that you standardize on a table, which you interact with through SQL, rather than interacting directly with files. Obviously, if you have multiple different systems interacting on something like a Data Lake, eventually you’re going to have to have access to the physical files. But at some point, what the actual format of the file is becomes a lot less important. When you can give these higher-level primitives, you can give ACID transactions, you can give the ability to coalesce and compact data in the background, and to repartition data so that you can get good performance. As soon as you raise that level of abstraction, you open up the door to a lot more innovation, both on the storage side and on the query side. And to me, that’s what I like about systems like Iceberg.
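Note: one way this table-level abstraction surfaces in practice is Iceberg’s Spark stored procedures for maintenance. A sketch under the same hypothetical “demo” catalog as above; compaction and snapshot expiry go through the table interface rather than touching files directly.

```python
# Compact many small files into ~128 MB files without changing table contents.
spark.sql("""
    CALL demo.system.rewrite_data_files(
      table => 'db.events',
      options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire old snapshots (and the files only they reference) past a retention window.
spark.sql("""
    CALL demo.system.expire_snapshots(
      table => 'db.events',
      older_than => TIMESTAMP '2023-02-01 00:00:00'
    )
""")
```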

Tony Baer:

Do you see any drawbacks?

Are There Any Drawbacks to Using a Lake House?

Jordan Tigani:

Drawbacks? On the other hand, you still have to give people access to files. Take Parquet, for example: there has been this Parquet version two, and it would be awesome if we could move to that, but the fact that everybody is still interacting with files directly means that it’s hard to innovate on the format side of things. And there are some security-level things that are harder to do: dynamic data masking, column-level encryption, some of those types of things. I think that we’ll eventually get there. Also, metadata is actually one of the really important things in a data system, and making sure that you can have rapid metadata updates is important for being able to do things like real-time updates. My guess is that these are problems that are going to be solved, but I think they are some of the drawbacks with some of the current tech.

Tony Baer:

Right. And this is where I’d like to bring in Robert, because one of the interesting phenomena we’re seeing, at least with the open source Lake Houses, is that we’re starting to see a strong ecosystem evolve, and in some cases subprojects, if you want to call them that, and you’ve been involved with Nessie. Jordan was just talking about the importance of metadata. So can you tell us where something like Nessie would fit in?

What Is Nessie and Why Is It Important To Lake House?

Robert Stupp:

Of course. So first of all, Nessie was built around supporting a lot of table formats, but in the end, the Apache Iceberg community was very open and welcoming of the contributions from Nessie, and that’s how Iceberg basically became the table format in Nessie. With respect to what makes the Lake House with Nessie so important or so special, I think it’s a combination of things. For example, things that people would naively expect or desperately need, like multi-table transactions: you can manipulate a ton of tables in a single transaction, which is what people are used to from relational databases, right? Begin a transaction, update many tables, then commit, or just throw it all away with a rollback. Data Lakes didn’t have that. It’s like having your salary paid into your bank account, but your bank account balance isn’t updated, so you can’t even buy a single drink. What’s the point of that?
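Note: a rough sketch of that begin/commit/rollback pattern as it looks with Nessie branches in Spark SQL. It assumes the Nessie Spark SQL extensions are loaded; the catalog name “nessie” and the table names are hypothetical.

```python
# "Begin": work on an isolated branch instead of main.
spark.sql("CREATE BRANCH IF NOT EXISTS payday IN nessie FROM main")
spark.sql("USE REFERENCE payday IN nessie")

# Update several tables; main is untouched while this is in flight.
spark.sql("UPDATE nessie.hr.salaries SET paid = true WHERE month = '2023-03'")
spark.sql("UPDATE nessie.hr.balances SET balance = balance + 100.0 WHERE month = '2023-03'")

# "Commit": merging publishes both changes to main atomically.
spark.sql("MERGE BRANCH payday INTO main IN nessie")
# "Rollback" would instead be: DROP BRANCH payday IN nessie
```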

Tony Baer:

I’d prefer to have a transaction system on a bank account that only records deposits and no withdrawals.

Robert Stupp:

I think there are other, maybe nasty but necessary, requirements, like all the regulatory or legal requirements to know, okay, how did your whole Data Lake look at a certain point in time, as a consistent view across all the tables. For people familiar with Git, it’s comparable to a tag: you put a stamp on your whole Data Lake, and Nessie makes it possible to query exactly the state of the whole Data Lake at that point in time.
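Note: the Git-style tag Robert describes might look roughly like this with Nessie’s Spark SQL extensions; the tag and table names are hypothetical.

```python
# A tag pins one consistent state of every table Nessie tracks.
spark.sql("CREATE TAG end_of_fy2022 IN nessie FROM main")

# Later, e.g. for an audit, query every table exactly as it was at the tag.
spark.sql("USE REFERENCE end_of_fy2022 IN nessie")
spark.sql("SELECT count(*) FROM nessie.finance.ledger").show()
```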

Tony Baer:

Well, let’s get a little specific here, because this points to the need for ancillary projects. Ryan, maybe I’ll have you jump back in for a second: as I understand it, Iceberg collects a fair amount of metadata. And Robert, I’ll come back to you on this, but with the way Iceberg collects metadata, can’t you do time travel today with just plain Iceberg, without any other projects?

Does Apache Iceberg Support Data Time Travel?

Ryan Blue:

That’s entirely correct. You can time travel, and you can also tag older versions at a table level rather than at the Warehouse level. You can also branch and essentially stack up some writes, then fast-forward your main branch after you’ve audited that the data is correct. For newer use cases like that, I think there is an open question here between whether it’s better to do that at the metadata level, where you’re capturing the state of the entire Warehouse, or at the individual table level. But I think we all agree that we definitely need those multi-table transactions, and in this area, like Jordan said, metadata is incredibly important, and getting to the point where we can do more of these operations is definitely an area of tremendous effort and work right now.
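Note: for reference, here is what plain-Iceberg time travel looks like in Spark SQL (Spark 3.3+ syntax; the “demo” catalog and table names are hypothetical).

```python
# Query the table as of a wall-clock time.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2023-03-01 00:00:00'"
).show()

# Or pin an exact snapshot id taken from the table's own metadata.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM demo.db.events.snapshots ORDER BY committed_at LIMIT 1"
).first()[0]
spark.sql(f"SELECT * FROM demo.db.events VERSION AS OF {first_snapshot}").show()
```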

Tony Baer:

Now, following on what Ryan was talking about, Robert, I’ll come back to you: where would Nessie pick up where Iceberg leaves off?

Where Does Nessie Pick Up For Iceberg?

Robert Stupp:

Yeah, so as mentioned, there’s this multi-table transaction thing, but also data experiments: answering questions like, how would my Data Lake or my production data look if I used a different schema on my tables? If it doesn’t work, throw it away. If it works, maybe merge it back to your production main branch. But I think what also comes into play, and it may sound a bit ahead of its time, but maybe not actually, is CI/CD for data. I think that’s a really, really important topic: having automated data validations with automated publishing of that data. You don’t want your boss to be the one to figure out that your data doesn’t agree; that’s maybe not the best thing to have.

Tony Baer:

Interesting. Anyway, we had a fairly animated discussion about this last night when we were just hooking up and syncing, so I want to bring it up right now: is Lake House really the operative term here, or is this just marketing buzz? We’ll start with you, Ryan, and then we’ll go through each of you folks and get what you think.

Is Lake House The Correct Term?

Ryan Blue:

So I think that’s a good question. I’ve found that the main problem is that we don’t agree on a common definition of what it means; people see different things in the term. So I always like to clarify: what do you mean by Lake House? Earlier we were talking about the good parts of a Data Lake and the good parts of a Data Warehouse, and I think that joining those two things together is a perfectly valid definition of Lake House that I like, because the goal of Iceberg is to do exactly that, right? It’s to give you the SQL behavior and guarantees of the Data Warehouse world, with the flexibility of using any engine and decoupling your storage from your compute of the Data Lake world.

On the other hand, I think it’s often broadly applied: we’re a Lake House, we build a Lake House, Iceberg is a Lake House, this other thing is a Lake House. And I don’t know that any one of those definitions is actually correct, right? Like, Iceberg is a data format for very structured data, and most people say a Lake House architecture, I guess now we’re saying Lake House architecture, so it’s a different term still, includes unstructured data. So it’s very nebulous to me, and I like to just be specific.

Is Data Lake House Easy to Start?

Tony Baer:

Well, put it this way. Okay, I’m going to ask each of the others, but this makes me wonder: is the Data Lake House, in terms of how it’s currently envisioned, really the revenge of the SQL nerd? I mean, yes, we can run Python routines on it, that’s in there, but the whole table structure is premised on a relational table structure. So I’m going to ask each of you, and Jordan, we’ll start with you: do you see this as being the revenge of the SQL nerd, or is this something which is available to all newcomers?

Jordan Tigani:

So there’s a great paper written by Michael Stonebraker and Joseph M. Hellerstein called What Goes Around Comes Around, and it basically talks about these endless circles in the database world on a number of things. One of those is schema-on-read versus schema-on-write. The Data Lake was based on schema-on-read, which is, you basically write whatever garbage you want, and you can sort of see where I come in on this, but then when you read it, you can apply whatever schema you want to it. And it turns out that if you want to actually have reliability and understanding of your data, you want to make sure that only the right things get in. So you want schema-on-write. It’s almost like dynamic versus static typing in programming languages: there are these wars where once you go too far on one side, things tend to swing back to the other side. And so I really see the Lake House as a movement towards schema-on-write versus schema-on-read. And you can do SQL over either one of them, though SQL is probably a little bit more natural on the schema-on-write side.
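Note: a small sketch of the contrast Jordan draws, with hypothetical paths and names. The JSON landing area imposes structure only at read time, while the Iceberg table declares its schema and enforces it on write.

```python
from pyspark.sql import functions as F

# Schema-on-read: the landing files accept anything; structure is only
# imposed, rightly or wrongly, when someone reads them.
raw = spark.read.json("s3://example-bucket/landing/events/")

# Schema-on-write: the Iceberg table declares its schema up front, so a
# non-conforming write fails at write time rather than poisoning readers.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events_clean (
      user_id    string,
      event_time timestamp,
      amount     double
    ) USING iceberg
""")

clean = raw.select(
    F.col("user_id").cast("string"),
    F.col("event_time").cast("timestamp"),
    F.col("amount").cast("double"),
)
clean.writeTo("demo.db.events_clean").append()
```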

Tony Baer:

Right. Russell, as you’ve delved into the Lake House, what’s your take on that?

Russell Spitzer:

I mean, I like hearing from everybody else. Basically, when I hear Lake House, I’m usually thinking about the users I’m interacting with directly, who are seeing something and want to know something specific about their data. And really what they want is to be able to do the same things they could do on relational databases, on a data size that is enormous. Basically, they can’t fit their data in Postgres, and now they want something that behaves pretty much the same, so they need a set of technologies that lets them do that. I definitely agree that schema-on-write is critical to actually make this work at scale. It’s very, very difficult to write the optimizations you need to really make querying big data possible without knowing a lot ahead of time.

Like, I think the key to big data and the Data Lake House is the ability to not read all of your data all the time. And the only way you can do that is with metadata layers and things like that, so you can avoid reading in the first place. I think that’s the killer feature coming through all of this. And whether or not anything is or isn’t a Data Lake House is hard for me, because it’s more just that I’m talking to a user and I say, well, these are the technologies that together will satisfy your goals. That’s where I am on it.
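Note: a sketch of the metadata Russell is referring to. Iceberg tracks per-file, per-column bounds, which is what lets a filtered query skip files without opening them; the “demo” catalog and table names are hypothetical.

```python
# Each entry in the files metadata table carries per-column min/max stats.
spark.sql("""
    SELECT file_path, record_count, lower_bounds, upper_bounds
    FROM demo.db.events.files
""").show(truncate=False)

# A filter on event_time is checked against those bounds during planning, so
# files that cannot contain matching rows are pruned without being opened.
spark.sql("""
    SELECT count(*) FROM demo.db.events
    WHERE event_time >= TIMESTAMP '2023-03-02 00:00:00'
""").show()
```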

Tony Baer:

It’s interesting. It’s really been a pendulum swing. And I’m sometimes convinced, and it’s not just in data, that nothing truly new has happened; we just reinvent each time with some new knowledge and some new wisdom. So when Big Data came in, it was, oh, schema-on-read, just get the data in there, because as soon as you transform it you’ve already reduced it and lost some fidelity, versus what we’re talking about here, which is repeatability. So Robert, I’d like to finish up with you on this question: what’s your take on this?

Robert Stupp:

I pretty much agree with everything the previous speakers said. Adding the database things, assuming we are talking about structured data, things that you can put in tables and columns, and maybe add indexes or materialized views on top, so adding these database things to a Data Lake: if that’s an explanation for Data Lake House, at least in my opinion, maybe. But I would also like to add that we are talking about new and fancy things, but there’s also the ability to remove all the burden of manual or heavily scripted file system maintenance stuff. If there’s something that can take these boring, repetitive, and potentially error-prone things away from users, so that they’re free to do the things they want to do, I think that’s something a Lake House could really do, and where people could really benefit.

Tony Baer:

Okay, we’ll see how much time we have here, but I have one question I’m just burning to ask. I’m going to bring in everybody here, and it’s going to be a complete surprise, but we like to live dangerously. Which is: if we accept, for the moment, that we use the term Data Lake House, will it replace the Data Lake, or will it replace the Data Warehouse? Or will there still continue to be a reason for each of them? I’m just going to go in order of how we see the gallery here. Russell, we’ll start with you on that.

Will Data Lake House Replace Data Lake or Data Warehouse?

Russell Spitzer:

Yeah. The big trend, I think, in almost all of this is this layer disappearing from end-user knowledge. It just eventually gets to a point where they don’t know what’s going on. So regardless of what terminology we have, I feel like that’s something that folks like us on this panel will be talking about a lot, but what we end up presenting to end users is going to be much simpler; they’re not going to know the details of how this is all working under the hood. And I think that’s why there are so many companies out here trying to do this Data Lake House-as-a-service or Warehouse-as-a-service, because managing all this and knowing those finer technical details is, I think, pretty niche. It’s probably going to be similar, in the future, to how many people use SQL with a relational database versus how many actually understand how the internal query optimizers and file layouts actually work, like in Postgres versus MySQL or all that stuff.

So I see it vanishing away from public perception, but our debates will probably still be very vigorous and we’ll be yelling about this for a long time.

Tony Baer:

Ryan, your turn: will Lake Houses replace or supplement Data Lakes and/or Data Warehouses?

Ryan Blue:

First of all, I want to steal Russell’s answer, because he’s absolutely right. That was one of the reasons why we created Iceberg: to make these responsibilities, the file management responsibilities that Jordan was talking about, disappear. We should be able to automate all that stuff just like a real grown-up. And so that’s definitely the trajectory of the Data Lake world.

Tony Baer:

Will it replace the Data Lake, and will it replace the Data Warehouse?

Ryan Blue:

Well, I think it is replacing the Data Lake. You will still need some things; the idea of dumping a whole bunch of unstructured data in S3 isn’t going to go away, right? What we will see is that we’re going to stop losing structure. We’re going to share tables with one another instead of files, and we’re generally going to be better about holding onto the structure that we have, but you still have video files, and you still have unstructured things that you need to keep as a source of record or source of truth. So definitely on the Data Lake side, we’re growing more towards Data Warehouse capabilities. I think Data Warehouses are doing the same, and you can see BigQuery and Snowflake saying, hey, there’s a lot of data out there that is just stored in files in S3 that people aren’t loading into our database for whatever reason, and we want access to that. And that starts making everything look very similar, where the Data Warehouses become query engines on top of Iceberg tables. We get this merging of the two worlds, which is where I think everything is going: we’re going to merge the two worlds together and bring a lot of that database knowledge into what we’ve been doing in the Data Lake world. And then hopefully, like Russell said, users won’t have to think about any of it.

Tony Baer:

So Robert, do you see the worlds merging, or diverging, or replacing each other?

Robert Stupp:

That’s a tough question for me, to be honest. I think it’s always an evolutionary thing. It’s easy to say we just replace our Data Lake with the Data Lake House, but there are so many things that sit around all of this, all these processes and such. So I think it won’t go completely away, and as Ryan said, there’s a ton of unstructured data, videos and things like that, so there are maybe also some chances to get metadata out of those things, put it into the Lake House, and add some value to that unstructured data.

Tony Baer:

Okay. And Jordan?

Jordan Tigani:

First of all, it’s good to hear that there are no Lake House maximalists here. There are Bitcoin maximalists who say Bitcoin is going to replace fiat currency, and sometimes you hear Lake House maximalists saying everything else is going away and everything will be the Lake House. I don’t see it: Data Warehouses are not going to move to the Lake House for their primary storage; there are too many advantages in performance and security and manageability to owning your own data. I do think you’re already seeing them embracing, as Ryan was saying, being able to read data that people have chosen not to include in the Data Warehouse. As for Data Lakes as people are using them today, I can see those going away, actually, although there’s a bunch of video files, et cetera, that you need to store somewhere.

I think that the Data Lake was sort of a path-dependent organism that showed up because people were using Hadoop and they needed to throw their data somewhere. And so they threw their data there, which turned out to be a big mess. The most high-powered room that I’ve ever been in was when Google had its customer advisory board, with CTOs of like 50 companies in there. We asked them about their Data Lakes, and they were pounding the table, angry that they had invested so much in this; they hated them and couldn’t figure out what to do. And the Lake House gives them something to do.

It’s like, oh my God, this thing is broken and it’s breaking our company and we’re spending all this money and it’s insecure and it’s just this mess. The Lake House is something that gives them an out. And so I see people really jumping at the Lake House technologies because they give you a lot more control and understanding of what you actually have. And then I see the more traditional uses of a Data Lake going away a little bit as more people start to build on top of the Lake House.

What is Next For Apache Iceberg?

Tony Baer:

Okay, final question, and I’m going to give this to Ryan. What’s on the roadmap for Iceberg?

Ryan Blue:

Oh, well, in an open source project…

Tony Baer:

Or can we close the patent office now, and Ryan, you can go off into retirement?

Ryan Blue:

That’s what I was going to say: it really is driven by the amazing community. But I can tell you what we are driving for. In the next year at Tabular, we are looking to support and work with Apple on encryption; I think encryption is going to be a really great feature to have in the format. There’s also the Python implementation, and possibly getting closer to C++ and native implementations. I’d love to see that improving, because it’s bringing so many new people and use cases into the community. Multi-table transactions, or just being able to release data at the exact same time even if you’ve staged it and it’s not a real transaction: those sorts of operations we get questions about all the time, and we want to deliver true multi-table transactions. Along with that, further improvements to the metadata layer, things like better indexing; we’ve got some proposals for better statistics, and basically a lot of things. It’s an ever-changing community with a lot of talented people putting in an amazing amount of effort.
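Note: the Python implementation Ryan mentions is pyiceberg, which can read Iceberg tables without a JVM. A minimal sketch, with hypothetical catalog and table names, and with the caveat that the API is still evolving:

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # configured via pyiceberg's config file
table = catalog.load_table("db.events")

# Scan a projected, filtered slice of the table into an Arrow table.
arrow_table = table.scan(
    row_filter="amount >= 100.0",
    selected_fields=("user_id", "amount"),
).to_arrow()
print(arrow_table.num_rows)
```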

Tony Baer:

And I’ll ask you, as you were involved in creating this: are we ever going to get to the point where we’ve basically leveled the playing field, in terms of performance, in terms of control, in terms of governability, with the Data Warehouses? Do you see that happening?

Will Data Lake House Ever Get to The Point of Leveled Performance and Control?

Ryan Blue:

So in a format, those things are extremely hard because a Data Warehouse has so many runtime components right now. I think that we’re pulling apart the Data Warehouse model and moving to separate compute and storage and Iceberg helps us along that path. But there’s a ton of metadata management and caching and other tasks that need to be done and sort of warmed up in order to have the experience that you have with the Data Warehouse. And so I don’t see that going away or being replaced by a format like Iceberg anytime soon.

Tony Baer:

Gotcha. Anyway, Ryan Blue, Russell Spitzer, Robert Stupp, and Jordan Tigani, I very much appreciate you taking the time to share your insights on Iceberg and the Data Lake House, if we’ve at least temporarily agreed to call it that. This is Tony Baer, and I’d like to thank the folks at Dremio for hosting this conversation here at Subsurface. Thank you.
