Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Andrew Miller:
Hello, everyone, and welcome to the TDWI Webinar program. I'm Andrew Miller, and I'll be your moderator. For today's program, we're going to discuss how this survey reveals the state of data lakehouses. This is part one of our two-part series––part two is taking place on December thirteenth, and the link to register can be found in the resource window. Our sponsor today is Dremio. For our presentations today, we'll hear first from Fern Halper with TDWI, and we'll also be joined by Read Maloney with Dremio.
About the Speakers
Andrew Miller:
Again, today, we're going to be discussing how the survey reveals the state of data lakehouses, and our first speaker is Fern Halper. She is the Vice President and senior director of TDWI Research for Advanced Analytics. Fern is well-known in the analytics community, having been published hundreds of times on data mining and information technology over the past 20 years. She is also the co-author of several Dummies books on cloud computing and big data. Fern focuses on advanced analytics, including predictive analytics, machine learning, AI, cognitive computing, and big data analytics approaches. Fern has been a partner at industry analyst firm Hurwitz & Associates and a lead data analyst for Bell Labs. She's taught at both Colgate University and Bentley University, and her PhD is from Texas A&M University.
It's also my pleasure to introduce our guest speaker today, Read Maloney with Dremio. Read is the CMO of Dremio and is a cloud data and marketing executive, with a history of building and leading high-growth marketing teams at AWS, Oracle, and H2o.ai. Most recently at H2o.ai, he served as the senior vice president of marketing, leading all elements of marketing for the late-stage startup. [Before] working in the technology industry, Read was a captain in the United States Marine Corps, serving 2 tours of duty as a platoon commander in Iraq. Read holds a bachelor's degree in mechanical engineering from Duke University and an MBA from the Foster School of Business at the University of Washington. Welcome to both Fern and Read. And with that Fern, I'll pass it over to you.
Topics
Fern Halper:
Great, thanks, Andrew. Hi everyone, and welcome to this webinar on the state of the data lakehouse. So we're really happy you could join us, and as Andrew mentioned, this is part 1 of a 2-part webinar series. Today we're going to cover the state of the data lakehouse in terms of adoption. We'll also talk about an important part of the lakehouse, open table formats, and then on December 13th, we'll be talking about the data lakehouse and AI, and even the data mesh there. So we're going to do things a little bit differently today. We're going to share the results of not only TDWI research but also a study that Dremio commissioned about the state of the data lakehouse, and we'll talk about factors driving interest in the lakehouse and the current state of data lakehouse adoption. As I mentioned, since open table formats are a big part of the data lakehouse, we'll also talk about those, and we'll close the discussion by talking about some best practices for getting started with the lakehouse to manage diverse data. So we're going to do this by comparing our results and Dremio’s––I think it's going to be an interesting webinar. I'm going to kick things off and talk about what we're seeing in terms of the state of the data lakehouse, and then Read will share Dremio’s findings. So we'll be going back and forth on this.
Characteristics/Pillars of the Data Lakehouse
Fern Halper:
So first, let me define what we mean by a data lakehouse. We're saying that the data lakehouse is a platform that's a combination of a data lake and a data warehouse that provides warehouse data structures and data management functions on low-cost platforms, such as cloud object stores. And so really, these new platforms have blurred the distinction between the traditional data warehouse and data lake. They support and manage large volumes of diverse data along with SQL, they support BI, AI, machine learning, and other advanced analytics, and most often we see these lakehouses in the cloud. I'd say that they grew out of the fact that organizations need to collect and manage diverse data types, so it's not just about structured data, but also organizations want to make use of unstructured data like text data, for example. And they had their data warehouse for structured data, and their data lake for unstructured data. But, as probably you know, many data lakes evolved into large data storage platforms that weren't performant, that weren't organized, and the data quality wasn't there––and this led to the data lakehouse.
Data Warehouses and Data Lakes vs. Data Lakehouse
Fern Halper:
So what are some of the characteristics or pillars of the data lakehouse? Well, first, the data lakehouse provides unified access to all data. If you think about the dual data warehouse and data lake architecture, that's a siloed architecture, which means that data is often copied multiple times for the warehouse and the data lake. Aside from the errors that can be caused by this, it also involves duplicate development and maintenance efforts, so the data lakehouse eliminates silos. And then the data lakehouse also supports diverse data types––so that includes structured data, unstructured data, as well as semi-structured data. So the data lakehouse is a platform that can store massive amounts of all of these kinds of data. So I'm talking about structured, text, video, image, audio, all of that––the type of data that organizations now want to use to enrich their data sets for analysis. The data lakehouse typically also supports third-party data, like demographic or firmographic data. Lakehouse providers typically also offer up marketplaces, where buyers and sellers can exchange data. And as I was saying, all of this data can help to enrich data sets for analysis.
So, for instance, you can use data from the lakehouse together with weather data or other industry data, maybe if you're doing some logistics analysis or supply chain analysis––so that's very helpful. And then another characteristic of the lakehouse is that it supports tools for multiple user types. So this includes analytics tools, it includes open-source tools like R and Python, as well as commercial BI and data science tools. And that means users don't need to jump from the data warehouse to the data lake for their analytics needs. If you think about something like machine learning, and you want to perform some sort of machine learning analysis, that's going to make use of high volumes of diverse data. So maybe you want to marry your structured data, like your billing data, with your sentiment data, which is going to come from unstructured text. That can be an issue for traditional platforms where the data scientists might need to access data from both the warehouse, maybe to get at this billing data, and then the data lake, to get at the social media data that they then want to extract sentiment from. The data lakehouse supports all kinds of data, so those who need access don't have to go to multiple systems to get their data. And another key pillar is that the data lakehouse supports models in production. So I'm not talking about data models––I'm talking about predictive analytics and machine learning kinds of models. And that means that you build the model and then the data lakehouse [can] process the data to score the model, so it's supporting these models in production.
So those are some of the pillars of the data lakehouse. I want to compare the data warehouse and data lakes versus the data lakehouse, just to finish up this discussion. So the first is, unlike the data warehouse that deals primarily with structured data with predefined schema, the data lakehouse can store massive amounts of both structured and unstructured data, I talked about that. And then, unlike the data warehouse that supports basic analytics, such as reports and dashboards, which is typically what you're doing off of your data warehouse, the data lakehouse can support compute-intensive, iterative advanced analytics, in addition to reports and dashboards. And then if you think about the data lake, unlike the data lake that supports raw data, but you really can't query it, the data lakehouse supports efficient query and analysis. So it's ACID-compliant––many data lakehouses are ACID-compliant, and ACID compliance ensures that every transaction is either going to be fully completed, or it's going to be fully rolled back, without requiring the creation of new pipeline processes. And then, unlike the data lake that often doesn't separate compute from storage, which means that when one grows the other one needs to grow, which can be very costly, data lakehouses separate the two, allowing each to scale independently. So while both the data warehouse and the data lake have their pros––they've been useful for many things––they also have issues. And those are the issues that the data lakehouse attempts to fix.
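Note: The all-or-nothing transaction behavior described above can be illustrated with a minimal Python sketch. This is only an analogy and not a lakehouse engine: SQLite from the Python standard library stands in for an ACID-compliant store, and the `billing` table and `load_batch` helper are hypothetical names introduced for illustration.

```python
import sqlite3

def load_batch(conn, rows):
    """Insert all rows in one transaction; if any row fails, none are kept."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO billing (customer_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE billing (customer_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
)

ok = load_batch(conn, [(1, 10.0), (2, 20.0)])
# The second batch repeats customer_id 1, so the whole batch is rolled back,
# including the otherwise valid row for customer_id 3.
bad = load_batch(conn, [(3, 30.0), (1, 99.0)])
count = conn.execute("SELECT COUNT(*) FROM billing").fetchone()[0]
```

After both calls, only the two rows from the first batch remain in the table, which is the "fully completed or fully rolled back" property in miniature.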
Cloud Data Lakehouses are Rising in Adoption…
Fern Halper:
So what we see at TDWI is that cloud data lakehouses are rising in adoption. So here's the first of the data that we're going to compare. And so this data comes from the best practices report survey that we did on modern analytics, we did it in 2023. So we're talking about analytics such as machine learning, self-service analytics, AI, those sorts of technologies. And we asked, what data management technologies is your organization using to support modern analytics? And the respondents could select all that applied. You can see that the cloud data warehouse is at the top: 46% of the respondents were using a cloud data warehouse. You can also see the cloud data lake, if you read down the chart, at 26%, a few rows down. I thought this was interesting––it was the first time we saw that cloud platforms were used more than on-premises platforms because, as you could see, the data warehouses on-premises were used less than the ones in the cloud, and so on. But what I circled was the cloud data lakehouse, and we see that 21% of respondents in the survey were already using a cloud data lakehouse. So in other surveys, I've seen that number range from like 15 to 20-something percent.
…And Are Expected to Grow
Fern Halper:
Here's data from another TDWI 2023 survey. Again, it was for a best practices report. Here we asked which of the following types of data management platforms, repositories, or patterns are currently in use or planned for use by your organization to support BI, analytics, AI, and ML. You could see that the dark grey is currently using and the dark blue is planning to use; those are the 2 most important ones, and then we get into the light blue and the even lighter blue, which you may not be able to distinguish. You could see that in this survey, the data warehouse on-premises beat out the data warehouse in the cloud, but more people were planning cloud implementations in general. And then if you go down the chart––you can see where I circled it––here you can see that a unified data warehouse/data lake platform, e.g., the cloud data lakehouse, has the biggest potential for growth. So 22%––again, close to the 21% we saw in the previous survey––were using the platform today, but 46% were planning to use it. And that's bigger than any of the other platforms that are listed here. And why is that? It's because of the reasons I mentioned previously––organizations don't like data silos, they didn't like their data swamp, and they wanted to manage and analyze diverse data in a performant manner. And the data lakehouse can do all of that. So I'm going to ask Read what his data showed, but first, we wanted to ask you a poll question, and the poll question is, what is the status of the cloud data lakehouse in your organization? And there are 4 responses here. Please just pick one. The first one is, we've already implemented a cloud data lakehouse. Second is, we're planning to implement one in the next year. The third one is, we may implement it in the next few years if we need it. And then, we're using a cloud data warehouse and/or a cloud data lake, but we have no plans for a cloud data lakehouse. So I'll give you a minute to input your responses…
State of the Data Lakehouse
Read Maloney:
Yeah, thanks for [that]––yeah. I know we have some differences in our data, so we wanted to see what the audience had to say, too, in terms of triangulating the information.
Fern Halper:
Right, so we'll see if it's––we may not see the same numbers, because people who are joining in may not be using the data lakehouse at all.
Read Maloney:
One of the differences we'll talk about, too, [is] we didn't ask specifically about cloud data lakehouses. We asked, data lakehouses in general, whether they're more of a hybrid or even on-prem kind of environment. We didn't differentiate between the two, and it's something that you'll see where our numbers are just slightly different.
Fern Halper:
Well, let's see what we have here. So, 27% say they've already implemented a cloud data lakehouse. Now, 31% are planning to implement it in the next year. That's good. So they'll learn more about the Dremio platform, and what you're saying there, and we may implement it. And only 14% are saying that there are no plans, but obviously, they want to learn about it. So, Read, what do you all see?
Read Maloney:
Yeah. So, one of the things that's interesting––and we probably have a biased set of people attending this webinar right now, trying to learn more about adoption, what's going on in their data teams––but if you go to the next slide, what we ended up going into was, look, we first wanted to test in our survey, which went out to about 500 different individuals: what's going on in the lakehouse? And then, as we talked about, in this particular one, we're going to talk about table formats, because, without an open table format, you're just querying data that's on the lake, and you don't move into this lakehouse environment. So we wanted to couple those together and look at those trends to help everybody that's in the space figure out what's the right decision for them. And then we also asked about data mesh and AI, which we're going to cover in part 2. This is live on our website today. You can go out and download the report and read through all of the data as well, so we have that for you. But we'll also cover some highlights in this cast, and Fern and I are just going back and forth debating what we see in the data, and what we think is happening in the industry right now.
Data Lakehouse Awareness
Read Maloney:
If you go to the next slide, what we looked at is that we had 85% awareness of the lakehouse concept, which is just a massive change even in the last year. So I've been at Dremio just a little over a year, and when we came in, we were like, how many people understand this term? And even on our website, we explain the concept very early on, like, what is a data lakehouse? And in the market in general, while there's a bunch of different interpretations of what a lakehouse is, and I'm sure you run into that, Fern, of: does a lakehouse have this, or have that? Or is this in the bucket, or is that not in the bucket? There's this general agreement: you're going to get the flexibility of the lake, and you're going to get the speed and performance controls that you would generally have in a warehouse, on your lake, using open table formats. And so that combined is what leads to the baseline of a lakehouse. And so when we prompted people––we did prompt them, this wasn't like an unaided write-in of what you think––this was: in general, this is what the market defines the lakehouse as, are you familiar with this concept? 85% were very familiar with the concept, so that surprised us, this was higher than we thought. I don't know, Fern, if that's what you're running into in your conversations, too. But are you seeing that general level of awareness in the market?
Fern Halper:
Yeah, absolutely. And even in another survey that we did, we're just talking about unified platforms in general, and the lakehouse came up there. And I think it was 80%––more than 80% said that would be a big opportunity if we could do something like that. So again the question was asked differently, but a large majority of people said this would be an opportunity. The lakehouse was really on the move. I was surprised when I even saw that 20-something percent of our surveys were using it because it was a relatively new concept. And it's getting a lot of attention.
Read Maloney:
Yeah, these adoption models tend to follow value. When concepts are adopted faster, it's usually because the value difference between the current paradigm and the new paradigm is large enough. And I think that's what we're seeing in the lakehouse construct, and why it's moving through the market so quickly, just from a word-of-mouth and virality perspective. While again, the definitions may vary, you get this genesis of the concept of, yeah, that's where I want to run my analytics. And we saw that come out.
Data Lakehouse Adoption
Read Maloney:
So what's interesting is, Fern––it was great––you had some data on like, hey, we have 22% using this from a cloud perspective right now, with a much broader group to use that in the next year, or sorry, yours was just planning to adopt, and when you add up even the survey of the people who are attending right now––thank you all for that––you had about a little over 50% saying they have adoption plans in the next year, and then you add another 20-plus percent in on adoption after that. So that was a little over 70%, but it wasn't time-bound. So ours is close to that, in terms of if you're in the audience today, this is what we saw, but in addition to that, we looked and said, what do we see as the percent of analytics running? And so we see 69% not only plan to adopt in the next 3 years, they plan for that to be over 50% of the analytics in their organization. So this means, in 3 years, for most organizations, it will be the predominant way that analytics are delivered. And I think that's what you're seeing just as the performance continues to––I would say in some ways almost shock people in terms of just going direct query to the lake, and what can be done now and then managing that data with the open table formats––I can just start moving all of this data from a warehouse to a lakehouse, and I'm not seeing these negative trade-offs that I might have expected, and I think that's pushing people to see, oh, I can do more of this than I thought, and I'm starting to do it faster. And that's a pattern we see with our customers, which is––and we'll talk a little bit later about starting small, how you start, how you get going if you're not adopted in the group, but the second they see that, the adoption starts to go very quickly and the expansion starts.
So I think this is a different view, but generally aligns with what you're seeing in terms of the plan––was it 48% and 22%, Fern?
Fern Halper:
46 and 22––yeah, so pretty close.
Read Maloney:
Yeah. I think what we're seeing is there's at least some planned adoption in all aspects. But I also believe that, based on the way we probably ask the questions, if you were just querying a data lake, you were throwing that type of analytics workload into this bucket, too.
Fern Halper:
Yeah, I mean organizations that––I was just going to say that organizations want to do more analytics, and they've been hampered by many things, but one of the things is the infrastructure that they had. So this isn't surprising what I'm seeing here in your chart.
Data Lifecycle Remains Complex, Brittle, and Expensive
Read Maloney:
Yeah. And so part of why we're seeing that trend is, if you look at the current state of how the data goes from the sources to end up in the customers, there’s typically a lot of Spark in the middle of all these processes. But you've got to go from the source, where you're ETLing that data, and you're finally getting it into a lake, but then, again, you have to define the schema, and you're doing that for performance reasons, and so you're working back with the engineering team, and you're writing a whole ‘nother set of ETL pipelines to a set of data warehouses, or an overall enterprise data warehouse, and you might also have a whole set of data marts across your organization––so it might even be more fractured than this. And then once you get to that, you're still trying to improve it for the client, so you might have cubes and extracts, etcetera, going on. And you just end up in this challenging scenario, where it just takes too long. The mean time to insight is––I've got to talk to an engineer to do this, who has to talk to maybe even another engineering team to do that––and by the time you're back, I was just trying to answer this question or set up this report in Tableau or Power BI, it just takes too long. And then you start looking at the checks you're writing, and then you start looking at, how are you managing that code? How are you managing the code in between, and even managing the sequences? And every time we talk to engineers, they're like, I'm spending 30% of my time backfilling data based on the largely unsupported pipelines that created some data asset, data set at some point, and they're going back and trying to fix that––they didn't even know it was in production.
Enterprises are Moving to a Lakehouse to Simplify
Read Maloney:
And so I think the lakehouse really––why we see the lakehouse come up, and this is not necessarily specific to Dremio, and we're a lakehouse and a big component of this, but we see these advantages across the lakehouse adoption spectrum, which is, you're trying to simplify all of that complexity, which is, you take out the warehouse need directly, you can query from clients at really high speed directly into the lake, and you're getting much closer to just doing raw source loads. In this case, from the source, you're transforming it into your table format of choice, because if you don't do that, you don't know that it landed, you don't have ACID compliance, so you want to land it directly into that format.
What is Driving Adoption?
Read Maloney:
But this is really what we see as the drivers, and what that led us to is to talk about why companies are moving. And what we found is one of the big ones, not surprisingly, is cost. So if you look at all the cost and complexity in that cycle, and in that process, going from source to the client to the actual value for the business, we see that of the people that are adopting lakehouses, 56% of them think they're saving more than 50%, with 20% saying they're saving more than 76% by moving to a lakehouse. So a way to look at that is to say, there's only a smaller population that thinks their savings are in the 25%-ish range, and as you go through the chart, you start to see much higher cost savings. This aligns with the TCO analysis that we've done, and I would say it underrepresents what we hear from our customers. Our customers are often in the 70-80% range, although a lot of our customers are enterprise customers, and when we filtered our data by 10,000 and above, the 76-100% group went up to 38%. So that speaks to, I think, we'll see better returns on cost within the enterprise than you see in the overall market. I don't know, Fern, does that align with what you're seeing, too?
Fern Halper:
Well, I don't have a lot of data on that, and I have a question for you on that. So they're saving more than 50% with the lakehouse versus any other type of platform?
Read Maloney:
That's right, versus something else for analytics, let's call it any other analytics platform.
We just compared the current state, Fern, so it could have been like, hey, this is compared to a cloud data warehouse, or this is compared to an enterprise data warehouse. Both those options are possible in the way that we phrase the question.
Fern Halper:
Yeah, I mean, certainly we have data again on cloud, and cloud data warehouses, and cloud data lakes, saying that the top surprise, if you will, that organizations had with the cloud, was the cost, so I don't know if these are the reasons why they're saying that the cost was higher, but some of them might be if you know what I mean.
Read Maloney:
Yeah, could be, yeah. I mean, I know, recently, obviously in the current economic uncertainty, it's hard to say that there's an economic problem, but there's just a general uncertainty and especially pressure on data organizations right now, from a budget perspective. We've seen cost become a much bigger element of the driver, where they want to keep the data democratization, and they want to keep the self-service and the push to move faster going. But they're balancing that with a flat, or maybe declining budget. A lot of times it's budget, but they have this push to do more with even [less], and I think that's putting a lot of pressure to say, hey, if I could offload this, if I could reduce the number, the whole web, of ETL processes we have to move from, get it from this view to this view, to this materialization, to this aggregation, to this summary––all of those different items add up and need to be maintained, and it just gets really expensive with all the systems in the middle. So if you just start shifting that left to the business, can you start getting those savings? Can you compute against that data more efficiently? And the market's telling us yes, right now. And that would align with the adoption of lakehouses being fast, right, Fern? Like, awareness wasn't as high a year ago, it's built up this year, and the adoption's going quick, so there have to be value drivers.
The other one that showed up––and this is in the report for everybody that's listening––I don't have a slide on it––it was around ease of use and self-service and so that was 46% of people said that one of the reasons they were adopting a lakehouse again was the ability to essentially move faster. Fern, you mentioned eliminating silos earlier in the presentation, so we asked a lot of different questions around it, and we bucketed a few together to say, yeah, we think that's about what we're trying to get at was about 46% of people. So that was a main driver for them.
Fern Halper:
Yeah. And that's always the driver. As I said, organizations want to move to unified, they want to try to unify things and make things simpler, they want one source of the truth, data governance, they want self-service––it’s their top priority, they're trying to make things easier.
Where is the Data Coming From?
Read Maloney:
Yeah. So it's like, it's not compromising. The lakehouse helps you, maybe not compromise as much where you can deliver better self-service, and you can do so while reducing your costs. And so you end up in the best of both worlds for that. And that's probably what's driving the adoption high. So if you guys remember, and you guys meaning the audience, if you look back to what Fern presented today––the top technology people are using today is cloud data warehouses. And number 2 is enterprise data warehouses. So it should be no surprise, that's where the data is coming from. It was a little bit of a surprise to me, because I was like, look if you already have a data lake, it’s pretty simple to just go and say, well, we're going to go and move that, and make some changes to move that into a lakehouse. It seems like there would be less movement of data to go and do that. And what we've seen is, that the number one source is cloud data warehouses. And so, why do we think that's happening? We think it's for 2 reasons. Number one is, it's highly used. It's a highly used platform. But if people are moving from it, there must be pain. What is that pain? And the pain we hear a lot is cost, which is I want these types of benefits that I get from the cloud data warehouse, which has generally done a pretty good job of separating compute and storage, but you just run into these massive cost runs because you're defining the schema, and then you're running against it, and you're letting everybody go, and then all of a sudden, the costs surprise you. And you just end up with this expensive bill at the end of the month.
Fern Halper:
That's so interesting because it makes me think about the fact that I used to see, not so long ago, over a couple of years ago, I would see, like 55% of respondents using cloud data warehouses, and now the number seems lower. And I wonder if some of them––it's what you say that they went from the cloud data warehouse to the cloud data lakehouse.
Read Maloney:
Well, I think it's slightly different––this isn't the number that is still using the warehouse, this is where the data going into the lakehouse comes from, so it's the leading source for the lakehouse. So if [46%], which I think was your slide, are using a cloud data warehouse today––well, if that's the predominant way analytics are done today and that predominant way is not supporting the goals of data organizations, which are being helpful to the business, reducing the mean time to insight (MTTI), and doing so on a shrinking or smaller budget, something has to give, something has to change. And the way we see the market, there's only one architecture that gets you there. And we see that's probably why the adoption is high here. And we know people have been moving on from enterprise data warehouses for a long time, for a wide variety of reasons, cost being one, often scale being another. And so I'm not surprised to see those show up, I just thought the data lake would be number one. I was wrong.
Fern Halper:
That's also yes, that's a good point. But it's also interesting to me that––I mean, that also suggests to me that organizations are still primarily thinking about structured data, that's what was in their cloud data warehouses. And to me, there's so much value––one of the big values of the data lake and the data lakehouse, is that you can analyze more diverse data, which I think is where the real value lies, and organizations are just starting to move down that path, and hopefully the data lakehouse will help. We just did a best practice report on diverse data, and I was heartened to see that data like text data, was being used more often and that organizations were starting to analyze that, and like the example I gave, marrying it together, there's just so much value to be had there. And we see that organizations are doing that, so hopefully that portends, [that] we’ll also be doing more with it in this new type of environment. We'll see.
Read Maloney:
We still see, though, if you look at well, say, including semi-structured data like JSON, if you bring these in together, we still see a lot of organizations––I'll use customer 360, or supply 360––even without using the unstructured bits, or analyzing the unstructured bits to structure them, which is a common use case for AI, you're reading through a bunch of unstructured data. And then what are you doing? You're adding attributes or tags to the data, and you're essentially creating more metadata on top of the data. And that's a great use case. But we see even things where it's just like you would think it's more basic, still haven't been done. And the reason is, again, the data is stuck in silos, it's too hard to query across these sources, or if you can get the query across the source, it's too slow to be something you could put into a daily dashboard or something that's even an hour, or whatever that is, it's like you just end up in these system limitation problems. And you're back into, hey, we thought we solved this 5 years ago, and solved it as a group. And again, I think that's part of that ease-of-use bucket, it’s hard for us to peel off sometimes, whether the ease of use is more like, hey, I'm breaking down the data silos, or whether it's more, hey, these user interfaces to create views are so much better, I can let the business just rock and roll now. But that grouping is some type of key driver into the lakehouse right now.
Fern Halper:
Yeah. Well, yeah, let's talk about the next trend, let's talk about these open table formats, because that's another thing that we're seeing, and an open table format refers to a method of organizing and storing data in a structured format that's compatible and interoperable with various processing systems. I mean, table formats have been around since relational database management systems. But you're trying to structure datasets and files in a way that you can get at your data, and in a performant way. And then this open table format, it's open in the sense that it's not proprietary to any specific software or vendor, and can be used and accessed by a wide range of tools and technologies. The data lakehouse open table format makes it efficient for both data warehousing and data science workloads that use diverse data. And I just put some key characteristics of this open table format. Schema on read: the schema is applied at the time of reading the data, rather than when the data is stored, so that can help with diverse data sets. These open table formats, like Iceberg, often support schema evolution, which allows the structure of the data, like the schema of a table, to change over time without the need to rewrite old data. That's one of the innovations of Apache Iceberg that we'll talk about. I mean, this is particularly important for managing large data sets that evolve. The schema doesn't have to conform to specific file formats. Many open table formats, as we talked about, include support for advanced features like ACID transactions, efficient metadata handling, like metadata trees in Iceberg, hidden partitioning, and optimizing for query performance.
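To make the schema-on-read and schema-evolution points concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's API: the `Schema` class, `read_rows` function, and sample rows are all hypothetical. The one idea borrowed from Iceberg is that columns are tracked by a stable field ID rather than by name, which is why a rename or an added column only touches metadata and leaves the stored data alone.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Toy table schema: column names map to stable numeric field IDs."""
    name_to_id: dict = field(default_factory=dict)
    next_id: int = 1

    def add_column(self, name):
        self.name_to_id[name] = self.next_id
        self.next_id += 1

    def rename_column(self, old, new):
        # Only metadata changes; the field ID stays the same.
        self.name_to_id[new] = self.name_to_id.pop(old)


def read_rows(schema, stored_rows):
    """Apply the current schema at read time (stored rows are keyed by field ID)."""
    return [
        {name: row.get(fid) for name, fid in schema.name_to_id.items()}
        for row in stored_rows
    ]


schema = Schema()
schema.add_column("customer")
schema.add_column("amount")

# "Old" data written under the original schema, keyed by field ID.
stored_rows = [{1: "acme", 2: 120.0}, {1: "globex", 2: 75.5}]

# Evolve the schema: rename one column and add another.
schema.rename_column("amount", "order_total")
schema.add_column("region")

# Old data is read through the new schema; the new column comes back as None.
rows = read_rows(schema, stored_rows)
```

In this sketch the stored rows are never modified; only the name-to-ID mapping changes, and the schema is applied when the data is read.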
Apache Iceberg
Fern Halper:
There are several of these open table formats––Delta Lake is one, Apache Iceberg is one, and Apache Hudi is another. Those are 3 popular open-source data table formats that are supposed to make it easier to manage large amounts of data in the lakehouse. So here I'm highlighting Apache Iceberg, which is an open table format that's designed for large analytic data sets. I think it was created like 5 years ago at Netflix, to solve some of the problems that Netflix was having with Hive, using Hive with Hadoop. And Iceberg focuses on defining tables as lists of files––I put this as the second bullet––instead of a list of directories and sub-directories that was done with Hive, that was associated with Hadoop. So systems like Apache Iceberg and Delta Lake, define tables directly as a list of files that can be stored in object stores which allows these systems to better utilize modern cloud architectures and distributed computing paradigms. I've also listed some of the features of Iceberg, including things like expressive SQL. The fact that there's a rich set of features, that––SQL features that enable complex queries and data manipulations on the data lakehouse, and Iceberg also supports a variety of query engines. There's schema evolution, which we talked about, because tables change, whether that means you're adding or removing something, or renaming something, this is a significant innovation in Iceberg because you don't have to rewrite existing data. There are things like hidden partitioning, where the system automatically organizes data into partitions, but the end user doesn't have to worry about that. There's time-travel and rollback, which allows users to query data at a previous point in time, and that enables the examination of historical data or recovery from accidental data changes. 
There are also things like data compaction, which refers to the process of optimizing storage by consolidating smaller files into larger ones, and that can improve performance and reduce the overhead associated with managing a large number of small files. So also, Iceberg has a scalable metadata management system, that's another innovation, to make it easier to manage large data sets.
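The ideas of a table as a list of files, time travel, rollback, and compaction can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Iceberg is implemented: the `Table` class and the file names are hypothetical, and real table formats track far more metadata per snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy table: each snapshot is an immutable tuple of data file names."""
    snapshots: list = field(default_factory=lambda: [()])
    current: int = 0

    def _commit(self, files):
        # Every change creates a new snapshot; old snapshots are kept.
        self.snapshots.append(tuple(files))
        self.current = len(self.snapshots) - 1
        return self.current

    def append(self, *new_files):
        return self._commit(self.files() + new_files)

    def files(self, snapshot_id=None):
        # Time travel: read the file list as of an earlier snapshot.
        sid = self.current if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

    def rollback(self, snapshot_id):
        # Rollback only moves the pointer; no data files are rewritten.
        self.current = snapshot_id

    def compact(self, merged_name):
        # Compaction: replace many small files with one larger file.
        return self._commit([merged_name])


table = Table()
s1 = table.append("a.parquet", "b.parquet")
s2 = table.append("c.parquet")
s3 = table.compact("abc.parquet")

after_compaction = table.files()      # one consolidated file
as_of_s1 = table.files(s1)            # time travel to the first write
table.rollback(s2)                    # undo the compaction
after_rollback = table.files()
```

Because each snapshot is just a list of files, querying an earlier point in time or recovering from an accidental change is a metadata operation in this model.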
Apache Iceberg and Delta Lake Usage Expected to Grow
Fern Halper:
So, in terms of what we're seeing––and this is a second set of data to compare––this is data from that other best practice report I showed you before. In it, we asked which of the following types of data management systems, platform storage cloud services, and technology standards are currently in use or planned for future use. So it's the same story, where dark gray is currently used, and dark blue is planning to use. I blocked off 3 of the table formats, Hive, Delta Lake, and Apache Iceberg, and you could see that 26% of the respondents cited Hive, and close to 20% cited Iceberg and Delta Lake––they were tied. They were also tied in our data, in terms of Delta Lake and Iceberg in terms of planning to use––the bars seem pretty similar. We wanted to ask you all, what table formats your organization uses. And you can select more than one if you want here––I put Hive, I put Iceberg, I put Delta Lake, I put Hudi, and then I put other. So let's see if we can get some data here.
Read Maloney:
Yeah. And this one is one where, I think, our data is slightly different. We'll compare and see what we think about that. It's not too far off, [however,] there are some adoption differences, in terms of planned adoption, that we saw in our survey. It'll be interesting because you asked this as a multi-select question. [After all,] we'll run into customers who have multiple formats because there might be one that's running from the central team, but then, hey, the business wants to support some other tools, etc., and then they have another format, and then that's led to all this talk about catalogs that's going on right now in this whole table format concept.
Fern Halper:
And I realize I should have put ‘we don't use it at all.’ But I think that people will put ‘other’ if that's the case.
Read Maloney:
Do you mean if they’re just using Parquet right now? I mean they’re not using a table format? It’s almost more like none.
Fern Halper:
But that's why I think––I see more people coming in now because I think that people weren't using it, which makes sense. After all, they were saying that they were planning to use it.
Read Maloney:
So yeah, if only 20% of those on right now have adopted a lakehouse, then most of them probably have none here, because you're going to have to have a table format––although you see a lot of Hive adoption in the data lake construct, and it's still out there a lot, and we see that as well. The question is whether––it obviously can't do a lot of what––I'd say, the groups that are using Hive extensively now in more of a lakehouse format have done a lot of work [for] it [to] work. And the rest are adopting more of a format built for doing this, either Delta Lake or Apache Iceberg.
Fern Halper:
Yeah, let's see…Well, a lot of the people are using Delta Lake, they're not using Iceberg. They're using either Delta Lake, Hive, or ‘other,’ which probably means that a lot of them aren’t using anything. And so that's interesting, I think.
Read Maloney:
Yeah, well, yeah, where the group is right now, if we go to the next slide––in terms of what our data is showing, we'll go back here. So when we asked, just in general, have you adopted an open table format? We had 56% answer this question as yes. And this could be, hey, look, I have a data lake, and I'm using Hive, that could be part of it. But the way we phrased it is specifically what's on there, which is like Apache Iceberg. And then we looked at adoption timeframes. And so 50%, almost 56% had adopted a format, whereas another 25.7% had plans to in the next year. And so if you look at that all together, that would mean, overall, in the next year, we're looking at over 71% saying, look, we'll have adopted a format. And that will put them in some form of a lakehouse construct. And what's interesting is that's slightly higher than, I think––unless you add up yours, Fern, the 22%, plus the 46% that plan to adopt over the next year––that would land you somewhere in the same ballpark. But I think what we're seeing in the data right now is people say, yeah, I have an open table format, and some of that's going to be just, hey, we use Hive today, we have Hive tables and we've got some sort of Hive metastore that we're using to manage those tables.
Read Maloney:
And I think in the lakehouse we're generally seeing adoption of table formats changing from Hive, as you can see here. So where you had Iceberg and Delta Lake being roughly equivalent in adoption, our survey returned that Delta Lake had more adoption today, then next, Iceberg. And then, you see Hudi, and then obviously Hive’s moving. And so that tells me a little bit, in the interpretation of the survey, that our customers were largely looking forward to what we'd call lakehouse table formats, and away from legacy formats, but some, I think, based on what they've done with Hive, the investment they've made into Hive, and how they plan to make that work in the lakehouse world, are going to be sticking with it. So that would align with the customers I've been talking to. What we see, though, in terms of planned adoption, if you look at all the planned adoptions that are happening in the next 3 years, Iceberg starts to take over as a format choice. So there's sort of the––what is it looking backward, and what will it be looking forward? And we've seen a shift towards Iceberg, at least with our customers in the last year.
And we support both––so just for everyone on right now, you can do all read operations, so you can go and define logical data marts, we have ways where we accelerate those, we're fast across data both in the lake and in other RDBMSs, and federating across that, and you can do that with both Delta Lake and Iceberg. But when it comes to all the write operations we support, copying into, creating tables, deletes, updates, etcetera, all the DML operations, that's all going to be done with Iceberg as a format. And there's a reason why we did that, and it's a reason why we think that the planned adoption relates to why people are choosing formats, which is also in our full paper that's on our site––we think this is related to the ecosystem.
Read Maloney:
And so what I'm showing here is the contributions that are a part of both Delta Lake and Iceberg, in terms of updating and advancing the table format. And, as you can see, for Delta Lake, it's over 91% from Databricks. And so when we were looking strategically as a company at, hey, which format are we going to support, we wanted to support the one that was the open Apache project and had a diverse group that was fuelling it. And if you go and look at the ecosystem that's developed on Iceberg, it’s bigger, the open source ecosystem available around Iceberg is bigger. There are some competitive items here because Databricks won't let you write into Iceberg. So the way we see the market now, with Stripe, and ourselves, and Netflix, and Apple, and Tabular, and LinkedIn, and all these groups, AWS as well, investing in Iceberg, is that it's viewed as the open standard. And that's what we think we saw with ORC to Parquet, where they were like, hey, what's going to happen? In general, ORC's out there, Parquet became the standard––that could happen here, or something else. A lot is going on in this space right now that could continue to emerge, and we [here at] Dremio generally focus on the one that we think is open and that allows customers the choice to own their data and be as productive with their data as possible. So that's our guess as to why the Iceberg adoption is higher for planned than Delta. But when we asked the question directly, and we give you all the raw results in the paper, it wasn't 100% clear, there were a bunch of different choices that customers made, including performance being one of the key elements.
Fern Halper:
That's very interesting. And yeah, our data showed that they were pretty equal, which is interesting also because I hear companies talking, at least in our audience, a lot about Delta Lake, but when we surveyed them they were also talking about Iceberg. They were created for 2 different things, but maybe, as organizations need to address performance issues, and if that many were using Hive, they're going to be moving more to Iceberg also.
Read Maloney:
So the data suggests that, Fern, like, it suggests that they will be, but I think in the end, companies are going to need to work within the platforms that work best for their requirements. And I think the note that we're seeing is most companies that go into––we'll call it the Databricks world––they're going to use Delta Lake. Just based on the way that they've set that up, like, if you want to do your pipelines using Databricks, you're going to be writing into Delta Lake. And then, once you're in Delta Lake in that environment, you're in more of the overall Databricks ecosystem. And we see the concern of, now we’re open, can you get locked in by a vendor again, like it’s happened over and over again, where you had things like Teradata, and then you're like, oh, we’re all good, we’re in cloud data warehouses, and then everyone's looking at the cost bill again. It's like, well, what's the open standard? When I look at a table like this, when we look at the contributions, we bet that Iceberg is the open standard, and our view as Dremio, where we’re co-creators of Apache Arrow, is that we want to be associated with all the tools that can work in that format. And so, our bet on the write side is with Iceberg.
Best Practices for Lakehouse
Fern Halper:
It'll be interesting to see because even though adoption has been faster than I thought it would be, it's still relatively early. So we wanted to save time for audience Q&A because I see some questions coming in, but we wanted to conclude this by talking about some best practices, and Read also wanted to share a little bit more about Dremio with you. But what are some best practices for getting started with the data lakehouse? I mean, it first includes understanding your needs and your goals, like we were just saying. What are the use cases, what are your needs, what use cases don't make sense to start with, what problem are you solving? You're probably going to start small. But it's important to know why you're doing what you're doing. You're going to have to think about the lakehouse platform stack that's going to fit your needs. And what's that going to look like? And how are you going to get there? I have planning for integration here, but don't forget about data governance. What we see––I just always want to put this in the planning stages––don't keep it as an afterthought, because that's what a lot of organizations do. And we consistently see––then data governance becomes a top priority, and it's a top challenge, but don't wait for it, think about it upfront, even with the data lakehouse. And then you're going to have to determine how you're going to manage your platform and optimize for performance, and it's also going to be important to monitor what you have in place. So, Read, do you want to tell us a little bit about Dremio and some best practices for getting started?
Read Maloney:
Sure, just one interesting note on governance: we have a customer where––there are requests for data. So you get into the platform and you can see all this data that's there, and you say, oh, well, I want access to that. So in a lot of cases, you'd be cutting a ticket to a team to say, hey, I'd like access. Well, they put AI in the middle of that, with the governance team as a human in the loop. And so in certain cases where they knew what was in the data, they would say, oh, there's no problem with that. And so the AI would grant access. And then, in other cases where it was like, we're not so sure, based on this job role and this part of the department, and what the data is, that they should get access, they’d bring the human in the loop, and I thought it was a pretty innovative way to go in a very large enterprise that has to deal with just tons of requests that come in. How do you give access to the customer, to their business, as they need it, in a way that reduces time to insight, but does so in a very safe manner?
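The routing pattern described here, automatic approval for clear-cut requests and human review for uncertain ones, can be sketched as follows. This is only an illustration: simple hard-coded rules stand in for the customer's AI model, and the `AccessRequest` type, the `route` function, and every dataset, role, and department name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user_role: str
    department: str
    dataset: str


# Hypothetical policy data; a real system would learn or look these up.
LOW_RISK_DATASETS = {"public_product_catalog"}
KNOWN_SAFE_COMBINATIONS = {("analyst", "marketing", "campaign_results")}


def route(request):
    """Return 'auto-approve' for clear-cut cases, else 'human-review'."""
    if request.dataset in LOW_RISK_DATASETS:
        return "auto-approve"
    key = (request.user_role, request.department, request.dataset)
    if key in KNOWN_SAFE_COMBINATIONS:
        return "auto-approve"
    # Uncertain given this role, department, and dataset: escalate
    # to the governance team (the human in the loop).
    return "human-review"


decision_a = route(AccessRequest("intern", "sales", "public_product_catalog"))
decision_b = route(AccessRequest("analyst", "marketing", "campaign_results"))
decision_c = route(AccessRequest("analyst", "marketing", "payroll"))
```

The design choice worth noting is the default: anything the automated step cannot confidently approve falls through to a person rather than being granted or denied outright.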
A Fraction of the Cost of Other Analytics Offerings
Read Maloney:
So yeah, absolutely, going to a couple of things with Dremio, as you go back to some of the initial slides we looked at, just what's happening in the market now, in terms of going from source to delivering data that's highly performant to the business, to the end customers: we have specific technology at Dremio that's attacked each of these different elements. So, we have a query engine that's intelligent enough to understand different ways to rewrite the query live. So what we might do is have your business go and create a view, and they're going to write a view, and they're going to write a query against that. But that view is a logical view of the data, and if you have lots of stacked logical views, in the end, you could get poor performance. What we have is a technology called Reflections, where we will help you understand and then recommend, hey, this aggregation of the data, or summarization of your data, should be created so that you maintain high performance while you let the business go and create the views and essentially virtual data warehouses that are right for them. And the query engine is smart enough to say, yes, I know you wrote this query towards this view, but what you want is to have written this query and go and hit the summarization. And so that allows many things to happen, but one of them is that it allows the business to go and have a lot more freedom to operate without really slowing down and bogging down the system by creating all these virtualized data marts, while having the technology to ensure they still run fast. And then the other part of that is, we also see this move going into the data lake, going from ETL to more of a raw load. Now there is still a transformation into the table format that you've chosen, but after that, how do you version that? 
How do you ensure that the data, especially the data products that you're creating as a company and then exposing to the business, are accurate, clean, and verified? And so we've brought in git-inspired data versioning. So you can branch, add the data to it, merge it back into the main branch, and version your data. And so it's a way to take all of these different, really complicated things that happen in the data life cycle right now, to go from, hey, I'd like to have a data set that looks like this for a report, to actually building the report, and truncate that down. And what that also does is it just lowers your costs. So we become a fraction of the cost of other analytics offerings because we're targeting the pain that exists right now in between all these stages to get data to the customers.
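The branch, add data, verify, and merge workflow can be sketched with a toy model in Python. This is an assumption-laden illustration of the general git-inspired idea, not Dremio's implementation: the `VersionedTable` class, the branch names, and the sample rows are all hypothetical, and only a simple fast-forward merge is modeled.

```python
class VersionedTable:
    """Toy model: branch names point at immutable tuples of rows."""

    def __init__(self):
        self.branches = {"main": ()}

    def create_branch(self, name, source="main"):
        # A new branch starts as a pointer to the same data as its source.
        self.branches[name] = self.branches[source]

    def append(self, branch, *rows):
        # Writes create a new tuple, so other branches are unaffected.
        self.branches[branch] = self.branches[branch] + rows

    def merge(self, source, target="main"):
        # Fast-forward merge: valid only if target has not changed since branching.
        src, tgt = self.branches[source], self.branches[target]
        if src[:len(tgt)] != tgt:
            raise ValueError("target has diverged; conflict handling is out of scope here")
        self.branches[target] = src


table = VersionedTable()
table.append("main", {"id": 1})

# Load new data on an isolated branch; readers of main see nothing yet.
table.create_branch("etl")
table.append("etl", {"id": 2}, {"id": 3})
main_before_merge = table.branches["main"]

# Verify the branch before publishing (a stand-in for real data quality checks).
ids = [row["id"] for row in table.branches["etl"]]
checks_pass = len(ids) == len(set(ids))

if checks_pass:
    table.merge("etl")
main_after_merge = table.branches["main"]
```

In this sketch, consumers reading `main` only ever see data that has passed the checks, which is the point of verifying on a branch before merging.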
Read Maloney:
And so in general, our customers are improving their mean time to insight by 8-10x. We have a set of customer stories like Amazon––Amazon's a great customer of ours, where, within their supply chain group, they were able to vastly improve their speed and time to insight by using a tool like ours. Again, the cost and complexity go down. Most of our customers, as I mentioned earlier, are seeing 70-80% savings over what they are currently doing with analytics. We've also done a TCO report that we've put out. And so that shows how the lakehouse is much less expensive than current offerings. And then you can see, you're simplifying all that data management because you now have these data products coming in and you're exposing them to the business so that they can go out and build their virtual data marts. And we're accelerating that with our technology called Reflections, and you've now simplified all that, because you have a version control level across all of it. So that's really what it comes to. We view this as bringing all your analytics together because it can be in the lake, or it can be in Oracle or other RDBMSs, and you can federate across that, so you're getting this unified approach for self-service analytics, and you're doing so at a fraction of the cost.
Get Started with Dremio for Free
Read Maloney:
So with that, Fern, last pitch here: you can get started for free. We have both software and cloud versions. They're completely free, there are no time limits. You can just get in and use it [for] what makes sense for you on data. And so what we see a lot of customers start with is picking just one or two use cases. And you've probably heard that before, like, how do you do this? Well, look, if you already have data in the lake, and most customers do, start with the data already in the lake, don't start by re-platforming an entire pipeline. And then go and experience the difference. The difference in terms of being able to query that data directly off the lake, manage it in the lake, and expose it to one department, which allows them to define the different data assets they need virtually and still get that performance––that's a great way to start. And we see a lot of customer use cases start either in the customer 360 area, whether that's financial services, or retail, those types of industries, or we see it over more on the supply chain. We see a lot of different groups on the supply chain side, Fern.
Questions?
Andrew Miller:
Yeah, that was great, thank you both. That was a fantastic conversation there. We don't have a whole lot of time for questions, so I'm going to jump right into the one question we have time for today. So, Read, this is for you, you mentioned that users are satisfied, generally speaking, with their experience of data lakehouses. However, this person asks if there are any areas identified for improvement based on user feedback.
Read Maloney:
Improvement––I don't know who asked the question––I mean, I'm going to have to take a stab, I guess, because we can't ask you whether it's an improvement for lakehouses in general or an improvement for Dremio. Maybe I'll stick to improvement for lakehouses in general. I think if you look at it––I'm going to go back to warehouses for a second––if you look at warehouses and all the associated code that was written, even around things like UDFs, etcetera, that helps to expand the usability of the platform into multiple ways to customize it, I see that as still an improvement area that you could say is needed for lakehouses.
The other one is what we have a point of view on––so this is a biased one––which is that we want to make sure that you can do your entire pipeline, end to end, from source to getting it to the customer, with SQL. We view SQL [as] being the language of business, and one that's going to help customers continue to democratize analytics, and data, and the use of it, across every single question and every single part of what they're doing in their organizations. And so I still think that that has ways to [go] from an adoption standpoint within the lakehouse environment. So simplifying that whole area will make it even faster for the business to get the data sets that they need while they're working. And as a marketer, [it’s] near to my heart, I have a bunch of people on our team––we use Dremio internally as our lakehouse, and we are always trying to find ways to move faster.
Closing
Andrew Miller:
Alright, fantastic. Well, unfortunately, this does leave us with no time remaining today, so I would like to thank our speakers. And if you'd like to, you can follow up on Part 2. I would like to thank Fern Halper with TDWI and Read Maloney with Dremio. I would also like to thank Dremio for sponsoring today's webinar, and lastly, from all of us here, let me say, thank you so much for attending. This concludes today's event.