March 1, 2023

2:00 pm - 2:45 pm GMT

Panel on the role of open data lakehouse in data democratization.


Topics Covered

Keynotes
Lakehouse Architecture


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Yassine Faihe:

Okay. Hi everyone. So let’s build on the presentation that Matt has just delivered, right? And I would like to welcome you to this industry panel session. The topic that we are going to dive into is really to explore the role of the open data lakehouse in enabling data democratization. Okay. So this is an open discussion that we are going to have. My name is Yassine Faihe. I’m looking after a team of solutions architects for International, EMEA and APJ, as mentioned by Matt. And I’ll be the moderator for this session. So let’s get started. And I’m going to ask the gentlemen of this panel to start introducing themselves, starting maybe from the far right.

Peter Rees:

Hi there, Peter Rees. I lead enterprise data architecture at Maersk. So I’ve been working in this whole area from the early days of MIS systems and data warehouses, through Hadoop, and now into these cloud data lakes and lakehouses. So, quite a range of experience there.

Matt Aslett:

Hi, so I’m Matt Aslett, a research director. As was mentioned earlier, I’ve been in the industry as an analyst in the data analytics space for about 15 years, but 15 months with Ventana.

Shamil Shah:

Hi everyone. Shamil Shah. I’m a partner at Baringa Partners. We are a consultancy with a heritage in energy, but we work across multiple industries. I have about 20 years of experience in consultancy, so I’ve had the privilege of going into lots of different types of organizations, seeing the challenges they have with data, and working out what the right architecture, solution, and organization is to support their data challenges.

JB Onofre:

Alright. JB Onofre, principal engineer at Dremio. I’ve been a member of the Apache Software Foundation for 15 years now, and I work on roughly 20 Apache projects as a PMC member.

Business Users vs. Central IT

Yassine Faihe:

Okay. Thank you very much, gentlemen. So let’s set a little bit of context here. Okay. So we are witnessing a tension or friction between two communities. On one hand, we have the business users or the data analysts, who would really like to access any data from any source at the highest level of fidelity and in full autonomy. This is basically their dream. This is what they’re struggling to get. And on the other side, we have what we may call central IT, who really would like to offer these services, but not only do they lack the appropriate tools and platforms, more importantly, they struggle to have the appropriate governance in place, right? To reduce their security risks. So what I would like to understand here, in a very direct manner, is what could be the playbook to address those challenges? And this is the first question for Matt.

Matt Aslett:

Yeah. So I think what you’ve described there is obviously not anything that’s new. I mean, this friction has been there for, well, forever probably, but it’s certainly been exacerbated, as you say, by organizations trying to be more data driven, as I talked about. Obviously, if you want people in non-technical roles, for want of a better phrase, to have access to and use data, you need to give them the skills, you obviously need the technologies, but you need to facilitate that with business process change as well. And I think this is where a lot of organizations are struggling. Clearly there’s no silver bullet. It’s not that you can just adopt a specific technology and that’s it. As I said, it is a combination of people and process and organizational change.

But to pick one technology to start with that can help, it would be the catalog as a centralized resource, obviously in combination with the data lake. Because that provides the data management professionals with an environment where they can set the guardrails in terms of governance and security and privacy and access controls. But it also potentially enables the business users to have more access to data, easier access, self-service access to data, obviously within the context of those guardrails. And we’ve seen that, in combination with a change in attitude towards data governance, those technologies can enable organizations actually to go faster because, as I said, those guardrails are in place, which facilitates self-service, but in a managed and governed way.

Adopting a Self-Service Approach to Data Access

Yassine Faihe:

Okay. Yeah. Thank you very much, Matt, for this first question. So let’s try to exemplify these challenges. My next question, for Peter and Richard, is this: perhaps you can share some concrete examples, right? Of these challenges that you went through, most probably before adopting Dremio. And something that is pretty important for us as well is the key value drivers, right? That pushed you to address those challenges and adopt a self-service approach to data access.

Peter Rees:

I think one of the key things, coming to your point about the data lake being an enabler, is that it’s been an enabler to put a lot of data into an environment, and the catalog is a key part of that. With cataloging there’s almost a tension between governing the data and the speed at which you can put it in there. And I think one of the challenges that we’ve seen is that we wanted to get a lot of data in and make it available to people, but then the governance has had to catch up. And what that leads to, I think, is a degree of friction in being able to access the data. And maybe because Maersk is a large global organization, we have a wide range of business units that are using our data lake and wanting to use some of the same data, but we’ve ended up with duplication and proliferation of data sets.

So I think that is one of our key challenges, and it then slows down the ability to actually access data. And that’s very much, I think, the thrust of what we’re trying to address now with cataloging, and particularly around how you can get the value of fast data into the environment, using automation and automatic scanning of files to try and automate a lot of that process of cataloging and classifying data and making it available to users more speedily.

How Are People and Culture Driving These Processes?

Yassine Faihe:

Thank you. That’s very insightful. Let’s change gears and have a point of view on the same topic, but more from a consulting firm, and get the opinion of Shamil beyond the technology and the platform aspects. Right? As mentioned by Matt during his presentation, there is a huge component that is driven by processes, people, and culture, right? So could you please also share with us your view on those aspects, please?

Shamil Shah:

Yeah, so I guess when I arrive at a new client, who comes to me saying they need a data strategy, or they’ve got a challenge and they need to implement a data catalog, the approach I employ is really trying to understand them as an organization: understand what their C-suite are looking for, what their strategy is, and what is hindering them from a day-to-day perspective. And that’s fundamentally important. And the way I see it, all the capabilities and design patterns that we’ve got available to us, from a data mart to a data lake and now a data mesh, these, for me, are all different tools that allow us to really think about how an organization wants to operate in the future.

And the journey that I take organizations on is really: let’s try and find out where the value is going to come from, and let’s solve that first. Let’s solve that and try and prove technology patterns and architecture to work out what’s right and fit for purpose for the organization. That helps you to go on a journey of discovery with customers, with the users of the systems, to then start thinking about, “Okay, well what’s the future here?” And I think what comes behind that, to manage a data mesh and to have the right capabilities holistically in an organization, is a thought process about the operating model and the organizational capabilities that are required to do that. And that’s a long-term journey, especially with the scarcity of skills in the market. So we look at it as: how do you solve for that business value upfront, learn from the experience of working with your customers, but then build the foundations that are going to give you a future state, looking at the skills and the capability you need. And that duality is fundamentally important for success. You see so many projects fail because they haven’t thought about that overall organizational need and the operating model that they need.

Evolution of Data Platforms

Yassine Faihe:

Thank you, Shamil. Back to you, Matt. Now, what we would like to understand is the evolution of data platforms, right? So data warehouses were ruling the world for the past several decades, and now we are noticing the emergence of data lakehouses, and more specifically of open data lakehouses. So could you please share with us why you believe, if it is the case obviously, that data lakehouses, or open data lakehouses, are going to establish themselves as a credible alternative to data warehouses moving forward?

Matt Aslett:

Yeah. I would agree, certainly there is the potential for the data lakehouse to establish itself. As I said, we already see the data lake and, with the addition of capabilities on top of it, the data lakehouse as a significant part of the overall data ecosystem. Interestingly, we see in our data at least... I was going to say very few, but it’s about a quarter of organizations that say they have replaced data warehousing entirely with a lake or lakehouse. So it’s a relatively small number. The majority of organizations have a data lakehouse or data lake and also a data warehouse, and they’re potentially feeding data between the two as it makes sense, given a particular analytic initiative or workload.

And clearly there are still some organizations that just have a data warehouse, and don’t have a data lake or data lakehouse. So this market is definitely evolving. I think the direction of travel is towards the lakehouse from the data lake, because of those capabilities that are now being added to those environments, many of which are obviously things that were proven in a warehousing environment and are now being adapted and brought to the data lake to provide additional value on the data that is in those environments. And the data is in those environments primarily because it was maybe not the right format or volume to fit in a traditional data warehouse, which obviously has different cost and complexity aspects to it.

So I’d say at the moment, it’s very much a matter of the two coexisting. Clearly, we also see an evolution in the data warehouse space, where many of the vendors there are adapting their offerings to work alongside object storage as the persistence layer. And so, is that a lakehouse? Some people would say it is. Others would definitely say it’s absolutely not. So we try to use slightly different terminology to talk about the overall direction, which is the combination of the data lake and the analytic processing capabilities. But definitely within that, I’d say the lakehouse is absolutely here to stay. And for those organizations that have already made an investment in the data lake, that first phase, the data lakehouse is definitely the direction of travel to generate even more value from those investments.

Is the Data Lakehouse Evolution or Revolution?

Yassine Faihe:

Okay. So back to Richard and Peter, maybe two words: coexistence of data warehouse and data lakehouse, or revolution versus evolution. Could you please comment?

Peter Rees:

Yeah, I think one of the drivers for us for the adoption of the lakehouse is that I see it as a way we can create a more self-contained enterprise service that we can deliver out to different parts of the organization to support an overall data mesh architecture. So it’s really, at the end of the day, I think, a revolution. Okay. And it’s very much about getting that lakehouse architecture correct, with the right tooling, whether it’s fast SQL query or high-performance compute with Spark or whatever for analytical workloads, and that governance wrap, and data visibility and accessibility through the cataloging. And I think our driver is to really try and reduce the cost of ownership of that, and actually be able to support multiple implementations of a standardized architecture. And through that it really helps fix that speed-of-value thing that Richard was talking about, the governance, the accessibility of data. And I think it just reduces the cost of ownership overall. So it is going to be, ultimately, a revolution, I think.

Yassine Faihe:

Any comment Shamil on this? Same topic?

Shamil Shah:

Yeah, I guess when I reflect on organizations, what you typically find now is that they’ve got a combination of everything. They’ve got a combination of data marts, they’ve got data sitting under people’s desks, through to lakehouses, and some moving onto a data mesh. I think the key to all of this, in my mind, goes back to really understanding what you’re trying to solve for. And I think that’s where I see a lot of projects fail. There absolutely could be a revolution, but customers need very different things and tools, which goes back to my point earlier. The plethora of innovation in this industry, the plethora of tools we’ve now got, and the different patterns and models to solve different problems and use cases, is phenomenal.

And I think our job gets more and more complex, because actually we need to understand all of those, understand what users need, and work out how you join up all these concepts and tools together. So I think there is absolutely an opportunity for a revolution in leveraging some of these technologies and some of these concepts, but the way you implement it, the approach you take, how you engage users, and how you do training and adoption is going to be critical for that revolution to be successful.

Emergence of De Facto Standards

Yassine Faihe:

Okay. Great. So let’s focus on the openness, or the open side of the house, right? Of the open data lakehouse. And this is a question for you, JB. As you’ve spent a couple of decades in the open source community, you have probably also witnessed the emergence of what we may call de facto standards, right? In multiple domains. The last one that comes to my mind, at least, is file formats. Right? Now I would like to get your opinion about open table formats, right? If you can tell us a little bit more about that, and potentially about the emergence of some de facto standards in open table formats, please.

JB Onofre:

Yeah, so I think the most important thing when we talk about de facto standards is probably adoption. That’s how a project becomes a successful project from an open source standpoint. So when we talk about adoption, there are two parts. We have the technology part, and the community and governance part, which are sometimes related, but not necessarily. If we talk about the technology, especially for the table formats, the technology has to deliver speed and performance, and the features that are provided for the end users. So in a table format we can mention things like compaction and time travel. All of these are basically the features that allow you to say, “Okay, this table format is the one I need for my use case.” On the other hand, there are also the governance and community.

There’s a huge difference between a project under the Apache license and a project from the Apache Software Foundation, because in terms of governance it is not the same at all. You can be a project under the Apache license but controlled by a single company. That company decides the releases, and they decide the features they want to include or not. But open governance is a way to guarantee adoption across any kind of use case, generally speaking. So the interesting fact, and Matt mentioned it during the presentation, is that all the table formats today are open source: Delta Lake, Apache Hudi, and Apache Iceberg. So the difference is going to be on these two criteria: the governance of the project and the features provided. Saying that one is better than another is not really relevant. It depends on your use case. It depends on what you are looking for. But if you keep in mind these two criteria, you can form your own opinion about which one is probably the most sustainable.

Yassine Faihe:

Okay. Okay. Maybe a loaded answer.

JB Onofre:

Yeah.

How Has Dremio Helped Your Company?

Yassine Faihe:

So yeah, let’s try to close this debate perhaps with some experience sharing, right? So, I mean, we started the discussion about the challenges and about ways to address those challenges from a people, process, technology, and culture perspective. So I would like now to get back to Peter, Richard, and Shamil. Shamil, perhaps with the hat not of a consulting firm but of a customer with whom you are working, to tell us how Dremio has helped your company address the challenges that you mentioned before. Let’s start with Peter.

Peter Rees:

Okay. So I think one of the challenges was probably inherent in our environment, with some of the complexity and the range of tooling that we had, and the layers that we had within our data architecture. And I think what we’ve been able to do with Dremio is to remove certain things, like a database performance layer for our Power BI reporting. And that’s worked very well, where we’ve used Dremio directly over ADLS data sets for Power BI reporting. So in terms of reducing some of the copies or the layers of data that we actually had within our environment, and effectively either virtualizing those within Dremio or having reflections of them within Dremio, that has worked really well. I think that’s probably one of the key things.

Yassine Faihe:

Thank you.

Shamil Shah:

I’m going to answer a slightly different question, because Nick and the team are going to talk about RWE, who I’ve been working with for the last three years on a journey, and tell you about some of the benefits that we can see through the work that we’ve done there. But if I reflect on the journey that we’ve been through, starting about three years ago, with the strategy around the organization and looking at what organization they want to be, I guess the advice I would give is to really understand who the users are that are going to be the most vocal, the users that are going to be the hardest to please. Get them into a room with the products that you’re looking at and considering as you go through this journey, and get them to try them out and test them, and you’ll learn so much from that.

That’s what we did. We looked at the hardest use cases, the ones that are going to test the platform significantly around performance and the volume of data. What are the non-functionals around security and accessibility for individuals? What are the ways of working of these individuals? And we used that as our guide to select the right products. And as I said, Nick and the team are going to talk to you about our journey with Dremio, and why we ultimately chose that kind of product in that scenario and for those sorts of use cases.

Future of Open Data Lakehouse

Yassine Faihe:

Okay. Great. So, concluding remarks from JB about the future of the open data lakehouse. So, I mean, we’ve seen some innovation initially at the file format level, right? So we used to work with flat files, and then we decided that these are not suitable for analytics, so some new file formats emerged: Parquet and ORC. And then we went a level above, talking about open table formats. This is the debate that is currently ongoing. Okay. And the next level of innovation will come from the metadata. Okay. So metadata has maybe been left aside for several years. We are still using the Hive Metastore, which is perhaps the only remaining legacy from the Hadoop data stack. So JB, could you please tell us what’s going on from an innovation perspective in the metadata space?

JB Onofre:

So yeah, metadata is actually a very important topic when we talk about table formats, because a table format sits on top of the data files. You bring some database style on top of the data files, but to do that, you need the metadata. So you need a catalog with all the metadata inside, which manages the way you can create different tables, or roll back, if you do time travel, to a previous version of a table, etc. If you take an example like Apache Iceberg, Apache Iceberg disconnected the catalog from the table itself. So you can use JDBC as a catalog database, basically. You can use Nessie, which is another catalog. And what’s happening now in these catalogs is we have very different ways of managing the metadata.

At Dremio, for instance, we created a project on top of Nessie called Arctic, which is a service on the cloud. And we do data as code. So basically you can manage your data in exactly the same Git way as you manage your code. You can do commits, rollbacks, branching. So imagine I have a main branch now, and I would like to create another branch just to test my data. I can create this branch, merge the data into my main branch, or scrap the branch if I want. And you have all the logs, etc. So that’s, in my opinion, one of the big things happening right now. It’s data as code.
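The Git-style workflow JB describes, branch, commit, merge, can be sketched with a toy catalog where branches are just named pointers to immutable commits. This is an illustration of the concept only: `ToyCatalog` and its methods are invented for this example, they are not Nessie’s or Arctic’s actual API, and the merge shown is a simple fast-forward rather than real merge semantics.

```python
class ToyCatalog:
    """Toy 'data as code' catalog: branches point at immutable commits."""

    def __init__(self):
        self.commits = {0: {}}          # commit id -> {table name: rows}
        self.branches = {"main": 0}     # branch name -> commit id
        self._next_id = 1

    def commit(self, branch: str, table: str, rows: list) -> None:
        # A commit copies the branch's current state and updates one table.
        state = dict(self.commits[self.branches[branch]])
        state[table] = rows
        cid = self._next_id
        self._next_id += 1
        self.commits[cid] = state
        self.branches[branch] = cid

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # Branching is just copying a pointer; no data is duplicated.
        self.branches[name] = self.branches[from_branch]

    def merge(self, source: str, target: str) -> None:
        # Fast-forward merge: target now sees everything committed on source.
        self.branches[target] = self.branches[source]

    def read(self, branch: str, table: str):
        return self.commits[self.branches[branch]].get(table)

cat = ToyCatalog()
cat.commit("main", "orders", [1, 2])
cat.create_branch("etl-test")                     # experiment in isolation
cat.commit("etl-test", "orders", [1, 2, 3])
assert cat.read("main", "orders") == [1, 2]       # main is untouched
cat.merge("etl-test", "main")
assert cat.read("main", "orders") == [1, 2, 3]    # merge publishes the change
```

The point of the pattern is that readers on `main` never see half-finished work: the experimental branch is invisible until the merge atomically moves the pointer.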

Yassine Faihe:

Okay. Thank you very much, gentlemen. That was very insightful. So we still have maybe two or three minutes that weren’t planned, but if there is any question in the audience, I mean, for the panel, we’d be happy to consider it. Yes, please.

How Can Businesses Understand Data and Delivery?

Audience Member:

Thank you. I’ve been working on this topic of data mesh and data lakehouse for a private bank in Switzerland. We started this journey three to four years back. The mesh is key, yes? But understanding the data itself is one of the top priorities, and no organization, or very few, have their data documented. Here I’m talking not only about structured data metadata, but operational metadata and business metadata. Unless we have business metadata in place, it’s very difficult for end users or business users to understand. Because, see, every organization has multiple applications.

Each of those applications has its own data schema and data structure. Any data scientist, data analyst, or business analyst needs to understand the data they want to use, in real time or on demand. So what has been your approach to getting to a level where the business can understand the data and how it can be delivered? Because it’s a very complex challenge, and very difficult to get those things documented.

Yassine Faihe:

Thank you very much. Who would like to take this question?

Shamil Shah:

I’m happy to. So look, I mean, there’s no easy answer to this. It’s a complex world where you’ve got multiple different sources, all with similar sorts of data, you know, mergers, acquisitions, global systems with localization of data sets. And you come across that constantly. The approach that I employ is typically: let’s work out where the value is across the data, what the use cases are that we then need to solve, and what’s going to make the exec and the board happy in terms of the insight that you deliver. You then focus your resources, from a governance perspective, from an architecture perspective, from a data modeling perspective, on that set of use cases.

And you slowly build out a platform iteratively, building incremental value and solving for each of those use cases in turn. What that does is make sure that the investment you put into solving that problem is justified as you incrementally build the platform out. The historical thinking of rolling out a huge governance organization with lots of stewards and owners has failed numerous times when I’ve seen it. That doesn’t work, because they’ve all got day jobs and they aren’t given the tools and the capability. So you’re trying to fix a history and a legacy of bad practices in managing data, which is never going to work.

So it has to be pragmatic, it has to be value driven, and it has to bring the people that understand this data with you on a journey, making sure that their role and their value is recognized as you go through that. That’s how I get it to work. That’s the only way I’ve seen it work in terms of delivering that. But look, for any organization, it’s a long, slow, painful journey, which we have to face as data professionals.

Yassine Faihe:

Okay. Thank you very much Shamil. Thank you very much, gentlemen.
