March 1, 2023

1:30 pm - 2:00 pm GMT

Hear from Ventana Research on the newest trends shaping the future of the Data industry including a deep dive into Data Lakes

Matt leads the expertise in Digital Technology covering applications and technology that improve the readiness and resilience of business and IT operations. His focus areas of expertise and market coverage include: analytics and data, artificial intelligence and machine learning, blockchain, cloud computing, collaborative and conversational computing, extended reality, Internet of Things, mobile computing and robotic automation. Matt’s specialization is in operational and analytical use of data and how businesses can modernize their approaches to business to accelerate the value realization of technology investments in support of hybrid and multi-cloud architecture. Matt has been an industry analyst for more than a decade and has pioneered the coverage of emerging data platforms including NoSQL and NewSQL databases, data lakes and cloud-based data processing. He is a graduate of Bournemouth University.


Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Matt Peachey:

Who is Matt Aslett?

So, Matt’s going to be our first speaker. He’s from Ventana Research. Matt’s been in the industry for over two decades and has been an analyst for over a decade. And he’s been doing a lot of coverage, particularly in our space, right? So in the emerging data platform space: NoSQL, NewSQL, data lakes, data lakehouses, and the cloud-based data processing space. And so please join me in welcoming Matt onto the stage.

Matt Aslett:

Thanks, Matt. Yes, I’m also Matt. I should say, in the interest of diversity, there are people with other names coming up, so stay tuned for that. So, as Matt said, I’m going to run through quickly here, in 20 minutes or so, some of the key trends we see in the industry that are shaping the evolution and adoption of data lakes. Just briefly before I get into that, a quick introduction to Ventana Research. So we are an industry analyst and research firm. As you can see, we’ve got about 100,000 members in our community. So we’re quite large in terms of our scope, but actually we’re a fairly small organization, predominantly based in the US, and in fact I’m the only person in Europe. So a slightly lonely existence, but everyone else is very friendly and we always get together over Teams and a lot of virtual sessions. Because we’re relatively small, we’re focused.

What Does Ventana Research Do?

So the core areas of focus include analytics, data, digital business, and digital technology. As you can see, there are some key application areas as well that obviously rely on those underlying data and analytics platforms: things like customer experience, human capital management, and the office of finance. And I’m part of the team that covers data, analytics and digital technology, and in particular I’m responsible for our coverage of data. So I thought I’d set the scene with some of the key overall business trends that we see driving adoption of technologies in the data space at the moment. So one of those we see is the consumerization of IT and the drive towards real-time responsiveness. Obviously we see this in consumer applications, which is now part of what’s driving requirements for more responsive enterprise applications. And increasingly, as I say, my colleagues who cover those application areas are writing about and doing research around the use of things like personalization and recommendations, all of which obviously need to be driven by data and analytics.

Obviously data has always been important to organizations. But the number of roles, the number of participants within an organization that need access to data as part of their day-to-day working lives, is increasing. And therefore organizations are having to evolve the way in which they manage and deliver data across the organization: increased engagement of business users through self-service, with access to data that is pertinent to their roles and their responsibilities, obviously with all the security and privacy and governance aspects that go along with that. And then we see the increasingly sophisticated use of data in operations. As I say, data informs all parts of a business today, not just the business analyst team, the data analyst team, the data science team, but all parts of an organization.

And we see obviously a lot of people talk about this concept of being more data-driven, and data-driven organizations. And I think we all know a data-driven organization when we see one. We think of companies like, I don’t know, Spotify or Netflix or maybe ING Bank. But it’s difficult for organizations that are trying to become more data-driven to identify what it is they can look at and what they can attempt to achieve. So we’ve drawn a plan here of, at least, what we mean when we talk about organizations being more data-driven. And part of that is more of a data culture within the organization, and that needs to come from the top down. It’s achieved through a combination of people and process and information technology, but driven by leadership to encourage people within the organization, be they executives, business leaders, or any level of decision maker, to define and articulate the value of data to that organization and the vision of how data can be used in that organization.

Business Trends in Data Storage

In order to do that, we also have seen over the years an increased investment in data literacy. So ensuring that people within the organization, whatever their level, whatever their role, have access to data that’s pertinent to that role, but also have the skills to actually work with data and understand how data is part of their role. So that obviously requires the delivery of training and investment in skills. And increasingly we see that as being delivered not, obviously, through training everyone to be a data scientist or data analyst, but by actually looking at how data is relevant to someone in their particular role, the applications they use, the business processes, the workflows that they have, and ensuring that data is delivered to them using the applications they already use in order to make decisions. And obviously we see things like natural language processing and embedded analytics driving self-service access to data.

We also see, and related to that, this idea of data democratization, by which we mean removing barriers that prevent or delay access to data, so that people increasingly have the skills, but then use the technology to facilitate access to data. And we see that increasingly organizations talk about data being treated as a product that’s generated perhaps by a particular business unit and is made available to others, either in the same business unit, other parts of the business, or even potentially to partners and customers. And some of the key enablers of that are things like search-based data discovery and guided data navigation, as well as obviously the all-important data governance controls, which set the guardrails in place, if you like, that enable organizations to facilitate self-service.

And what we see is that with those appropriate governance controls in place, that can enable this concept of data curiosity. It can enable and encourage people within an organization to do more data exploration and experimentation. Obviously, data science is a key part of that, but it’s not the only way we see that being delivered: discovering and exploring new business opportunities and challenges. And that is obviously encouraged and reinforced through collaboration between business units and through education, which is part of driving the data culture. So obviously this is cyclical. It’s not a matter of ticking each box. So that’s what we mean when we talk about organizations being data-driven. It’s a journey, and there are different aspects of that. And we could go through a whole other presentation just on that topic alone, but I thought it was worth setting the scene.

Are Businesses Actually Looking to Move to Better Data Technologies?

One point I think is worth dwelling on is this idea of data being treated as a product. And we see that as going hand in hand with data mesh. I’m sure, well, I know, because Zhamak is talking later virtually, that we will definitely be hearing about data mesh more today. And we do see that as something that is increasingly important, at least conceptually, to organizations. And we assert that more than one half of organizations are looking to adopt technologies that facilitate the delivery of data as a product within their organization. And that is not solely driven by data mesh. We see projects that are doing that which aren’t necessarily considered data mesh, but it’s part and parcel of the whole concept. The cultural and organizational changes to data ownership that go hand in hand with data mesh facilitate data as a product, and vice versa, of course. I’m not specifically here to talk about data mesh, although, as I say, I know we’ll hear more about that today, but more predominantly about data lake.

And it’s interesting, we’re now 10 years into, or more than 10 years actually into, this concept of the data lake. And it evolved initially as an idea of a platform where organizations could pull data from multiple locations into a central resource and make it available for multiple people within an organization to conduct multiple initiatives and projects, predominantly analytic. And last year we conducted, actually with the support of Dremio, one of what we call our Dynamic Insights research projects, which is survey-based. And we found some really interesting findings from that. And I’ll just run through a few of those to see where we are in terms of the industry landscape today. And the key findings, as I say: data lakes are widely adopted.

Data Lakes Are Shifting to Open Format Data Storage

Data lakes are now predominantly cloud-based. Data lakes are delivering on the expectations that were laid out more than a decade ago. And data lakes are shifting to open standards. So to run through each of those in turn: data lakes are widely adopted. Now, clearly this was a survey that was focused specifically on data lakes, so you would anticipate that the participants would predominantly be using data lakes. So with that pinch of salt, I think what was clear is that almost two thirds of organizations were already in production with data lakes. Almost a third are planning to adopt data lake at some point in the future. Clearly, add those together and you see that almost all the respondents in this sample are either using or planning to use data lake. Obviously if you look at the wider population of organizations there will be pockets where data lake hasn’t been adopted to date, but we do see it is now a significant and established part of the overall data landscape. And certainly those that have adopted data lakes are increasing their use of them.

As I said, data lakes are predominantly cloud-based. We see that more than two fifths of organizations are using cloud data lakes as their primary data platform. And actually, if we look at those that are using data lakes as their primary data platform, almost nine in ten are using the cloud for that. Those exist on the cloud. And that’s a big shift that we’ve seen over the last maybe five, six years. Obviously the early data lake projects were based on Hadoop, which was predominantly used at that time on premises. And we’ve seen a really significant shift over towards cloud data lakes, over to object storage as the underlying persistence layer for data lakes, and obviously layering multiple different cloud services and technologies on top of that to generate value from that underlying data.

Using Data Lakes to Store Multiple Data Formats

So I said that we see that data lakes are delivering on their expectations. And to put that into context, I think if we look at the original concept of the data lake, as I said, it was about multiples: data from multiple data sources, in a variety of formats, in a single environment that could be queried by multiple business departments for a variety of analytic workloads. And we see that in terms of those first two, multiple data sources and multiple data formats, that absolutely is happening today. More than half of organizations that are using a data lake store data from three or more different operational data sources. We also see that 59% of organizations are storing data using two or more file formats, and more than a quarter are using three or more.

And as you can see, in this particular sample JSON was the leading file format, followed by CSV and Parquet. But the point here is that multiple data sources, multiple data formats, that is absolutely what is being delivered, and what was initially outlined as one of the key areas for data lakes to deliver something that wasn’t previously available. Just a quick segue into table formats. We see that the use of new table formats is emerging. So almost half of the people in this study are still using Hive tables. And we obviously anticipate that people will continue to use Hive tables for some time to come, but more than half, 57%, are using at least one of what we describe here as the emerging table formats. By that we mean, obviously, Apache Hudi, Apache Iceberg, and Delta Lake. And as it happened, at the time that we did this, only 51%, just over half, were using Delta tables, and just over a quarter were using Apache Iceberg. Just anecdotally, I think this was about eight, nine months ago. If we were to do this today, I think we’d see those numbers shifting a bit. We see increased interest in Iceberg, increased adoption of Iceberg, but not necessarily to the detriment of Delta tables. So, yeah, I think the picture would be a little bit different today.

Databricks and Delta Tables

Audience Member: 

So are the Databricks on Delta tables?

Matt Aslett:

Yes. Yeah, yeah. Databricks. Well, obviously, it is an open source project, but most associated with Databricks. Absolutely.

In terms of multiple departments, again data lakes delivering on expectations, we see things like R&D, customer service, finance, and product management as being areas that, in particular, the organizations expect to benefit or are already seeing benefits from their data lake deployments. But actually almost nine in ten expect multiple business departments and functions to benefit from their investment in data lake. And then finally, multiple workloads. We see more than two thirds are actually running two or more analytics workloads on those data lake environments. Obviously, business intelligence reports were the most predominant there, the most popular at 69%, but also data science, machine learning, interactive business intelligence dashboards, and ad hoc exploration. And then, interestingly, more than a third are actually running operational applications on their data lake. And that’s something we’ve seen an increasing amount of, particularly as we see an investment in interactive, intelligent operational applications, which have as part of them things like recommendations and personalization. Obviously that requires an analytic process even if the application itself is operational. So obviously the data lake, if that’s where the data is, is increasingly being used for those applications as well.

Data Lakes Are Shifting to Open Data Standards

And then the final data point from this survey was that data lakes are shifting to open standards. And what was really interesting in this study, and this really jumped out at us, is that in perhaps that first phase of data lakes, as I said, a lot of data lakes were built on Hadoop using custom scripts and code, with organizations getting multiple projects working together with a lot of hard work and perhaps a bit of duct tape. We do see that decreasing. So 23% of organizations today describe their current data lake architecture as being based on homegrown scripts and code. We asked them what they plan for their data lake architecture to be in the future, and that goes down to 9%.

And as you can see, the growth is in the adoption of proprietary tools and cloud services, going from 36% to 47%, and particularly open standards and open formats, going from 21% to 39%. Interestingly, what isn’t shown in this chart is that almost two fifths of those currently using proprietary tools and cloud services are planning to move towards open standards. So a lot of the growth in proprietary tools and cloud services is coming from people currently with homegrown scripts and code, or indeed those that don’t have a data lake. And I think we see that if an organization doesn’t have a data lake today, it’s very easy to jump into a proprietary offering or a cloud service offering that is pre-built for them. But actually, as I said, two fifths of those currently using proprietary tools are moving towards open standards and open formats.
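
To make the size of those shifts easier to compare, here is a minimal Python sketch that works out the change in percentage points for each architecture type. The percentages are the ones quoted in the talk; the dictionary and function names are hypothetical and are not part of the Ventana Research study. The three shares do not add up to 100 because the talk does not give figures for any other answers.

```python
# Share of respondents describing their data lake architecture each way,
# as quoted in the talk (current vs. planned). Names are illustrative only.
current = {
    "homegrown scripts and code": 23,
    "proprietary tools and cloud services": 36,
    "open standards and open formats": 21,
}
planned = {
    "homegrown scripts and code": 9,
    "proprietary tools and cloud services": 47,
    "open standards and open formats": 39,
}


def point_change(before, after):
    """Return the change in percentage points for each category."""
    return {name: after[name] - before[name] for name in before}


changes = point_change(current, planned)
for name, delta in changes.items():
    print(f"{name}: {current[name]}% -> {planned[name]}% ({delta:+d} points)")
```

On these figures, open standards and open formats show the largest gain in percentage points, which is consistent with the emphasis the talk places on that category.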

Higher Business Satisfaction With Open Formats

So a real validation of the importance of open standards and open formats, even if it’s not immediately obvious when you look at the chart. And it’s not necessarily a reason for that, because we didn’t actually drill into asking why that was, but I thought this was really interesting. It’s certainly a correlation: when we look at it, there’s a much higher satisfaction level from organizations that are today using open standards and open formats with their data lake. So overall, more than four fifths of organizations say they’re at least somewhat satisfied with their data lake. Actually, almost half of those using open standards and open formats are truly satisfied with their data lake, and that compares to only 28% of those using proprietary tools and cloud services. So there’s much more to drill into there, and that will be part of our ongoing research. But it’s definitely a really interesting correlated data point, which highlights perhaps why we certainly don’t see people shifting away from open standards and open formats if that’s what their architecture is today.

I actually have got through all of this without using the term data lakehouse, which wasn’t deliberate, but.

There Will Be a Rapid Shift to Open Format Data Storage

Well, I suppose it was to some extent, because I wanted to talk about the overall trends without going into specifically talking about lakehouses. But we do see that this overall trend, this shift towards open standards and open formats and the adoption of newer technologies that enable organizations to generate greater value from their investment in the initial data lake architecture, is being driven in part by adoption of data lakehouse. And we certainly see that amongst organizations that already have an investment in data lake, there’s going to be a real rapid shift, or there is already a rapid shift, towards adoption of technologies that we would think of as turning that into a data lakehouse environment. And we can go on to talk about that, I know, in the next session. But it’s definitely about generating greater value from that accumulated data. I’d say the first phase of data lake, as I said, delivered on expectations in terms of having a repository of data from multiple sources that could be accessed by multiple business units for multiple purposes. The lakehouse is about making that much more efficient and delivering additional capabilities that deliver additional business value from that data.

So just to finish off, we always like to finish off with a recommendation. A couple of things here. Obviously, for organizations that have not already done so, we definitely would recommend that they start experimenting with investment in data lake and data lakehouse architecture to improve the insight from their accumulated data across the organization. I think organizations that have already perhaps done that first phase of data lake adoption should explore the benefits of lakehouse functionality to improve and accelerate the value generated from that environment.

As well, I think it’s important to obviously evaluate data lake and data lakehouse in the context of wider business initiatives: business transformation, digital transformation initiatives. We talk about data culture. Obviously, as I said, it’s not just about the technology, it’s about the people and the process and the information. And it’s those wider initiatives that can drive through not just the adoption of data lake and data lakehouse, but proving the value of those in the long term. And then lastly, just be cognizant of the potential benefits of open standards and open formats. Obviously there are multiple different data lake and data lakehouse technologies and platforms and options available to organizations, but we do absolutely see that there are some real benefits from open standards and open formats, particularly in relation to proprietary tools, in terms of things like avoiding, obviously, vendor lock-in and having flexibility to move either from on-premises to cloud or between cloud providers.

Although clearly that’s not something we see organizations doing on a regular basis, to have the option to do so is important. And then compared to homegrown scripts and code, again, it’s really about efficiency. And particularly it’s removing some of the management complexity and the overheads associated with that. So with that, I’ll thank you very much for your time. If you’d like to learn more about what we do and some of our research, then you can follow up with some of the links here. Obviously, I know you can’t click on these, but I’m sure we’ll be sending the slides out, and we’re always happy to get feedback and hear from organizations. So yeah, thank you very much for your time and I think we’ll move on. Are you going to introduce the panel? Absolutely. So yeah, thanks very much.