Subsurface LIVE Summer 2020
The Future Is Open - The Rise of the Cloud Data Lake
The rise of cloud data lake storage (e.g., S3, ADLS) as the default bit bucket in the cloud, combined with the infinite supply and elasticity of cloud compute (e.g., EC2, Azure VMs), has ushered in a new era in data analytics architectures. In this new world, data can be stored and managed in open source file and table formats, such as Apache Parquet and Apache Iceberg, and accessed by best-of-breed elastic compute engines such as Dremio, Databricks and EMR. As a result, companies can now avoid becoming locked into monolithic systems such as cloud data warehouses and Hadoop distributions, and instead enjoy the flexibility of using the best-of-breed technologies of today and tomorrow. In this session we explore these secular trends and the building blocks that have come together to enable this new open architecture.
Tomer Shiran, Co-founder, CPO, Dremio
Tomer Shiran is the CPO and co-founder of Dremio. Prior to Dremio, he was VP Product and employee #5 at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds an MS in electrical and computer engineering from Carnegie Mellon University and a BS in computer science from Technion - Israel Institute of Technology, as well as five U.S. patents.
All right. It's great to be here. Thank you, everyone, for joining us here for the first Subsurface event. The topic of today's talk is the future, and the fact that the future, we believe, is open. And we're going to be talking about that today and talking about the rise of the cloud data lake. So my name is Tomer, I'm co-founder and chief product officer of Dremio.
So the first thing I just wanted to mention briefly is how excited I am to have so many of you here with us today, from all over the world, different professions, different titles, ranging from data architects, data engineers, data scientists, folks that are responsible for BI, lots of developers, and lots of different responsibilities. And as Billy mentioned, we started really making this event known and allowing registrations only a few weeks ago. And we are so excited today to have over 5,000 of you here with us.
I'm really excited about the day that we have here ahead. And if you look at the list of speakers that will be talking today, it really spans a very diverse set of speakers, ranging from the creators of open source projects, things like Apache Iceberg, Apache Arrow, and Parquet, folks from Netflix, Apple, Dremio, Amazon Web Services. And then also a number of folks here that have created best-in-class data lakes at companies like Microsoft and Telenav, Software AG talking about the use of data lakes to power IoT analytics. And we have S7 Airlines here talking about the use of a data lake in the airline industry, and also Exelon talking about the use of a data lake in the energy sector.
And so I hope you're as excited as I am about all these talks that we have today, learning about both the open source projects that will power the future data lake, and also some of the lessons learned and the best practices for creating a cloud data lake.
So if we go back in time a little bit and think where we were 10 years ago, or even five years ago, really, we had two models for how to build analytical platforms. On one hand, we had data warehouses, and then we also had kind of the Hadoop distributions. And both of these were very monolithic architectures, but that was the only option. So we could use a data warehouse and basically load our data from all these different systems into one place, one proprietary system, and obviously a very, very expensive one. We've all gotten locked into various data warehouses over the years. And the challenge there is really a lack of flexibility. You have a single compute engine, a single interface; the only way to access the data is to use that one interface. And of course you have to pay to access that data. And that's true for both on-premise data warehouses and cloud data warehouses.
And then we also had the concept of an on prem data lake or Hadoop cluster. And so what we had there was basically we licensed the software from a single company, a Hadoop distribution provider. That distribution came with a variety of open source projects, specific versions of those open source projects that were made available to us. And really that combination was what we had to work with for the most part. So it's not very flexible. We kind of had to run those specific projects. And not only that, but it was also running on a fixed set of hardware. So we had a set of servers, maybe a 20 node cluster, a 100 node cluster, and that's what we had. And so the architecture of the system was really designed for that fixed cluster, that static environment.
And so these two architectures, they of course had their merits, but also a lot of challenges. The lack of flexibility, the fact that they were kind of monolithic and provided by a single vendor. And so you didn't get to enjoy a lot of the advantages that come with a more heterogeneous, modern, best of breed type architecture.
And the big thing that's shifted really in the last ... or happened in the last few years is this shift towards separation of storage data and compute. And that's driven by two fundamental things that have been brought to market by the cloud service providers, by companies like Amazon, Microsoft, and Google. And it starts with having this always on, highly available storage tier. Things like S3 and ADLS where you have an infinite storage service. I can store as much data as I want there. It's extremely inexpensive, right? Twenty dollars per terabyte, per month, more or less. It's available in many different regions. And we can basically take data in open source and standard formats, put it in that system, and it's available to every tool, to every engine, to every library.
And the big thing that's shifted really in the last ... or happened in the last few years is this shift towards the separation of storage, data, and compute. And that's driven by two fundamental things that have been brought to market by the cloud service providers, by companies like Amazon, Microsoft, and Google. And it starts with having this always-on, highly available storage tier. Things like S3 and ADLS, where you have an infinite storage service. I can store as much data as I want there. It's extremely inexpensive, right? Twenty dollars per terabyte, per month, more or less. It's available in many different regions. And we can basically take data in open source and standard formats, put it in that system, and it's available to every tool, to every engine, to every library.
So that's one big thing that's made this new architecture possible. The second thing is basically the availability of compute. Things like EC2 and Azure virtual machines, and that's led to the rise of a decoupled, elastic, and isolated compute tier. And what I mean by that is, the compute tier is now separate. It's running in a separate place from where the data is, and these different engines, they're elastic. They can grow, they can shrink, they can basically run at the right capacity to meet the needs of the workload. And they're also isolated from each other. So the work that a data scientist is doing in Spark jobs to do some data processing, or to build some machine learning models, is separate from the compute that is running and powering, maybe, the company's executive dashboards. So we don't want interference. And that's what I mean by isolated. So we have now engines for interactive SQL, for Spark, for batch processing, for machine learning, and so forth.
So let's drill a little bit into each of these layers and talk about what's really part of that. So it starts with the storage layer. And so these systems like S3 and ADLS, infinitely scalable, highly available, globally distributed. Very easy to use and to put our data into these systems. But it's not enough to have this storage service, the storage system that is so available and accessible to everything. We also have to make sure that the data that's being put into the system is also as accessible and as open.
Fortunately, what's happened in the last 10 years is we've come up with these formats that are both open and also very efficient. So things like Apache Parquet for storing data in a columnar format, in a way that lots of different systems can read. And we have ORC serving a similar purpose. And then of course, we have standards like JSON and text files as more of a lowest common denominator, used by lots and lots of different applications for log file formats and so forth.
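To make the columnar idea concrete, here is a minimal, stdlib-only Python sketch (the table and column names are invented for illustration, and real formats like Parquet add encoding, compression, and statistics on top of this): the same records laid out row-wise and column-wise, where a query that touches only one column reads just one contiguous array in the columnar layout.

```python
# Toy illustration of row vs. columnar layout (all names are hypothetical).
rows = [
    {"user_id": 1, "country": "US", "spend": 10},
    {"user_id": 2, "country": "DE", "spend": 7},
    {"user_id": 3, "country": "US", "spend": 3},
]

# Row-oriented (how CSV/JSON logs lay data out): to sum one column,
# every field of every record has to be visited.
total_row_oriented = sum(r["spend"] for r in rows)

# Column-oriented (the Parquet/ORC idea): each column is stored contiguously,
# so a query touching only "spend" reads just that one array.
columns = {
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "spend": [10, 7, 3],
}
total_columnar = sum(columns["spend"])

print(total_row_oriented, total_columnar)  # 20 20
```

The columnar layout also compresses better, since values of the same type and domain sit next to each other, which is part of why these formats are so efficient for analytics.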
So the data is now open and that's great, but just having files inside of directories, that's often not enough to represent a table in a complete enough way or in a flexible enough way. So that maybe is good enough for simple use cases, but when we want to define schemas and evolve the data in the schema in different tables, we need a little bit more structure there.
And so Hive Metastore was the first system really that was created and really grew to become the standard where lots of different engines and tools support that system. And inside Hive Metastore, you can define the structure of a table. So which files are part of that table? What's the schema that represents that table? What are the splits, the different kind of pieces of that table? And that allows you to evolve that also over time. And that's supported by Dremio, of course, by Spark, by Hive, by Presto, by lots of different things.
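As a rough sketch of what a metastore tracks, here is a stdlib-only Python toy (the field names and table are invented; the real Hive Metastore is a service backed by a relational database, with a much richer model): a catalog mapping a table name to its schema, location, and partitions, which any engine can consult to resolve the table to concrete files.

```python
# Toy sketch of the kind of metadata a Hive Metastore entry tracks
# (field names invented for illustration).
metastore = {}

def create_table(name, schema, location):
    metastore[name] = {"schema": schema, "location": location, "partitions": []}

def add_partition(name, partition_spec, files):
    metastore[name]["partitions"].append({"spec": partition_spec, "files": files})

create_table(
    "sales.orders",
    schema={"order_id": "bigint", "country": "string", "amount": "double"},
    location="s3://bucket/warehouse/sales/orders",
)
add_partition("sales.orders", {"country": "US"}, ["part-0.parquet"])
add_partition("sales.orders", {"country": "DE"}, ["part-1.parquet"])

# Any engine (Dremio, Spark, Hive, Presto, ...) resolves the same table
# to the same schema and the same set of files/splits.
print(metastore["sales.orders"]["schema"]["amount"])   # double
print(len(metastore["sales.orders"]["partitions"]))    # 2
```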
And then as we moved to the cloud, you see, for example, things like AWS Glue really providing Hive Metastore as a service. So a serverless version of Hive Metastore, so that it's easier. You don't have to deploy and monitor the process. You don't have to set up a database to store that information behind the scenes. But those systems also weren't enough, because, ultimately, what we want to be able to do is mutate data and perform transactions, the way we could in a database. And so Databricks created this project called Delta Lake. For cases where all the writes that are being done to the system are going to be done through Spark, that allows you to have transactions and time travel on data sets, and the ability to do record-level mutations. So deletes, updates, inserts, and so forth.
What's really exciting, and actually we're going to have a number of talks today from its creators, is Iceberg. Iceberg, an Apache project, is really a common, very open table format that was created by Netflix, and Apple, and Expedia, and other companies to provide that record-level mutation capability, and transactions, and time travel in a very open and shared way.
So really with Apache Iceberg, you'll be able to take a SQL engine and perform a transaction, and also use Spark to perform transactions. And so you could imagine now scenarios where a Spark developer is committing some data, and then a SQL developer maybe is trying to delete some customer from some table because of a GDPR request to delete that data. And so now you can start to have these interactions across all these different compute engines in the transactional way that you would want.
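The core idea that makes this possible is snapshot-based table metadata. The stdlib-only Python toy below is only a sketch of that idea, not the actual Iceberg specification (the class and method names are invented): every commit atomically publishes a new immutable snapshot listing the table's files, so readers always see a consistent view, concurrent engines coordinate through commits, and older snapshots remain available for time travel.

```python
# Toy sketch of snapshot-based table metadata (NOT the real Iceberg spec;
# all names here are invented for illustration).
class Table:
    def __init__(self):
        self.snapshots = []  # immutable history: each entry is a full file list

    def commit(self, files):
        """Atomically publish a new snapshot (e.g. after an append or delete)."""
        self.snapshots.append(tuple(files))

    def current_files(self):
        return self.snapshots[-1] if self.snapshots else ()

    def files_as_of(self, snapshot_id):
        """Time travel: read the table as it existed at an older snapshot."""
        return self.snapshots[snapshot_id]

t = Table()
t.commit(["a.parquet"])                  # a Spark job appends a file
t.commit(["a.parquet", "b.parquet"])     # a second engine appends another
t.commit(["b.parquet"])                  # a.parquet rewritten away (e.g. a GDPR delete)

print(t.current_files())   # ('b.parquet',)
print(t.files_as_of(0))    # ('a.parquet',)
```

Because the commit point is the only coordination needed, a SQL engine and a Spark job can safely interleave writes to the same table, which is exactly the cross-engine scenario described above.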
And then on top of all this, so now we have the table formats. They're open, they're accessible with things like Apache Iceberg, which you're going to hear a lot about today. But we also want to have the permissions and the access control done in an open way, in a standard way.
And so we have things like IAM from the cloud providers, and we have an open source project called Ranger that allows us to define permissions for our data. So that's the data tier and why it's so important that that's open and accessible. Then it comes to the compute tier. And that's also being reinvented really and powered by the rise of things like EC2 and Azure Virtual Machines providing an infinite supply of compute that is available on demand. It's rented by the second, and it's always there. And so now no longer do we need these architectures where it's about running this big cluster. We can actually take advantage and build systems that were designed for this new era where infrastructure can be rented by the second. And so you see different best of breed engines that can be created and operated independently from each other, all taking advantage of the elasticity of the infrastructure.
So when there's a workload, they can grow. When there's nothing running, they can basically go into hibernation so that you're not paying any infrastructure costs at that time. And so you can see, for example, Dremio provides kind of a cloud-native engine for interactive SQL and BI, and Databricks provides a cloud-native engine for Spark workloads. And EMR is powering a lot of batch workloads. EMR from Amazon. Athena is more focused on kind of that occasional SQL workload. If a data scientist wants to run a few queries a day, type situation, then Athena is a great solution for that. Data warehouse extensions. Things like Redshift Spectrum, Snowflake, and Azure SQL Data Warehouse, which is now called Synapse. So we have all these different best-of-breed compute engines that are taking advantage of the fact that we have this infinitely scalable and elastic infrastructure.
And these are architected, oftentimes, in a very different way from how the traditional on-premise systems or on-premise compute engines were built. And I'll talk, in just a few words, about some examples of the architectural things we did in our own technology to take advantage of these properties of the cloud.
So for example, in Dremio, what we have is a multi-engine architecture. So instead of having one big cluster that, yes, you can shrink and expand a little bit, we've created this multi-engine architecture where you can have a small engine to run your data previews, and a medium-sized engine to power your executive dashboards, and maybe a large engine to power more ad hoc data exploration queries that are coming from a data science team. Maybe the marketing team has their own engine, a medium-sized engine in this case, to power their reports, because they're paying for their own infrastructure, maybe.
So you have this multi-engine architecture where each of these engines can basically come and go, and go into hibernation when there are no queries running as part of that workload. So that saves you a lot of money when it comes to infrastructure costs. And it also provides isolation between the workloads. So for example, the executive dashboard is not impacted by a data scientist who's running a large ad hoc query. And this again, is made possible because we have things like EC2 and Azure VMs that support this type of compute model.
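To illustrate the two properties described here, isolation and hibernation, here is a stdlib-only Python sketch (the engine names, sizes, and the 60-second idle timeout are all invented for illustration, not Dremio's actual implementation): each workload gets its own engine, an engine wakes only when a query arrives, and an idle engine shuts down so it stops incurring infrastructure cost.

```python
# Toy sketch of a multi-engine architecture with hibernation
# (names, sizes, and timeout are hypothetical).
class Engine:
    def __init__(self, name, size):
        self.name, self.size = name, size
        self.running = False
        self.last_used = 0.0

    def run_query(self, query, now):
        if not self.running:        # wake only when a query actually arrives
            self.running = True
        self.last_used = now
        return f"{self.name}({self.size}) executed: {query}"

    def maybe_hibernate(self, now, idle_timeout=60.0):
        if self.running and now - self.last_used > idle_timeout:
            self.running = False    # stop paying for idle compute

# One engine per workload: a heavy ad hoc query can never slow the dashboards.
engines = {
    "previews":   Engine("previews", "small"),
    "dashboards": Engine("dashboards", "medium"),
    "ad_hoc":     Engine("ad_hoc", "large"),
}

print(engines["dashboards"].run_query("SELECT revenue FROM kpis", now=0.0))
engines["dashboards"].maybe_hibernate(now=120.0)
print(engines["dashboards"].running)  # False: the idle engine hibernated
```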
When it comes to the storage and the access to the storage, one of the challenges with remote storage can be the latency. So the storage not being local to the compute can impact performance. And so what we've done is we've created this thing called C3, or the columnar cloud cache, which allows us to leverage the fact that EC2 instances and Azure VMs basically come today with ephemeral NVMe storage that's extremely fast and just comes with those instances.
And so what we're able to do is take advantage of the fact that you have that local storage, local NVMe, and automatically and transparently cache data as it's being accessed on S3 or ADLS, so that subsequent access by different queries doesn't have to go every time to the object storage to get that data. So that provides a significant speed-up, and it also reduces the cost of the IOPS that you have to do there.
And then long gone are the days of people bragging about the size of their Hadoop cluster. I remember back in the day, five, 10 years ago, the conversation was, "Oh, my Hadoop cluster is bigger than your Hadoop cluster. I have a hundred nodes. I have a thousand nodes."
And today it's almost like the opposite conversation. And I think that's driven by the fact that we're getting monthly bills from Amazon and Microsoft. And so that reminds us that we really want to run on as little infrastructure as possible, and not more infrastructure. And so we've done things here around efficiency, to reduce the amount of data that has to be scanned to answer a query. So if a lot of queries are, for example, performing similar aggregations, or accessing data in similar ways, having the data pre-sorted, or pre-partitioned, or pre-aggregated, and having the system automatically take advantage of that, reduces the query cost and ultimately reduces the bill. So efficiency has all of a sudden become very important. And we call that capability data reflections in Dremio.
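The pre-aggregation idea can be sketched in a few lines of stdlib-only Python (the data and names are invented for illustration, and this is only the spirit of reflections, not Dremio's implementation, which also handles matching queries to reflections automatically in the optimizer): a small summary is maintained ahead of time, and an aggregate query is answered from the summary instead of scanning every raw row.

```python
# Toy sketch of answering an aggregate query from a precomputed summary
# (all names and data are hypothetical).
events = [("US", 10), ("DE", 7), ("US", 3), ("DE", 1)]  # raw fact rows

# Maintained ahead of time: pre-aggregated spend per country.
pre_aggregated = {}
for country, spend in events:
    pre_aggregated[country] = pre_aggregated.get(country, 0) + spend

def total_spend(country):
    # Answered from the tiny summary instead of scanning every raw row,
    # which is what cuts the data scanned per query (and the cloud bill).
    return pre_aggregated[country]

print(total_spend("US"))  # 13
```

The trade-off is classic: a little extra storage and maintenance work up front in exchange for far less compute on every matching query afterwards.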
So we talked about the open storage, the open data, the open compute tier, and the fact that we now have these best of breed compute engines. One of the things that I believe is going to be very important in the years to come is this other tier, which I call the interchange tier. And that will be powered by Apache Arrow and the Flight component. So Arrow Flight within the Apache Arrow project.
And so what this is, is basically a way for systems to communicate in parallel, in memory, more efficiently with each other. Of course, we can have different systems communicate by one writing an open file into persistent storage and another system reading it, but that, of course, is not very efficient. With Arrow Flight, you can have different systems, including distributed systems, talk to each other in parallel at very, very high performance. And so you can see in this example the different compute engines interacting with each other without having to go through files. And that makes lots of new use cases possible.
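To sketch what "in parallel" means here, the stdlib-only Python toy below mimics the Flight access pattern without the real library (the endpoint layout and function names are invented; the actual API lives in `pyarrow.flight`): a server exposes a dataset as several independent endpoints, and the client fetches all of them concurrently and reassembles the result, instead of pulling everything through one serial connection.

```python
# Toy sketch of the Flight access pattern (parallel endpoint fetches);
# everything here is hypothetical stand-in code, not the pyarrow.flight API.
from concurrent.futures import ThreadPoolExecutor

# The "server" exposes the dataset as independent endpoints (one per partition).
endpoints = {
    0: list(range(0, 100)),
    1: list(range(100, 200)),
    2: list(range(200, 300)),
}

def do_get(endpoint_id):
    """Stands in for one parallel stream of Arrow record batches."""
    return endpoints[endpoint_id]

# The client fetches all endpoints concurrently and reassembles the dataset.
with ThreadPoolExecutor() as pool:
    batches = list(pool.map(do_get, sorted(endpoints)))
data = [value for batch in batches for value in batch]

print(len(data))  # 300
```

In the real protocol the streams carry Arrow record batches in the same columnar layout the engines use in memory, so no serialization to an intermediate row format is needed at either end.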
The first use case that we've really unlocked with this, now in public preview, is Arrow Flight to the client. So Apache Arrow, which we co-created, and which you'll hear some talks about today, actually from Wes and from Ryan, is now downloaded, I think, something like 15 million times a month. And because of that, basically every data scientist now uses Apache Arrow. So we're taking advantage of that and providing this really high-performance interface from these distributed systems to a Python or an R client.
And in our experiments, we've seen something like a 100x speed-up going from ODBC and JDBC, which were really created 20 or 30 years ago, to Arrow Flight. And that's really important when you're trying to populate a data frame in a client application, when you're doing more than just a dashboard or showing a bar chart with 20 bars on it.
So when you think about all these different components, the storage, the data, the compute, we also have the client applications, of course. And at Dremio, we deal with a lot of the Global 2000. So a lot of big companies, and they have lots and lots of different client applications that people are using to access and analyze and visualize data. And so that ranges from BI tools, things like Power BI and Tableau, Superset, Looker. So all these different tools, and there are many others, of course. And then you have also more technical users who are using things like Python and R, sometimes doing that through a Jupyter notebook.
And so all of these different users, they want to have this open platform where they can access data, and they can all do it in a consistent way. And so now these different layers are, for the first time, really coming together in a very new way. And it starts with having this open storage architecture, with S3 and ADLS providing this infinitely scalable, always accessible storage tier. And then we have this open data tier where we're basically providing files and providing tables that can be accessed. And not just accessed and read, but also written and updated and deleted by all these different systems.
And then we have these best-of-breed compute systems, or compute engines, that each have their own characteristics, their own workloads that they specialize in, providing that best-in-class support for that workload. At Dremio, we focus on interactive SQL and BI; you have Databricks doing Spark; and all these different workloads are possible because the data is open.
So when you look at the agenda for today's conference, we have talks from the creators of many of these different open source projects. You'll hear a lot about Iceberg, you'll hear a lot about Apache Arrow. You'll hear from the creators of ADLS at Microsoft and from AWS. And really, we have a great collection of speakers on the various technologies. And we also have speakers who will be talking about how they've basically brought together this architecture and built best-in-class data lakes, taking advantage of all these various components to enable this kind of open architecture, both to avoid the vendor lock-in that we get with monolithic architectures, and also to avoid what we call vendor lockout. Because tomorrow there'll be another compute engine that's not one you see on this screen. It will be another startup or another open source project. And of course, you want to make sure that you can then take advantage of that new project that will come in the future. And that's what this open architecture is all about.
So I want to thank you for joining us today. Like I said, we have over 5,000 people here from all over the world. I know it's late in some parts of the world now. So thank you for joining us at this time, and feel free to reach out to me. You see my email here at the bottom. It's the at sign at the end, and if you want to interact on social, the at sign is at the beginning. Thank you so much.