Table Formats Nessie Metastores Apache Iceberg Subsurface LIVE Sessions

Session Abstract

Join Tomer Shiran, Dremio Co-Founder and CPO, as he explores the evolution of cloud data lakes and the separation of compute and data. He’ll share how emerging open source projects like Apache Iceberg and Project Nessie are bringing transactions, data mutations, time travel and even a git-like experience to cloud data lakes on Amazon S3 and Azure Data Lake Storage (ADLS). With these advances, data essentially becomes its own tier, enabling us to think about data architectures in a completely different way.

Webinar Transcript

Speaker 1:Ladies and gentlemen, please welcome to the stage Dremio co-founder and CPO, Tomer Shiran.Tomer Shiran:All right. Thank you everybody for joining me this morning, or afternoon or evening depending on where you’re joining us from. I’m very excited today to spend the next 30 minutes with you talking about [00:00:30] the trends that we see in the world of data infrastructure and data analytics, and also sharing with you three specific open source projects that are paving the way to a new architecture. Let’s get started.I have the opportunity in my role here at Dremio as the founder and chief product officer to spend a lot of time with different companies, many of the leading companies in the world in different industries that are doing amazing things with data, and [00:01:00] many of them are trying to democratize data access. They’ve realized that data is their most important asset and it’s something that they really have to make available to a broad range of people within their companies. In order to do that, there are very specific requirements that they are trying to achieve with their cloud data infrastructure, with their data analytics. And so I’ve tried to identify four of those key principles and that we see again and again and again in many of these [00:01:30] leading companies.The first one is cloud. You see this everywhere now. Every company is migrating to the cloud. In many ways, what’s happened in the last year with the pandemic has only accelerated that. Who wants to be managing an on-premise data center? It was already hard enough, so now you have all these additional complexities, and that’s really driven even more companies to the cloud in an even bigger way. That’s that first principle that we see everywhere, is this migration to the cloud and the use of cloud [00:02:00] architectures.The second thing is scalability. As we heard in the previous session, by 2025, the amount of data annually that will be generated is something like 175 zettabytes. Every company now has a huge amount of data. And with that large volume of data, it becomes increasingly challenging to actually make that data available. And so having scalable systems that can scale out, and actually scale down based on demand is really important. And that’s a key principle for many of [00:02:30] these companies.Flexibility and agility. How do you basically create a situation, provide an environment where you can very quickly respond to different requests and different changing demands, both internally and also market driven changes? That is what we call flexibility or agility, the ability to move quickly. And then finally, availability. When you want to democratize data, you have to have a system, a platform that [00:03:00] is always on where the data is always accessible. It’s not something where it’s available some days but not other days, or sometimes and not other times. Or maybe a situation where you have to request access and wait weeks or months before anything can be done. That’s what we mean by availability.These four principles are very fundamental to how so many of these leading companies are dealing with their data challenges. We’ve seen this before, and with application development, [00:03:30] companies, really, and developers faced similar challenges, and the architecture of applications has evolved to meet a very similar set of needs. So if you think back in terms of how we built applications 10 years ago or even five years ago, very much monolithic, right? We took client server architectures, mostly proprietary. Back in the day I, myself built applications. I would take that [00:04:00] Microsoft Stack SQL server in the backend, a bunch of MSDN libraries, and that’s how you built an app. It was all one kind of integrated, very monolithic architecture. And that had some advantages, it was easy to integrate, but really had some significant disadvantages in that it was very hard to scale, it was very inflexible. You couldn’t really take advantage of all sorts of innovations that were happening in the open source community and just various libraries that were being created.And if you look at the last five years, [00:04:30] we’ve completely shifted and transitioned how we build applications. If I look at here at Dremio, the technology that we’ve built, so many different microservices. We use Kubernetes and we run these microservices within that in this loosely coupled architecture. We take advantage of a lot of open source libraries. So if you think about our user interface as an example, lots of different open source libraries that are being brought together to provide that unique experience.And so the architecture in which [00:05:00] applications are being developed now is just extremely different from what it was five years ago. And for many of those same reasons and principles that we just saw on that previous slide, we want to be able to achieve scalability and flexibility, and we want something that is always highly available and easy to manage. And so those same principles have led to this revolution in how applications are being built.Well, if we think about data analytics [00:05:30] and data infrastructure, a very similar transition is happening right now. I was actually one of the first employees at one of the Hadoop companies, and back in 2010, we were all talking about co-location of compute and storage. Bringing those things together on a single set of hardware, a single cluster that you would build and you’d buy servers and you’d install a distributed file system and all these compute technologies, SQL, engine, Spark engine and so forth. So that kind of monolithic [00:06:00] architecture, and for good reasons. The networks were the bottleneck. You could overcome that by having co-location of compute and storage. And this was true for Hadoop, it was true for the data warehouses that people were running on prem. And in fact, even for some of the early cloud deployments that were mostly lift and shift.Then about five years ago, we saw the rise of a new architecture, and this one was enabled by the fact that Amazon and Microsoft were willing to rent us both storage and compute on demand, [00:06:30] the fact that we could buy a server and pay by the second instead of having to order it and rack and stack it. And so, because of that flexibility with the compute layer and these infinitely scalable and very easy to use cloud storage services, you saw cloud data warehouses come to life with this new architecture that involved separation of compute and storage. And in this architecture, each of those things can scale independently. So if you have more data but you’re not doing more compute, you can just pay for more storage. If you have the same amount [00:07:00] of storage or data and you want more compute, you can pay for more compute. So provides efficiency and scalability advantages.But at the same time, significant limitations here in that the data still has to be ingested and loaded and copied into that one proprietary system. It’s very much tied to a single engine. If you had multiple engines, well, now you have multiple copies of data. You can’t access that same data without paying that data warehouse vendor to get the data in, to use it, to get the data out. And so the costs [00:07:30] are… they may start very low because you’re not using the system, but at the end of the first year, they’re astronomical. And this becomes a very big challenge for companies that are trying to manage their costs in the cloud.What we have now, and we see this actually inaction at most of the big tech companies, but increasingly at many other innovative companies, is a new architecture that not only separates the compute and the storage, but actually separates the compute and the data. The data itself being its own tier [00:08:00] stored in open source file formats and open source table formats increasingly. And that same data can be accessed and interacted with from different engines. It can be processed from a SQL engine like Dremio, it can be processed from something like Spark, from machine learning engines like Dask or Array, from streaming engines like Kafka, Flink. All sorts of different technologies all interacting with that same data. And not just in a simple read only manner anymore, but actually with inserts and updates [00:08:30] and deletes and transactions and all these new kinds of capabilities. And we’ll be talking about some of that today. This is what we mean by the separation of compute and data and this new loosely coupled architecture.As we saw in that previous keynote, the constant copying of data has become a huge challenge. A huge challenge in terms of cost and a huge challenge in of security and compliance. You really can’t move fast and you can’t secure and control [00:09:00] data if you’re going to be creating lots and lots of copies all over the place. What it typically looks like in most companies is that you have data streaming to this data lake environment into something like S3 or ADLS on Azure. And of course it’s not some set of small static tables. We’re talking about massive streams of data that are constantly being ingested and changed both in terms of data and in terms of metadata and schemas and so forth. So it’s a very live [00:09:30] system.And then many companies then start ETLing some portion of that data, a subset of those tables, maybe the last 30 days into a data warehouse. You copy data from that data lake storage into the data warehouse. And it doesn’t end there. If that was the only copy, maybe that’d be okay. But then we start creating lots of different copies within the data warehouse. We create copies because different users want to see the data a little bit different. So we start personalizing the data. [00:10:00] For performance reasons, we can’t get the performance that we want on that raw dataset, so we start creating summarized versions or aggregate tables. We sort the data in different ways. So we’re creating all these different copies of data, whether for business reasons or for performance reasons more often.And then again, it doesn’t end there. What ends up happening because you want even more performance is that many companies then start creating cubes. The data engineering team or the data team starts creating cubes where data is pre- [00:10:30] aggregated in these separate structures that are independent and disconnected. The Tableau and Power BI users create extracts or imports because they need higher performance that the data warehouse is unable to provide. That’s another copy of data. And then the data scientist is exporting data onto their local laptop because they can’t get the data fast enough from the database in order to perform their algorithms, in order to train a model. They can get much faster access from a local file and [00:11:00] so they download it onto their laptop.And of course, that disconnected copy now means that the data is very quickly stale, it means that IT has no control over who’s accessing the data when it’s accessed. You can revoke permission or revoke access to a dataset that doesn’t affect the people that have downloaded it. When you get a GDPR request to delete somebody’s information, of course you can’t do it when it’s on a data scientist’s laptop. So all these copies of data that we have in many companies, [00:11:30] these really inhibit us from being able to achieve data democratization, and even just to comply with regulations.So today I want to talk about three open source projects that are powering the future of cloud analytics. These are projects that I’m personally very excited about, and I think you’ll see these completely transform what becomes possible in these cloud data lakes over the next 12 months. The first one is Apache Iceberg, and that’s an open table format for the data lake. The second project [00:12:00] I’ll talk about is Project Nessie, which is a modern meta SOAR for Iceberg that provides Git-like capabilities, things like time travel and version control and branches. And then the third one is Arrow Flight. And Arrow Flight is a subproject of the Apache Arrow project, provides a high-performance data exchange framework. I’ll cover all three of these, but more importantly, throughout this conference, throughout the next two days, you’ll be able to attend technical talks from the creators of these projects and really learn about what they do and [00:12:30] what they’re enabling and how they work.So let’s start with Apache Iceberg. If you think about historically how data lakes worked and what that data tier looked like, it was mostly about files. It was the responsibility of the company that was deploying that data lake to get the data into files, whether it was JSON or CSV files, or more efficient collinear files [00:13:00] like Parquet. The company would structure this. You’d structure the data in these files, you’d create the right partitions and so forth. But that was the fundamental building block. And then we had table facades introduced into the mix there to make it a little bit easier with things like Hive metastore and the Glue Catalog on AWS to provide that table level wrapper.But we didn’t have the capabilities in these tables that you see with like a database, with a data warehouse where you can do transactions [00:13:30] update records and so forth. And so what Apache Iceberg brings to the table is it creates this new abstraction that is much more sophisticated and provides a table level obstruction for the data lake. And by table level, I mean full transactions, record level updates and deletes, the ability to go back in time and query the data and process it as it was yesterday at 9:00 AM. The ability to separate the physical structure, the partitions from what the users actually experience, [00:14:00] the consumers of the data. A much more scalable metadata model where you don’t have to worry about waiting a long time for lots of metadata to be collected as part of a query. And most importantly, this works from any engine.So Apache Iceberg is supported from Dremio and from Spark and from Athena and many other technologies that are out there. It’s built in this open way. And the reason for that is that you have many large technology companies that are backing this project. In fact, it was created at Netflix, but you now [00:14:30] have companies like Apple and Amazon and Alibaba, Tencent, Dremio, ourselves, Expedia, LinkedIn, Airbnb, Adobe, GoDataDriven, Tinder and lots of different companies that are backing this project and investing in it. So we have a massive community that is creating all this innovation and adding support for many different engines and so forth.What I’d encourage you to do is check out these talks that we have at Subsurface. So today at 10:20 in the morning, you’ll hear [00:15:00] from Ryan Blue at Netflix who’s the creator of this project, or the PMC chair of the Apache Iceberg project. You’ll hear from Anton at Apple, you’ll hear from Andre and Gautam at Adobe about how they’re using Iceberg. So, lots of great technical talks about Iceberg at Subsurface.The next one project I want to introduce you to is Project Nessie. This is another project that I think will significantly change how people use data [00:15:30] lakes and what’s possible with data lakes. So when we think about one of the key capabilities of Nessie, it’s about enabling multi-table and multi-engine transactions, unlike a database, which is really a single engine, you have just that database’s engine. In an open data lake architecture, you have many engines. You have Dremio, you have Spark, you may have Hive, you may have Dask and Presto and lots of different things. So the ability to interact with the data from many different engines is really important.[00:16:00] In a simple case, you might have one table and one engine, as you see here. You might have a SQL engine like Dremio and you’re performing a transaction on the data. You’re reading from a table T1, you’re updating that table. Maybe you’re selecting from it again. All this is happening as one transactional unit. In a more sophisticated case, you might have multiple tables still with one engine. And this is what you see possible with like a database or a data warehouse. You start a transaction, you select from a table, you maybe update a different table. The classic example of moving money from one [00:16:30] big account to another requires a transaction.But in the more interesting example, you have many engines. And if we look at the companies that we deal with on a regular basis, their life and their world is much more complicated than a simple set of SQL queries. Maybe every day at the end of the trading day, they’re running a bunch of different Spark jobs and updating Dremio Reflections. And then doing a bunch of other transformations, getting the data ready to then be exposed [00:17:00] in an atomic way with their data consumers, with their analysts and their data scientists. And in the past, they’ve had to go through a lot of complexity and jump through a lot of hoops to make those updates in a transactional way. You update one table and then the referential integrity between that and a different table no longer works. And so those things have been complicated and we’ve basically had to, as users of data lakes, overcome those challenges.So with Nessie, you’ll be able to do multiple [00:17:30] tables and multiple engines and have transactions that span those things. And the way this is being done is through a very familiar concept called the branch. And this is something that if you use Git or GitHub, you’re already very familiar with. Git-like version control with commits, tags and branches. And these semantics, these Git-like semantics are actually very powerful for a number of different use cases. And I’ve shared a few examples on this slide, the first one being experimentation. [00:18:00] We have our production data lake and there’s a team of data scientists that wants to try something a little bit different. They want to experiment with something but they don’t want to yet make changes to the live data. Maybe something that’s also powering our dashboards.So we create a branch for them called the Experiment branch, and that team can then start using that branch in creating their own drive datasets, their own views of the data and experimenting in that branch without impacting all the other users. [00:18:30] Another example is promoting data. So promotion from a development environment to a staging environment to a production environment, as an example. We do this in software development all the time. So being able to do this with data where you can make a bunch of changes in your development branch and then merge those back into the staging branch, and then merge those changes that you do there once you’ve tested them back into production. That ability to promote from one environment to the next [00:19:00] becomes very easy. You don’t have to worry about making a copy of all these different datasets and managing what exactly is in each environment, which can be very difficult.The third use case here that you see as reproducibility. We see this a lot with MLOps, and actually analytics in general. When I produce a number that I’m going to share within the company and we’re going to make decisions based on that number, or when I create a model and we’re going to derive some results or some actions based on that model, I want to be able to reproduce that, to be able to create that [00:19:30] same model two weeks from now. And so I need to be able to access the data as it was at that point in time. Same thing for a dashboard. When somebody asks, “How did you come to this decision? How did you get that number?” I want to be able to show that number as it was at the time that I made that decision. So being able to go back in time to tag the data, a specific commit, and then be able to query the data based on that commit allows you to reproduce results and reproduce models.And then the fourth item here on [00:20:00] this list is data governance and compliance. As we talked about earlier, this has become a really important thing, a really important requirement because of regulation and because of all these security breaches. So the ability to know who’s doing what with the data and when is really, really important. And this becomes very easy if every change in the system is tracked and you can see all the changes that are happening to every table in the system. So I invite you to join the talk tomorrow by Ryan Murray who will be talking about a Git- [00:20:30] like experience for data lakes based on Nessie. He’ll share his experience building as one of the creators of this project.The third one open source project that I want to talk about today is Arrow Flight. Arrow Flight is a sub-project of the Apache Arrow project, which is something that we co-created actually at Dremio several years ago. Initially we took Dremio’s internal memory format and we open sourced it as Apache Arrow. And the idea was that if we could create an industry standard [00:21:00] where lots of different technologies we’re using Arrow, and it wasn’t just Dremio’s internal memory format anymore, well then, all sorts of advantages could be brought to the market later by having integration between systems that all speak Arrow.Well, today, we see something like 15 million downloads a month of Apache Arrow. It’s embedded into tons and tons of systems and technologies. Every data scientist uses Apache Arrow. And so that actually has become a lot more popular than we initially expected when we created the project. [00:21:30] And with Arrow Flight, we’re now taking advantage of the fact that Apache Arrow is pervasive and providing a high-performance data exchange across systems that can speak Arrow. So in this example, you see here that same diagram that I showed you earlier for the open cloud data lake where you have that shared data tier and all these different engines. And on the left, you see an example of a data science user, maybe writing some application or building some model.[00:22:00] What Arrow Flight allows these two types of integrations. One is it allows a client application, say somebody writing code in Python or using a Jupyter Notebook, a data scientist to interact with a distributed system like Dremio. So the ability to get data much faster in a parallelized way than you could use in older technologies. And the second piece of Arrow Flight is a kind of a many to many communication channel where you can have different systems, different distributed systems talking to each other in parallel. And what this basically [00:22:30] enables is a world in which the cost of different systems communicating with each other or coexisting becomes very low because there’s no longer a very high cost to having different engines running and collaborating in real time. They can move data and share data among each other almost for free.Specifically in the context of the client server [00:23:00] architecture, you have the opportunity now to provide a modern replacement for things like JDBC and ODBC with Arrow Flight. These technologies, ODBC and JDBC were created in the ’90s when networks were slow and the amount of data was a lot smaller. So now when we have these much larger datasets, and in some use cases, especially in the data science world, you need to download millions of records to the client because you’re trying to train a model or something like that. [00:23:30] You need much faster access to that data.And when we compare as an example in the Python stack, when we compare something like ODBC using the ODBC library and the speed at which you can get data from something like Arrow Flight, you can see that the difference is orders of magnitude. And it’s not just that your applications can be faster now. Of course, that’s great. But more importantly, you can actually start creating applications without having to download files onto the local machine. [00:24:00] A data scientist no longer has to download the data onto a file onto their laptop. They can actually start training their models and accessing data directly from the cluster, from the SQL engine or the database. And so we’re very excited about Arrow flight and what this means in terms of what’s possible for a data scientist.I want to invite you also to join a number of different talks that we have on Arrow Flight at Subsurface. [00:24:30] We have a talk actually today at 2:25 from FactSet. They’ll be talking about Arrow Flight and actually an implementation and go. And then tomorrow you’ll hear from the creators of Arrow Flight, Kyle Porter and Tom Fry at Dremio talking about Arrow Flight and Flight SQL.With that, I want to leave you with one final thought here, which is that with these types of innovations that we’re seeing now, we really have a new opportunity. For many years, there’s always been that tradoff between [00:25:00] the data lake and the data warehouse. We all know that if we choose a data lake, then there’s a lot of advantages that come with that. You have this loosely coupled architecture that’s much more flexible, less expensive, allows you to take advantage of these different engines for different use cases. Versus the data warehouse, which is of course, monolithic and much more rigid, much more expensive, especially the cloud data warehouses and the costs that skyrocket there.But at the [00:25:30] same time, there was a reason why in some cases you want to use a data warehouse. The fact that the foundation was table-based and transactional and you could update records very easily, time-travel transactions, things like that made it easier once you had the data in the data warehouse. But with these technologies, with Iceberg, with Nessie, with Arrow Flight and other things that you’re going to be hearing about at this conference, we’re moving into a new era. We’re moving into an era this year, actually, where [00:26:00] the data lake will be able to do everything that you can do with a data warehouse, and actually quite a bit more. You won’t have to make that tradeoff anymore. You won’t have to give up the flexibility, the scalability, and the low cost of a data lake architecture in order to get those other capabilities. You’ll be able to get all of that in the data lake. And of course, as you saw here, some significant additional capabilities that will never be possible in one of those monolithic [00:26:30] data warehouse architectures.I’m very excited about this. I hope you are as well. And I want to invite you to join as many talks as you can over the next couple of days where you’ll be hearing from the creators of these different open source projects, and also many companies that have implemented them and have built amazing data lake architectures that are enabling data democratization within their companies.Finally, before wrapping up, I wanted to invite you to join or to try out this free virtual lab [00:27:00] from AWS and Dremio, and you’ll be able to do that by visiting the Subsurface website. This will allow you to connect to an Amazon S3 data lake, create a semantic layer, create some spaces, query the data in S3, get a sub-second response time and see that that’s now possible on a data lake architecture. And there’s also, I believe a drawing if you visit the booth. So thank you so much, everyone, for joining me today, and I hope you enjoy the rest of the conference.