Session Abstract

The Open Data Architecture panel closes Subsurface LIVE Summer 2021 with a lively discussion about the state of the cloud data lake with some of the most influential creators of and contributors to key open source data lake software. Moderator and Gartner analyst Sanjeev Mohan opens the panel with highlights of recent cloud data lake industry trends. Then he asks the open source panelists questions such as:

  1. Why data lakes have not, until now, succeeded in having the fast turnaround of data that was expected;
  2. What missteps have there been along the way and what lessons have we learned from them;
  3. What each contributor's journey has been, and why they have succeeded.

Video Transcript

Speaker 1:    Ladies and gentlemen, please welcome to the Subsurface stage Sanjeev Mohan, Vice President and Analyst at Gartner.

Sanjeev Mohan:    Hello, everyone. Welcome to the final session of the July 2021 Subsurface conference. You’ve made it to the very end and we have saved the best for last. I am so delighted to be here today to talk to you, along with [00:00:30] these four esteemed innovators in our open source community. This is an open source panel and it gives me great pleasure today to talk with people who have created some of the most influential products in the market. We take these products for granted, but it took a few of the people on this panel to create things like Parquet. Parquet was a project that started [00:01:00] across a number of different companies; these people collaborated and came up with the open source format that we now use for our analytics, mostly on cloud object stores.

Pandas is another one that many of you have used, which also came from this panel, as did the new ones that you’ve been hearing about in the last two days, like Project Nessie or Apache [00:01:30] Arrow. So the purpose of this panel right now is to take a peek into their minds… Who are these crazy people? What goes on in their minds that they feel compelled to come up with a new innovation? So that’s the first question we want to understand: what compelled them to come up with a new approach? The second thing that we would love our panel to tell us about is what were the missteps [00:02:00] along the way? What were some of the lessons they learned as they went about creating their new open source projects? The last thing that we want to find out from them is what’s next? What are they most excited about, and what will complete this journey?

So before I introduce the panel, it is very exciting to see that a journey that [00:02:30] we started in open source many decades ago seems to be coming together. We finally seem to be in a place where, in open data architecture, we now have a set of open source projects that complement each other and help us build out an end-to-end solution.

So with that, I would love to introduce to you today’s panel. We have two Ryans [00:03:00] on our panel. The first one is Ryan Blue. Ryan Blue is the co-creator of Apache Iceberg. He is calling us today from Boise, Idaho, where he has moved to start his new venture called Tabular. We have Ryan Murray, who is calling us from Munich. He is a co-creator of Project Nessie and an open source engineer at Dremio. [00:03:30] Julien Le Dem is calling us from Berkeley, California. He is the creator of OpenLineage and co-founder of a company called Datakin. He is also an Apache Arrow PMC member. And finally we have Wes McKinney, who is calling us from Nashville, Tennessee. He is the Apache Arrow PMC chair, the creator of pandas, a co-creator of Arrow, [00:04:00] and the co-founder and CEO of Ursa Computing. With that I will hand over to Ryan Blue. Ryan, please tell us what made you create Iceberg?

Ryan Blue:    Mostly complaining users. I just needed to make those support questions go away.

Sanjeev Mohan:    [inaudible 00:04:24] What kind of questions were they asking you?

Ryan Blue:    [00:04:30] Well, pretty regularly we had a system for making changes atomic, where any insert or change to a table would actually overwrite data silently. So you needed to know that that was happening. So actually making atomic commits work as expected, making schema evolution work as expected, hiding partitioning from users so they didn’t have to deal with that all the time. It was just a bunch [00:05:00] of different user issues that made us zero in on the table format as the next thing that needed to be evolved.
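
[Editor’s note: for readers unfamiliar with the features Ryan Blue mentions, here is a minimal sketch of what atomic commits, schema evolution, and hidden partitioning look like from Spark SQL with Iceberg. It assumes a Spark session with an Iceberg catalog already configured under the made-up name `demo`; the table and column names are illustrative, not from the panel.]

```python
# A minimal sketch, assuming a Spark session with an Iceberg catalog configured
# under the hypothetical name "demo"; table and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Hidden partitioning: the table is partitioned by days(event_ts), but readers
# and writers never have to reference a separate partition column.
spark.sql("""
    CREATE TABLE demo.db.events (
        id BIGINT,
        event_ts TIMESTAMP,
        payload STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Each write is an atomic commit that produces a new table snapshot, and schema
# evolution is a metadata-only change.
spark.sql("INSERT INTO demo.db.events VALUES (1, TIMESTAMP '2021-07-23 10:00:00', 'click')")
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (source STRING)")
```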

Sanjeev Mohan:    That is great. And Ryan Murray?

Ryan Murray:    Yeah. So when we started thinking about Project Nessie, we were really thinking about the progression of the data lake platform over the past 10 or 15 years. We’ve seen people, a lot of people on this call, slowly building up abstractions, [00:05:30] whether that’s abstractions to help with compute or abstractions for things like tables and data files and that kind of stuff. We started thinking, what’s the next abstraction? What’s the thing that makes the most sense? What we saw was a metastore, a catalog that sits on top of the table formats and can help them interact with some of the other things. So for us, it was trying to identify what’s next in these layers of abstraction that are making the data lake easier and easier to use.

Sanjeev Mohan:    So Ryan, we already had Apache Hive.

Ryan Murray:    [00:06:00] Yeah, we did have Apache Hive.

Sanjeev Mohan:    So…?

Ryan Murray:    I think it’s the same way that Ryan Blue felt that Apache Hive wasn’t quite well-suited for the table format. I think the single point of failure, the huge number of API calls to that metastore, even the thrift endpoint made it really hard to scale, it made [00:06:30] it really hard to use effectively, especially in a cloud native way. So looking at something that was going to be cloud native and would work with modern table formats and we could start thinking about extending to all the other wonderful things that my co-panelists are building, is what we were really thinking of.
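
[Editor’s note: a rough sketch of the git-like catalog idea behind Project Nessie, reusing the hypothetical Spark session above. It assumes the Nessie Spark SQL extensions are loaded and an Iceberg catalog named `nessie` points at a Nessie server; exact statements vary by version, and the table names are invented, so treat this as illustrative rather than exact syntax.]

```python
# A rough, version-dependent sketch; assumes the Nessie Spark SQL extensions are
# loaded and a catalog named "nessie" points at a Nessie server.
spark.sql("CREATE BRANCH etl IN nessie")            # branch the whole catalog, not one table
spark.sql("USE REFERENCE etl IN nessie")            # point this session at the branch
spark.sql("INSERT INTO nessie.db.events SELECT * FROM staging.new_events")  # hypothetical tables
spark.sql("MERGE BRANCH etl INTO main IN nessie")   # publish every change atomically at once
```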

Sanjeev Mohan:    That is great. So it seems like things that we’ve taken for granted in the relational world, some of the best practices from the age-old data warehouses, like [00:07:00] data modeling and transactional support, are now being brought into the new era of big data, if you will. Would you say that’s where we’re headed?

Ryan Murray:    Yeah, I think so. I think the transactions are a huge part, and I think that’s the reason that Ryan [inaudible 00:07:22] let Ryan introduce transactions as much as Ryan did. So I think it’s important to hear from everyone, but I think transactions are a really important [00:07:30] aspect, both for the data loss stuff that Ryan was talking about and, conceptually, for how people deal with the data lake. I think the transactions are really key to that.

Sanjeev Mohan:    Great. So Julien, what was your rationale for various open source projects that you’ve been involved in?

Julien Le Dem:    Yeah. So if we talk about Parquet, I think Parquet came first, and then [00:08:00] Iceberg and Nessie would come as the layers that were missing on top of it. Parquet was very much looking at, on one hand, Hadoop, which would scale up very well but couldn’t serve very low latency queries, and on the other hand, at least at Twitter, we had [inaudible 00:08:16], so more traditional data warehouses that had lower latency queries but didn’t scale as much as Hadoop. So we were always in between the two options, and I think some of it was [00:08:30] making Hadoop more like a warehouse, right? Starting from the bottom up, starting with the columnar representation and making it more performant, following in the tracks of all those columnar databases.

So really that was the beginning, and it just makes sense as you go up the layers: the next missing layer was the table format, the transaction layer, how we abstract that better, because Parquet is just the file format, right? So it just [00:09:00] makes things more performant for [inaudible 00:09:02], but it doesn’t deal with anything like how you update the table, how you do all those things. So we needed that layer on top, and it was great to start seeing this happening in the community.
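
[Editor’s note: a small, hedged illustration of the point Julien is making, that Parquet is a columnar file format and nothing more. The example uses pyarrow; the file, table, and column names are made up.]

```python
# Parquet as a columnar file: readers can pull only the columns a query touches.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "FR"],
    "clicks":  [10, 4, 7],
})
pq.write_table(table, "events.parquet")

# Column pruning: only "country" and "clicks" are read from the file.
subset = pq.read_table("events.parquet", columns=["country", "clicks"])

# Note what the file format alone does NOT give you: updates, transactions, or
# table-level metadata. That is the layer Iceberg and Nessie add on top.
```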

Sanjeev Mohan:    So, what about OpenLineage? What was the reason for that?

Julien Le Dem:    So I think one of the drivers we see for these open storage architectures is that there are a lot of [00:09:30] people who don’t use just one tool, right? They use things like Spark, they use things like pandas, they use warehouses, whether the SQL-on-Hadoop type, things like Dremio or Presto, or other proprietary warehouses. And so there’s lots of fragmentation, but they still want to be able to use all those tools and do machine learning on the same data. So I think this common storage layer makes a lot of sense to standardize, so that you can query it and transform data from various sources.

[00:10:00] The same thing applies to visibility in the lineage graph of data, right? If you care about understanding dependencies between those things, you might have a dashboard at the end, but you care about how all this data came to be, and whether it’s coming from your ETL or your raw source-of-truth data. And so OpenLineage is the same thing, right? Now we have this somewhat fragmented, very heterogeneous environment [00:10:30] where people might use different tools for different jobs, but they still need to understand how the whole ecosystem works together. So OpenLineage is really about standardization, really following in the tracks of how we standardized columnar storage on disk, or in memory with Arrow; OpenLineage is about standardizing lineage across all those things. So really tracking each transformation as it happens: what was the version of the code? What was the schema of the input? What was the version? [00:11:00] The nice thing about those transactional storage layers is you can keep track of what version you depend on and what version you produced. And so really capturing this across the entire graph of transformations. So I think we’re getting to the point where we do need to get visibility across everything.
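
[Editor’s note: a hedged sketch of the kind of run-level metadata Julien describes, written as a plain Python dict shaped like an OpenLineage run event. The field names follow the public spec as the editor understands it and should be checked against it; the job, dataset, and producer names are invented.]

```python
# An illustrative, OpenLineage-shaped run event: one record per job run, naming
# the code that ran, its inputs, and its outputs, so a lineage graph can be built.
lineage_event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-07-23T10:00:00Z",
    "producer": "https://example.com/my-etl/v1.2.3",   # hypothetical code version / URI
    "job": {"namespace": "warehouse", "name": "daily_aggregation"},
    "run": {"runId": "3f1a2b3c-5d6e-4f70-8123-456789abcdef"},
    "inputs":  [{"namespace": "s3://lake", "name": "db.events"}],
    "outputs": [{"namespace": "s3://lake", "name": "db.daily_counts"}],
}
```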

Sanjeev Mohan:    That is great. So that’s yet another piece of the puzzle that you’re solving. Which takes us to Wes. [00:11:30] So Wes, can you please enlighten us with some of the rationale for your involvement in the open source space?

Wes McKinney:    Yeah. So, along with Julien, I have been involved in Apache Arrow since the very beginning. Around six years ago, we recognized that the community had developed Parquet as an open standard for data storage and data warehousing, for data lakes and for the [00:12:00] Hadoop ecosystem. But we were increasingly seeing this rise of application and programming language heterogeneity, where applications are increasingly bottlenecked on moving large amounts of data between programming languages and between application processes, and going through a more expensive intermediary like Parquet to move data between different steps of the application pipeline was very expensive. So the idea [00:12:30] of Arrow is to have a language-independent, compute-engine-independent representation of tabular data, which can be used for very fast transport between computing processes and for in-memory processing.

We found that it’s portable across computing hardware; you can use it on GPUs and CPUs. For me, the motivation was that I wanted to provide an efficient means for the big data ecosystem and for data warehouses [00:13:00] to be able to export data at very high speed into the data science ecosystem, because I had been working to build pandas and the Python data science ecosystem, and you had all of these big data systems that were struggling to make data available to data scientists. Python was becoming really popular, really important, in 2014, 2015, and so the idea of Arrow was: let’s give all these applications a single point of efficient import and export. Since then, Arrow has been adopted by [00:13:30] many different databases, data warehouse systems, and big data systems, Spark and BigQuery and Snowflake and all these systems, as a means of interchange. That’s really radically simplified the process of getting data into the data science world or transferring data between these different computing environments.
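
[Editor’s note: a minimal sketch of Arrow as the shared in-memory, language-independent representation Wes describes, using pyarrow. The data is made up; the point is that the same columnar bytes can move between processes without being re-parsed.]

```python
# Arrow as an interchange format: pandas -> Arrow -> IPC bytes -> Arrow again,
# with no row-by-row conversion or intermediate file format in between.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})

table = pa.Table.from_pandas(df)          # columnar buffers, reused where possible

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)             # serialize to the Arrow IPC stream format

# Any Arrow-speaking process (JVM, C++, Rust, ...) could read these same bytes back.
received = pa.ipc.open_stream(sink.getvalue()).read_all()
back_to_pandas = received.to_pandas()
```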

Sanjeev Mohan:    That is great. So thank you Wes, I am very keen to have you talk more about where Arrow is being used, in what other places, [00:14:00] but we’ll hold on to that so we can make sure we cover all our questions here. So I’ll go back to Ryan Blue. Ryan, what were some missteps? Lessons learned along this journey?

Ryan Blue:    I don’t know that I would characterize things as missteps as much as just a process of getting better. If we look at what Iceberg is doing specifically, because I think that’s one of the ones I’m most familiar [00:14:30] with, there’s the period of innovation where it’s about getting something working at all, and then the next phase is getting it to work well. And that’s sort of where my previous comment about taking care of all these user issues and user complaints comes in, because we had something that worked and did a good job in certain areas, like Hive tables, but we realized through that whole process of innovation that we needed to [00:15:00] have these other features, we needed to bring back SQL semantics and things like that. So I would say our missteps were really just the evolution of going and saying, let’s do something that works even if it’s imperfect, which was really necessary, and then moving on to where we are today, where we’ve had a chance to rethink and rebuild a lot of those components.

Sanjeev Mohan:    So if you were to do it again, would you do it any [00:15:30] different?

Ryan Blue:    Iceberg or other things?

Sanjeev Mohan:    Yeah.

Ryan Blue:    Yes, I would probably do some of the sequencing and features a bit earlier, or yeah, I’d probably do the Iceberg V2 stuff a lot earlier than we are the [inaudible 00:15:50]. But then I think that might just be that we’re seeing GDPR type use cases now and the benefit of hindsight.

Sanjeev Mohan:    [00:16:00] Yep. Yeah. Very true. Yeah. And Ryan Murray, same question for you?

Ryan Murray:    Yeah, I think I really agree with Ryan Blue on this one. To call anything a misstep or a mistake, I think you need the benefit of hindsight to be able to do that. The past 15 years have been all about: we had this new paradigm, and now we have to figure out what to do with it. [00:16:30] Of course, we’re going to make some missteps, we’re going to make some mistakes, we’re going to go down some blind alleys, but it’s like scientific discovery. There’s no such thing as a failed experiment; every failed experiment teaches you something new. I think that’s what I’ve seen: we could have done things differently, but we didn’t know then, and we had to make mistakes to be able to get to where we’re at right now.

Sanjeev Mohan:    That is true. Yeah. Yeah, because I know, Julien, you have quite a bit to say on this, because [00:17:00] when you were working on Parquet, for instance, you had to be very sure that whatever you were going to do was going to stay forever in the open source community, right? So what was some of your thinking process?

Julien Le Dem:    The thing about creating a file format is that any mistake or bug happening along the way stays forever, because once the data has been written in that specific [00:17:30] format, any design mistakes, or under-specified elements of the spec that lead to heterogeneity of implementations, we have to deal with forever. So it’s something we’ve been very careful about, and Ryan and Wes remember some of those steps, some of those slips; then you have to maintain the code that deals with the edge cases that can happen [00:18:00] somewhere, right? You’d like to have a much cleaner implementation, but because the initial spec was under-specified in some areas, you have to deal with the three different ways it’s been implemented because of a lack of specificity.

So that’s something you have to deal with when you store data in a format: you need to be able to read it, and it stays with you forever. And I think that has informed all the things along the way. It made us more careful over time about how we specify the metadata [00:18:30] of things. And I think I’m also taking some of that into the way OpenLineage is being designed; it really takes some of that into consideration. Making sure we enable versioning of different aspects of the spec independently, and that we identify the origin of the metadata. I think in the case of OpenLineage it’s a little more forgiving, because this kind of data is more transient, right? How [00:19:00] the data was transformed a long time ago may be less relevant, but it’s something we definitely learned from in terms of what’s the best way to specify metadata, how to make sure there’s no room for interpretation, and being ready to revisit those aspects.

Sanjeev Mohan:    That is great, thank you. Wes, how about yourself?

Wes McKinney:    Yeah. I would echo the other sentiments around [00:19:30] having the benefit of hindsight; it’s hard to predict what will be most important to people in the future. In the Arrow project, for example, early on we were faced with a choice: do we invest in hardening, cross-language testing, interoperability, and data access tools, or in building computing frameworks which are Arrow native? At that time, we had Dremio, which [00:20:00] was Arrow native from the very start, but we didn’t have Arrow native computing frameworks for Python and C++ and other programming languages. And so now that we’ve built this hardened standard for interoperability, made it stable, and gotten lots of systems to adopt it, we’ve got tons of people clamoring for computing frameworks which are Arrow native so that they can process all of this new Arrow data that’s flying around in their systems.

So it’s easy for people to criticize and say, hey, you should have been working on [00:20:30] more computing frameworks that are Arrow native three or four years ago, but at that time we were really concerned with getting people to use Arrow in the first place. I think there’s always that kind of thing, people showing up at the project later and questioning your choice of priorities throughout the lifetime of the project, but I would say in all of these projects we’ve applied our engineering best practices and our learnings from other projects that we’ve done in the past, and [00:21:00] replicating what worked while trying to avoid the mistakes of the past has generally led us down very reasonable paths.

I think there’s also something to be said for working in the Apache Software Foundation, working in the open: our ideas are being continuously vetted by the community and presented for comment. So that kind of sanitizing of the work in the light of day, I think, really helps [00:21:30] avoid doing something which is obviously a bad idea, because people generally will chime in and say, hey, I don’t know about that, maybe we shouldn’t solve the problem that way.

Sanjeev Mohan:    So Wes, which products today are using Apache Arrow? [inaudible 00:21:48]

Wes McKinney:    Yeah, so it’s used very actively in the Python world as a means of fast data access and interoperability with pandas. Arrow’s [00:22:00] used to connect pandas with Apache Spark, with Parquet files or ORC files. We built a framework for fast data transport called Flight, which is built on top of gRPC. There’s an extension for Flight called Flight SQL, which is intended to be a replacement for ODBC and JDBC. So we’re really keen to see the overall speed of data access holistically improved across the board. Arrow’s also been really successful as a preferred bulk export format [00:22:30] for data warehouses, so it’s been adopted in Google BigQuery and Snowflake and, I’m blanking on the other name I was thinking of, but data warehouse vendors have adopted it and see it as the fast on-ramp for data into tools like pandas or other systems that already speak Arrow. So that’s been exciting to see.
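
[Editor’s note: a hedged sketch of the Flight data-transport pattern Wes mentions, using pyarrow.flight. It assumes a Flight server is already running at the address shown and understands the made-up command string; Flight SQL adds a standard SQL command layer on top of this same pattern.]

```python
# Pulling a result set over Arrow Flight: the server streams Arrow record batches
# back over gRPC, so the client never re-parses rows.
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")                   # hypothetical server address
descriptor = flight.FlightDescriptor.for_command(b"SELECT * FROM db.events")  # made-up command
info = client.get_flight_info(descriptor)

# Each endpoint carries a ticket; do_get streams the Arrow data for that ticket.
table = client.do_get(info.endpoints[0].ticket).read_all()
df = table.to_pandas()
```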

Sanjeev Mohan:    That’s great. [00:23:00] Yeah, great progress. I’m sure you had no idea at that time that this is how far Arrow could go.

Wes McKinney:    Well, it was certainly my objective, so I’m happy that we were successful in achieving our goals. I think that interoperability with the JVM, the Java ecosystem, was one of the big problems facing the ecosystem in 2014, as Python was [00:23:30] becoming more popular, because a lot of the Hadoop ecosystem was Java based, and so there was very much the concern of how do we get data out of the JVM and into native code, C code or Python code. So, to have that interoperability problem for large tabular data effectively solved between the JVM and the outside world is a real boon for the community and has made application development a lot simpler for [00:24:00] other open source developers.

Sanjeev Mohan:    That’s great, that’s fantastic. So I want to go back in the remaining five minutes we have. So Ryan Blue, what excites you, what’s next on your plate?

Ryan Blue:    I’m really excited about, I think, the space that Nessie occupies and sort of evolving metastores. And also, I think part of that is making [00:24:30] the ecosystem just easier and easier to use, continuing that evolution. For a long time, you needed an army of data engineers, as well as platform people. I’m happy to see things sort of coalescing around these technologies, things getting easier to use and companies that are making it possible to have a small organization that can take advantage of this stuff rather than [00:25:00] needing that 20 person data platform team.

Sanjeev Mohan:    That’s great. So you talked about Nessie, so Ryan Murray, what would you say? He’s stolen some of your thunder, but you can still hope, right?

Ryan Murray:    I agree completely with Ryan. I think what we’ve seen is that it’s starting to happen now, and what I really hope to see in the near future [00:25:30] is for people to stop talking about Parquet and Arrow and Iceberg and Nessie and stuff. A lot of data engineers I’ve interacted with are thinking about how big is my Parquet file and which directory does it belong in so that partitions get taken advantage of, and how do I make sure it has the right schema, and all this kind of stuff. And I think we’re so ready to just stop talking about that, so that engineers can just start writing [00:26:00] SQL and applications on top of these things and move beyond all these great things that have been created. Not move beyond, because they’re still under the hood, of course, but let’s raise the level of abstraction so that people can start doing interesting things on top of what we built.

Sanjeev Mohan:    That’s great. I have noticed that data engineering is probably the one space that is most in flux in data analytics. [00:26:30] Why I say this is because traditionally data engineering has been done in a silo, where the data engineer has written code, but now we are seeing more DevOps pieces happening, more automation happening. It seems like it’s the last bit of data and analytics that needed to catch up with all the best practices of the SDLC. Would you say that’s the case?

Ryan Murray:    Yeah, there are some pretty good [00:27:00] memes you can find on Twitter and LinkedIn and stuff, of data engineers’ CVs, or data engineer job specs, which are effectively: can you do everything that has ever been invented? I’d love to see the data engineer job spec get a little bit smaller.

Sanjeev Mohan:    Yeah. That’s so funny, right? So, and Julien, how about yourself? What’s exciting for you?

Julien Le Dem:    Yeah, I think along those lines, when you talk about operations, the data [00:27:30] ecosystem is quite behind compared to the services world, right? In the services world there’s the notion of SLAs; a lot of things are well aligned. Whereas in the data world, we’re just starting to adopt all of this, right? So the notion of data observability: tying together data quality metrics and how the transformation is happening, how the code has been deployed, how this is changing over time, right? Because you have this complex graph of dependencies [00:28:00] of transformations, and you may be changing your product, or changing how you collect data about your product, or changing your source of data. You may have many sources of data that your ETL system needs to ingest, and then a lot of transformations. And all of that is constantly changing. You have a bunch of teams who consume and produce data, and you need a lot of observability of how everything’s changing, so that if your dashboard is wrong or the machine learning model is 2% less accurate and that impacts your bottom line, [00:28:30] you can really quickly understand where this is coming from.

So really that’s why I’ve been pushing on OpenLineage, really taking a page out of the Arrow playbook on how to start an open source project by reaching out into the community and building these things together. By the way, today the announcement went out that it’s part of the LF AI & Data Foundation, which is the equivalent of the CNCF, but for data. We are really making this a reality: having a standard for [00:29:00] how we express those transformations. For every job that runs, what were the inputs? What were the outputs? What was the version of everything? Keeping track of that and building this lineage graph really [inaudible 00:29:11], every transformation that’s happening, understanding the lineage of it in this vast ecosystem. So that’s kind of the layer on top of everything you talked about with Arrow, Iceberg, Nessie and the transformations on top: OpenLineage. I’m really excited about [00:29:30] bringing this visibility across the ecosystem and having the same kind of conversations that have been happening with Arrow over the past few years.

Sanjeev Mohan:    Yeah, which takes us to Wes. Wes what’s exciting for you next?

Wes McKinney:    Yeah. I’m really excited to take Arrow into its next stage of development, which is building and deploying Arrow native computing engines into all of the different places where data [00:30:00] is accessed and data is processed. So I think Arrow has been really successful as a tool for accelerating data access, but it hasn’t yet made its way in as a system or a toolbox that can accelerate everyday analytical computing in a lot of the projects that you use in your day-to-day work. So our goal for Arrow is for it to be largely invisible to you, something that the developers of other projects that you use can incorporate into their systems to make them faster, more efficient, [00:30:30] and more scalable, so that you, as an end user, wouldn’t have to think as much about it. So that’s what I’m excited about, and that’s where I see the next several years of my open source development career taking me.

Sanjeev Mohan:    That is fantastic. Amazing story. I’m so happy to see the journey of where you guys started and what’s coming next. Stay tuned, everyone. Thank you for joining this panel, really [00:31:00] enjoyed having everyone on. Until next time, thank you, bye-bye.

Julien Le Dem:    Thank you.

Ryan Murray:    Thanks.

Ryan Blue:    Thanks.