March 1, 2023

9:35 am - 10:05 am PST

Ten years of Building Open Source Standards: From Parquet to Arrow to OpenLineage

Over the last decade, Julien Le Dem has been lucky enough to contribute to a few successful open source projects in the data ecosystem. In this talk, Julien will share the story of how these projects came to be and what made their success possible. He will describe the ideation process and early growth of the Apache Parquet columnar format and show how that led to the creation of its in-memory alter-ego Apache Arrow. He will end by showing how this experience enabled the success of OpenLineage, an LF AI & Data project that brings observability to the data ecosystem. Along the way, Julien will talk about the key elements that catalyzed their growth, from project focus to governance and community.

Topics Covered

Open Source


Note: This transcript was created using speech recognition software. It may contain errors.

Julien Le Dem:

Hello everyone. I’m Julien. Today I’m talking about 10 years of building open source standards. I’ll start with Parquet, go to Arrow, and end with OpenLineage, which is a natural conclusion to this talk. So it’s organized in three chapters, and along the way I’ll talk about the story, how those things happened, and the lessons learned for each part. So I’ll start with the birth of Parquet. This starts with a little bit about me. 15 years ago I was at Yahoo working with Apache Pig, and, you know, Apache Pig is kind of a less-used project nowadays. It’s been mostly replaced with Spark as the better MapReduce, as it was at the time. But that’s the project where I got my first committership on an Apache project.

I started as a user, moving to contributor, to committer, to PMC member, to eventually being PMC chair for a year. And so it was a great experience: that path let me learn and understand how those different roles work and how the project works, and understand this notion of becoming part of the project, not just a user. Around that time, that’s also when the Dremel paper was published and I first read it, which has been seminal to some of the things I’ve done later. So after that I moved to Twitter, through this open source community. There were a lot of people at Twitter, LinkedIn, and Netflix working on similar projects, and that was the context for the inception of Parquet. So at Twitter there was Hadoop, and Hadoop would scale a lot, you know, be able to store a lot of data or process a lot of data, but it was high latency, right?

So you would start a job, you would go get coffee, and then you would get the result, right? But it was pretty cheap. You could put a lot of data in it and figure things out later. Meanwhile, we also had Vertica, which was a data warehouse, on-prem at the time. That provides more interactive queries, and you can also store data in it, but it’s not as scalable. It could not store as much data, and it was more expensive. So there was always this tension: you would have a better experience querying data in Vertica, but you didn’t have all the data that you had in Hadoop, right? So there was a constraint. So part of that thinking was, can we make Hadoop more like Vertica? And thinking about how Hadoop is a MapReduce framework on top of a distributed file system.

Whereas Vertica is a distributed query engine on top of columnar storage, right? So some of that thinking was, oh, how do we get the best of both worlds? One thing that was interesting at Twitter is we had this paper reading group. People would read papers, present them to the group, and explain things. And I got to read the Vertica and C-Store (the precursor to Vertica) papers, the Dremel paper again, and the MonetDB papers, to understand better the architecture of those MPP databases and how that works. And digging a bit back into the columnar layout: what’s the columnar layout? Well, if you have this table representation in your head, you have this two-dimensional table with columns and rows, right? But when it’s physically stored on disk or on storage, it is linear, right?

You have to turn this two-dimensional thing into a one-dimensional list of bytes. In a row layout, you just put each row one after the other, so you interleave data of different types. In a columnar layout, you put all the data for each column together, right? Which means that if you’re reading only a few columns, you can read a lot less data, because it’s easy to quickly scan just that data, and it’s easier to encode and compress that data better. So it has a lot of benefits for retrieving data from disk, from the storage layer, a lot faster than in a row layout, which kind of assumes you retrieve everything before you do anything with it. So that led to the birth of Red Elm at the time. A red elm is a tree, and at Twitter they named everything after birds, and birds live in trees.
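To make the row-versus-columnar distinction above concrete, here is a minimal sketch in Python. It is not from the talk and is not how Parquet encodes data; the table contents and the function names `row_layout` and `columnar_layout` are illustrative assumptions. It only shows how the same two-dimensional table flattens into two different one-dimensional sequences.

```python
# Illustrative table (made-up values): each tuple is one row of (id, name, score).
rows = [
    (1, "alice", 3.5),
    (2, "bob", 4.0),
    (3, "carol", 2.5),
]


def row_layout(table):
    """Row layout: the values of each row are stored one after the other."""
    return [value for row in table for value in row]


def columnar_layout(table):
    """Columnar layout: all the values of each column are stored together."""
    return [value for column in zip(*table) for value in column]


row_major = row_layout(rows)
column_major = columnar_layout(rows)

n_rows, n_cols = len(rows), len(rows[0])

# Reading only the "score" column (index 2):
# in the columnar layout it is one contiguous slice ...
scores_from_columns = column_major[2 * n_rows : 3 * n_rows]
# ... while in the row layout it is scattered, one value every n_cols positions.
scores_from_rows = row_major[2::n_cols]
```

In the columnar sequence the scores sit next to each other and share a type, which is what makes scanning a few columns and compressing them cheaper. In the row sequence the reader has to step over the other columns to collect the same values.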

So I thought that would be a good name as well. And that was the intersection of implementing the algorithm in the Dremel paper; looking at the file formats that existed at the time, like TFile, RCFile, sire, which had good things and things that were missing; and looking at the schema systems we were using at Twitter, which were Thrift, which is similar to Avro or Protobuf, and Pig, which had its own schema model. So combining all three things led to the design of Red Elm. And I got the first commits going in August 2012. And Red Elm is an anagram, right? The project was actually quite ambitious at the time for a side project I was doing mostly on the shuttle ride to and from the office, as Red Elm is an anagram of Dremel.

And I was envisioning at the time, well, we are going to rebuild the entire thing. But also, starting from the storage layer, I figured that if I built my own proprietary file format, I would have to rebuild everything and integrate with everything, and that was too big of a task for one person or even one team. So I started seeking partners and contributors. So, you know, this is a bit of a tongue-in-cheek tweet from Twitter at the time, about implementing the algorithm from the Dremel paper. But that led to the connection with the Impala team at Cloudera, right? Because obviously I was not the only one seeing this problem. There were other people looking at how we make Hadoop more like a database and less like, you know, brute-force scanning of data.

So that led to the creation of Parquet. From the Red Elm side of things, you know, there was the columnar representation, well integrated in the Java ecosystem, looking at MapReduce and Pig and nested structures. On the Impala side, they were looking at building SQL-on-Hadoop, as this category was called at the time; now it’s more like the lakehouse model. And they were looking at having a columnar representation and code generation, to be able to have a distributed query engine. So that sounded like a great thing, because, oh, they’re already working on the distributed query engine part. We just need to connect the dots and make things work better. So we agreed on joining forces. We merged designs, I reworked a lot of my Java side of things, they implemented it on their native side of things, and that became Parquet.

And we picked a name that would better reflect being the bottom layer, the storage layer, of the project instead of the whole thing. So that’s how we created Parquet. If we look quickly at the timeline, we started from the announcement of Twitter and Cloudera collaborating, and it was not just one company. Quickly, several companies became interested, like Criteo, which is an ad targeting company in Europe. Hive support was added. Netflix adopted Parquet. That was a huge point in the timeline. Apache Drill adopted Parquet as a baseline for storage. Then it entered the Apache Incubator, becoming part of the Apache Foundation, got integrated in Spark, became a top-level project, added the C++ implementation, and then it reached escape velocity. In a few years it got on the trajectory where today it’s integrated in all major warehouses, everything, plus Spark. You can even read Parquet from Excel if you want.

And so, the lessons learned along the way of building Parquet. I’ll stop in my story to talk a bit about that. One important thing is that every contributor is a stakeholder in the project, right? So you join, and you’re not just, like, taking the thing and being a consumer. You become part of the decision-making process, right? You become a part of making the project successful: helping with adoption, providing input, being part of the decision making. And this is really important to an open source project like this, and to establishing a standard: really making sure users and contributors become stakeholders. And there’s a snowball effect to it, right? You see, in Parquet, initially it was just two companies, but even that is really important to show there’s momentum, especially with Twitter being vendor-neutral, showing this is going to be neutral and not favoring anybody in the process, and quickly gaining momentum and moving fast. And open source comes in all shapes and sizes. You know, the most basic definition of open source is that the code is available. But the code being available doesn’t mean you can use it, right?

That’s where licenses are important, and, you know, you have various types of licenses. The Apache License is very permissive: all you have to do is give credit when you’re using that code. The GPL is more restrictive: if you build something on top of it, it also has to be open source and follow the same license. And then a project can have an open license, but you still have no control over the direction of the project. That’s where governance comes in. Governance is about being clear: can you become a committer? Can you even contribute to it? If you can contribute to it, do you have any say in the future of the project? Do you have any path to become a decision maker in the project? That’s where governance is important, and having it clearly stated how people participate in the decision-making process, whether there is a voting mechanism, and so on.

And the last thing is, you can have governance without being part of a foundation. But the foundation, the Apache Foundation and the Linux Foundation being the two big open source foundations, is what guarantees that the license of the project is never going to change, and that the governance of the project is always going to follow some particular rules: being an open project, being inclusive, and bringing people in. So, you know, any given project may or may not follow those things. What’s important is being upfront about it. All shapes and sizes are fine; it’s just important to understand what constrains the direction of the project. So the second part is from Parquet to Arrow. There was the beginning of a conversation in the Parquet community that we also needed an in-memory columnar format.

And as part of the discussion, it became clear that it looked an awful lot like the in-memory columnar format that the Apache Drill project had built. And the reason there’s a need for this is that on disk is for fast retrieval and in memory is for fast execution. If you look at the MonetDB paper, that’s very seminal to this kind of thinking about vectorization and how you take advantage of modern processors to process data better. Having this in-memory columnar representation, so you can process data much faster using SIMD instructions and these kinds of things, is valuable. So we kicked off with a group of people around this initial consensus: Parquet for fast retrieval, and Arrow for fast in-memory processing, vectorized execution, and zero-copy transfer. And an initial group of people, who came from the Parquet community and a little bit around it, created this Arrow project.
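As a rough illustration of the in-memory columnar idea described above, the sketch below uses only the Python standard library (`array` and `memoryview`); it is not the Arrow API or the Arrow memory format, and the column name and values are made up. It shows a column held as one contiguous, typed buffer, and a consumer looking at part of that buffer without copying it. Plain Python does not issue SIMD instructions; the contiguous, single-type buffer is simply the kind of layout that vectorized engines rely on.

```python
from array import array

# One column of 64-bit floats stored as a single contiguous, typed buffer
# (hypothetical "prices" column; values are illustrative).
prices = array("d", [10.0, 12.5, 9.75, 11.0])

# A memoryview exposes the same memory without copying it.
view = memoryview(prices)

# Slicing a memoryview is also zero-copy: `window` refers to elements 1..2
# of the original buffer, not to a new list.
window = view[1:3]

# Because the memory is shared, an update to the column is visible
# through the existing view.
prices[1] = 99.0
shared_value = window[0]

# A tight loop over one homogeneous buffer: the access pattern a
# vectorized engine would process in batches.
total = sum(view)
```

The design point is that producer and consumer agree on one layout, so handing data over becomes a matter of sharing a buffer instead of serializing and re-parsing it.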

So if we look at the timeline, we had this kickoff from requirements: getting an initial group together, getting an initial spec together, establishing it as a top-level project in the Apache Foundation, spinning off the Java code from Apache Drill, and starting the C++ implementation. And it’s a similar motion. You see this timeline starts at the end of my previous timeline, where I stopped for Parquet. It kept growing in momentum: it became the basis for the new pandas implementation, making it much faster, and it got integrated in Spark, because the pandas-to-Spark integration would be much faster, and so on. And nowadays, if we fast forward to today, you see major warehouses are also supporting Arrow as a better way to exchange data with Python or other things people are using. And DuckDB is this kind of hot new thing that has been built on top of Arrow, right?

So it’s like SQLite for OLAP, or local mode for your data warehouse. Big data is dead; your big data fits in memory now. Machines got so much bigger, and there are lots of processors. So it’s funny to see: we went from, you know, appliances to distributed computing with a lot of computers, and back to a beefy machine that can do a lot of parallelism and a lot of processing. So, lessons learned along the way, right? One is to bootstrap the community with an initial spec. If we’re pushing this snowball to get the snowball effect, you want to make sure the ball is as big as possible when you get started, right? If you like exponential curves, well, exponential curves at the beginning are insanely flat, and it takes a while to get to the inflection point.

So if you want to build a standard, start by bringing a bunch of influences and perspectives into the initial spec, and people to help push it, making sure you’re going to have a lot of adoption initially, right? So that you can get this going faster. And then it’s important to find like-minded people who will drive this vision. You don’t need to convince everyone at the beginning, right? You always have the people who wait to see if a project is going to be successful before they adopt it. And then there are the people who understand that because they adopt it, because they’re going to put their weight behind it, it’s going to be successful, and they’re going to benefit, and everybody’s going to benefit from the project, right? So you need to start with the visionaries, and you just need to find a few people, a few groups, a few companies or organizations or other projects that see the vision, agree with it, are aligned, and are going to help push it.

And then you get the early adopters, people who more easily see that vision and see the initial success of the project, and who are going to adopt it and grow the community. And eventually it becomes mainstream, right? In the end, the rest of the people get convinced because the initial early adopters see the benefit from it. So you just need to keep this going and start with the right subset of people who are going to contribute to building that vision. And really, it’s about the connections we build along the way, right? I’m presenting this like there’s a plan, and if you do this, then you’re going to be successful. But it’s about the journey, not about the destination, right? The trick is we really enjoy collaborating and building those connections and getting people to work together and making it happen, right?

This is an enjoyable project, and that’s what makes the whole thing work. People have sometimes told me, oh, you’re doing open source the right way, sometimes meaning it’s the hard way, right? Because you include everyone, you do the thing. Yeah, but that’s the way it works, and it’s enjoyable to be able to collaborate with those smart people along the way. So on that, I’m going to my next logical step, right? Talking about OpenLineage. And, sorry, next slide. So the main problem, thinking about what’s the next thing: why do we need OpenLineage, right? If you look at the way people do data transformation and depend on each other, usually it works well inside of each team. They have their own practices, they know how they do things, and they know how things depend on each other.

Usually there’s a lot of friction in between teams, because that’s where there’s a lot of opacity. Where is this data coming from? Where is it going? Who’s consuming it? Like, not understanding dependencies beyond a certain bubble you live in. And that’s where, you know, there’s this Maslow’s hierarchy of needs: to reach happiness, first you need shelter, food, safety, and then you can think about being happy, being the best version of yourself. There’s something similar with data. A lot of the time you stay under the waterline, like you have your head underwater. First the data needs to be available, it needs to be fresh, it needs to be correct, and then you can start building on top of it. And a lot of the time we spend too much time on, oh, the data is not available, it’s not fresh, I need to fix it.

None of that, right? So it’s really about how we get to the next level and get rid of those problems, so everything’s good. So that led to the inception of Marquez. Being the architect for the data platform at WeWork, one of the big missing pieces in this data ecosystem, in my opinion, was a place that knows about all the jobs, all the datasets, all the things that exist, and how they depend on each other, right? So we started Marquez at WeWork as part of the data platform. And, same thing, bringing in other opinions, validating the use case. We wanted to make sure this was going to be something reusable and integrated in the ecosystem. You want this to be open source, so it integrates in the open source ecosystem and you don’t have to do everything yourself, integrating with everything and reinventing and reverse engineering everything to get lineage and understand dependencies.

And then that led to the creation of Datakin, focused on the data observability and data reliability aspect, and now part of Astronomer. So before OpenLineage, with Marquez, we were part of this ecosystem. You have all those things that people can use to collect lineage, but they all have to understand the internals of all the data transformation layers that everybody’s using. So it leads to this incredible complexity and duplication of effort. Everybody has to understand everything about everything to collect lineage. And we were like, okay, we don’t need to compete with everyone on doing this. Everybody needs that same layer. How do we expose lineage? How do we extract lineage? How do we understand how everything works? How do we make just that small piece reusable for everyone and solve it as an industry, as an ecosystem? All together, we get together and we solve it.

So that led to the creation of OpenLineage. We spun it out of the Marquez integrations and reached out to a group of people, really taking a page out of the Arrow playbook on how we bootstrap the community, how we get enough weight behind this at the beginning so we can bootstrap this snowball effect and get it going. And it fits in as the spec and protocol layer: how we exchange lineage between all the producers of lineage and metadata, data quality metrics, right? Schema, column-level lineage, cost, all that metadata you can collect around this operational lineage and push into all the metadata layers. Here I’m focusing on the open source ones, but that works for commercial ones as well. And so, going a little faster: the possibilities are endless. You can do governance, you can do compliance, you can focus on privacy use cases, reliability use cases, banking regulation, all kinds of things.
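To give a feel for what a shared lineage event might look like, here is a simplified sketch in Python. The field names (`eventType`, `eventTime`, `producer`, `run.runId`, `job`, `inputs`, `outputs`) follow the core shape of an OpenLineage run event, but this is a pared-down illustration, not the full specification and not the official client library. The helper names `make_run_event` and `lineage_edges`, the namespace, the producer URL, and the job and dataset names are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, job_name, inputs, outputs, namespace="example-namespace"):
    """Build a simplified lineage event as a plain dict (illustrative only)."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/hypothetical-producer",  # assumed placeholder
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
    }


def lineage_edges(events):
    """Derive (source, target) edges: input dataset -> job -> output dataset."""
    edges = set()
    for event in events:
        job = event["job"]["name"]
        for dataset in event["inputs"]:
            edges.add((dataset["name"], job))
        for dataset in event["outputs"]:
            edges.add((job, dataset["name"]))
    return sorted(edges)


# Two hypothetical jobs, each reporting what it read and what it wrote.
events = [
    make_run_event("COMPLETE", "clean_orders", ["raw.orders"], ["clean.orders"]),
    make_run_event("COMPLETE", "daily_revenue", ["clean.orders"], ["reports.revenue"]),
]

# What a producer would send over the wire, and what a consumer can rebuild from it.
payload = json.dumps(events[0])
edges = lineage_edges(events)
```

The consumer here never needs to know how either job works internally. It only reads the agreed event shape, which is the duplication of effort the shared spec is meant to remove.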

And so OpenLineage, you know, is something we started, and we joined LF AI & Data, which is a sub-foundation of the Linux Foundation, very similar to the Apache Foundation in terms of having a neutral body that makes sure the project is always going to be neutral and always have clear governance. And then there’s a bunch of support from open source projects like Spark, Airflow, and Great Expectations. And then commercial solutions also came into the game and became interested in being part of this. So Microsoft Purview is a big contributor to and consumer of OpenLineage, there’s also a Snowflake Labs implementation, and we’re working with Manta, which is a company focusing on lineage that supports OpenLineage, and a bunch of other things. And it’s growing, same thing, getting the same momentum and getting adoption, because there’s this need here.

And so if I go and finish my talk with the lessons learned here: you know, I was telling someone, I’ve done Parquet, I’ve helped get Arrow off the ground and get this spec going and adopted, and we’re doing the same thing with OpenLineage. And they said, what makes you think you can do it again? Well, the trick is, it’s not like you start a project and you have some mind trick and you convince everyone to use it, right? That’s not how it works. You start with, oh, there’s this thing that everybody needs, right? How do we get a very focused, very single-minded project on that thing that everybody needs? We align everybody’s incentives: everybody benefits when this thing happens. We empower the community: we make sure everybody’s involved, they become stakeholders, and they understand that they benefit, that everybody benefits.

And there’s this virtuous cycle happening, and then success happens, right? It starts with that. And I have this stone soup story. Stone Soup is a children’s story where someone comes to the town square in a village, and they have a pot, and they start stirring the pot, and they put a stone at the bottom of the water. And if people ask, what are you doing? Well, I’m just making a stone soup, but you’re welcome to add your own ingredients to it and make it better, right? So basically they’re making a soup out of nothing, right? The stone doesn’t bring anything to the hot water, but they’re sharing the pot, they’re making it available, they’re creating this focal point. And people come with carrots, people come with leeks, with potatoes, and make the soup better.

And just because someone is stirring the pot, making it available, and making it easy for everybody to contribute, it happens, right? So that’s exactly what I’m doing with OpenLineage. I’m just stirring the pot, enabling everybody to contribute and be part of this, and making it happen. And really, it’s all about aligning incentives and building a network effect. You know, sometimes people say, oh, you’re boiling the ocean, right? That’s not true. Because if I were boiling the ocean, or pushing a rock up a mountain, then as I get more people, we would just be linearly making it faster, right? We boil the ocean a little faster every time someone comes. It’s not like that. It’s not linear, it is exponential, because every time someone joins, they convince other people, right? There’s this network effect where every person who gets convinced is going to become, like, a visionary and a seller for the others, and can convince other people that this is happening.
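The linear-versus-exponential contrast above can be put into a few lines of arithmetic. This is a toy model, not data from any of the projects: the starting size, the number of rounds, and the assumption that each participant convinces exactly one new person per round are all made up for illustration.

```python
def linear_growth(initial, added_per_round, rounds):
    """Only the original group recruits: a fixed number of people join each round."""
    return [initial + added_per_round * r for r in range(rounds + 1)]


def network_growth(initial, convinced_per_person, rounds):
    """Everyone who has joined recruits: each member convinces others every round."""
    sizes = [initial]
    for _ in range(rounds):
        sizes.append(sizes[-1] * (1 + convinced_per_person))
    return sizes


# Hypothetical numbers: 4 founders, 5 rounds of outreach.
linear = linear_growth(initial=4, added_per_round=4, rounds=5)
network = network_growth(initial=4, convinced_per_person=1, rounds=5)
```

Under these assumptions both communities look identical after the first round (8 people each), which is the flat start of the exponential curve. By round five the linear community has 24 members and the networked one has 128.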

And the magic happens not when the original people who work on the project convince everyone, but when those people talk to each other and then convince other people, and there’s really this network effect. And so to conclude, you know, there are those lessons learned. People are stakeholders, right? They’re not just consumers. There’s a snowball effect: you gain momentum, and it gets faster as it goes. Open source comes in all shapes and sizes, and that’s fine. You want to bootstrap the community to get the snowball going faster at the beginning. Collaborate with other visionaries and trailblazers. Enjoy the connections you build along the way. Find what everybody needs, fill that need, and it’s going to be successful. And, you know, stir the pot of the stone soup, align incentives, and this network effect happens, and it’s working. And that’s it. On that, I would like to thank you.