May 3, 2024

Unified dataframe API for different data backends

As data use cases become more complex, data platforms evolve to meet their needs. Data workloads run in different environments (local and production), in different modes (batch and streaming), and across different hardware (CPU and GPU). This talk introduces Ibis, an open-source project that allows users to work with different backends in different settings. We’ll go into how Ibis works under the hood, common use cases, and the roadblocks towards a unified API vision.


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Chip Huyen:

Hi, everyone. I’m really excited to be here today. Today I’m going to talk about a unified DataFrame API for different backends. I’m part of Voltron Data. Our company invests a lot into open standards and open data formats for data systems, so we contribute to and maintain projects like Apache Arrow, Ibis, and Substrait. A lot of people ask us, “Okay, you can’t make money from open source, so how do you make money?” The other area of our work is our commercial product, a GPU-native query engine called Theseus. Data processing turns out to be very naturally parallelizable, which makes a lot of these workloads ideal for GPUs. We have found that for a lot of companies, moving their data processing workloads from CPU to GPU not only saves a lot of money but also speeds up their workloads considerably. Here’s a benchmark, but this is not the focus of the talk. Basically, our sweet spot is companies that have a lot of data, over 30 terabytes or so; that’s when you see the most benefit from Theseus.

Diverse Data Systems

Okay. So, our talk today is on a unified interface, and whenever this topic comes up on Reddit, Twitter, or LinkedIn, somebody always posts this XKCD comic, because it’s really hard to standardize anything. However, I hope I can convince you through this talk that it’s not only doable, but that we’ve had a lot of success, not just with Ibis, but with other projects like Apache Arrow. One thing I hope we can all agree on is that data is extremely important, especially with the rise of AI in recent years. We have data for many AI use cases, and those diverse data needs have led to a lot of diversification in data systems. One divergence is between local and distributed data. For a lot of people, when you do experimentation, you want to move quick and dirty, so you run experiments on your local machine or laptop; but when you deploy the system, you have to move it to some remote engine like Ray, BigQuery, or Spark. Another divergence, a more recent one from maybe ten years ago, is the rise of streaming: now we have batch systems and streaming systems. A very common pattern, especially in machine learning use cases, is to train a model on batch data, but then serve it in production on streaming or real-time data. And we see a lot of pain caused by the mismatch between batch and streaming. For example, there’s a very interesting report from Pinterest, where they found that around 90% of their streaming operations were actually inconsistent with the results they would have gotten from their batch systems. The newest divergence is with the rise of GPUs.

So, now we have CPU infrastructure and GPU infrastructure. A lot of data science tooling is not built for GPUs; for example, NumPy, pandas, and even scikit-learn are not built for GPUs. We’ve seen a lot of effort go into making these systems work on GPUs. For example, NVIDIA has a whole RAPIDS team to make things like Spark run on GPUs, with a lot of success. And because pandas doesn’t run on GPUs, we have drop-in replacements, like Dask, that can run on GPUs. And of course, our team is building Theseus to make data workloads run very well on GPUs.

Challenges

So, all of these divergences and the diversity of data systems lead to several big challenges. Firstly, it takes a lot of effort to build a custom data system. We’ve seen many companies having to build data systems from scratch; you’ve probably seen Meta and Netflix build their own, and it’s not just big companies, but a lot of smaller ones as well. Secondly, we see a lot of code rewriting and translation. I can’t really see anyone here, but I wonder: how many of you have ever had to translate code from pandas to PySpark? It’s quite painful. Thirdly, when you move workloads across different systems, there’s the challenge of inconsistency: something runs really well locally, but then just doesn’t quite work remotely. We’ve seen a lot of people complaining about inconsistent results when moving across different data engines. And of course, there’s the issue of lock-in. No matter how cool a system is, eventually it’s going to become outdated, and if you build a system that’s tightly coupled to one engine, it’s very hard to move to another platform later on.

So, because of that, just last year at VLDB, there was a paper jointly published by Meta, Databricks, Voltron Data, and Sundeck about composable data systems. The idea is that even though data systems are complex and there can be many different ones, they share very similar components, and if we can standardize those components, it becomes a lot easier for people to plug and play different components to create their own data systems. Here’s the outline from the paper. At the top, we have the interface, which can be SQL or non-SQL; in the non-SQL case, a lot of people use a DataFrame API so that people can write queries to interact with the data. Then we have an intermediate representation (IR), so the query can be translated into a query optimizer such as Apache Calcite or Orca. That, in turn, compiles down to execution engines, which can be Spark, Ray, or others. I’m not sure you can see my mouse, but one thing the paper didn’t really go deep into is a standardized memory format for the data: for different query engines to communicate with each other, there needs to be a standardized memory format that everything can understand. And that’s what Arrow was designed to be.
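To make the layering concrete, here is a minimal sketch of the composable idea: a user-facing expression is lowered to a backend-agnostic IR, and separate per-backend compilers turn that IR into engine-specific SQL. All names here are illustrative; this is not how any real engine implements it.

```python
# Toy sketch of the composable layers: expression -> neutral IR -> dialect SQL.

def to_ir(table, column, agg):
    """Lower a tiny 'dataframe' expression into a backend-agnostic IR dict."""
    return {"op": "aggregate", "table": table, "column": column, "agg": agg}

def compile_ansi(ir):
    """Compile the IR to plain ANSI-style SQL."""
    return f'SELECT {ir["agg"].upper()}({ir["column"]}) FROM {ir["table"]}'

def compile_backtick(ir):
    """Same IR, but a dialect that quotes identifiers with backticks."""
    return f'SELECT {ir["agg"].upper()}(`{ir["column"]}`) FROM `{ir["table"]}`'

ir = to_ir("events", "amount", "sum")
print(compile_ansi(ir))      # SELECT SUM(amount) FROM events
print(compile_backtick(ir))  # SELECT SUM(`amount`) FROM `events`
```

The point is the separation of concerns: the front end only produces IR, and each backend only consumes IR, so new front ends and back ends can be mixed freely.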

Unified Interface

So, today I want to focus mostly on this layer: a unified interface across different query languages. You can think of what we’re building as something that sits on top of different engines. We call this project Ibis. When we set out to design Ibis, which aims to be a unified DataFrame interface, we thought about requirements: what would a unified API need to do for it to work? First, the API has to be extensible to a wide variety of backends. There are many, many different backends for different purposes, right? We talked about batch and streaming, local and distributed, CPU-native and GPU-native. The API has to support most of these engines, or at least the ones users care about. Second, it has to give native performance, because we sit on top of these different engines: if people write code in this API, it has to run just as fast as if they had written the code against the native backend. And last, it has to give consistent results across frameworks: if you write code in this API, run it on one backend, and then run it again on Snowflake and get different results, that would be pretty alarming. That’s what we set out with in mind when we built Ibis. And Ibis today supports over 20 different engines, across both SQL engines and DataFrame engines, as you can see in this image.

So, the way Ibis works is that you write code against the Ibis API, and we compile it into an intermediate representation of relational expressions, which is roughly SQL-equivalent. From there, depending on how or where you want to run the query, it compiles down to the backend. For example, if you write the code and say, “I want to run it on Polars,” we compile it to Polars. If you want to run it on Snowflake, we compile it to Snowflake SQL. And similarly, if you want to run it on Flink, we compile it to Flink SQL. This makes things very flexible: you can swap between different engines with just one line of code. I can show this here; one of our team members built this really cool tool. First of all, you can see the code across different engines. Here’s the code when you connect to a DuckDB table; if you want to connect to Flink, it’s the same code; if you want to connect to Spark, you just change the engine, and the code is very, very similar. This means that by using Ibis, you can support multiple backends out of the box, and if you want to change backends, for example from local to remote, you can do so by changing just one line of code. Cool. Here we also have a QR code; if you want to play around with it, you can click on it or scan it.
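The “swap engines in one line” pattern can be sketched without Ibis itself. Below, the same one-method query interface is implemented twice: once over an in-memory SQLite engine (Python’s stdlib) and once over plain Python lists. This is an illustrative sketch of the pattern, not Ibis’s actual API; the class and method names are made up.

```python
# Two interchangeable "backends" behind one interface: changing which
# backend runs the query is a single-line change in user code.
import sqlite3

class SQLiteBackend:
    """Executes the query by generating SQL against an in-memory engine."""
    def __init__(self, rows):
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE t (a INTEGER)")
        self.con.executemany("INSERT INTO t VALUES (?)", [(r,) for r in rows])

    def sum(self, column):
        return self.con.execute(f"SELECT SUM({column}) FROM t").fetchone()[0]

class PythonBackend:
    """Executes the same logical query directly on local Python objects."""
    def __init__(self, rows):
        self.rows = rows

    def sum(self, column):  # column ignored: this toy table has one column
        return sum(self.rows)

rows = [1, 2, 3, 4]
backend = SQLiteBackend(rows)  # swap to PythonBackend(rows): one-line change
print(backend.sum("a"))        # 10 on either backend
```

The query-writing code below the `backend = ...` line never changes, which is exactly the property that lets you move from a local engine to a remote one without rewriting your pipeline.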

Extensible API

So, the first requirement is that the API has to be extensible. Because we already support 20-plus backends, we’re pretty confident that the API is fairly extensible. The focus now is maintaining these existing backends, and when we say we support a backend, there are two aspects. One is coverage: the unified API has to cover most, ideally all, of the operations the native API can do. The other is performance, which has to be good. You can see here some of the PRs we merged and issues we worked on recently; we’re continually adding a lot of functionality. For all the backends we support, we have pretty good coverage, but these backends evolve over time, so we need to keep up to date with them and keep adding new features. There are also a lot of requests for new backends, and we’re slowly adding those as well. For the record, Dremio is a great company, and we do want to support Dremio; hopefully we can get to that soon. The second requirement for the unified API is that it has to run just as fast as the native backend. We’ve done a lot of benchmarks to make sure that by using Ibis, users don’t lose performance. You can see here that using Ibis on top of Polars performs very similarly to using Polars directly, and Ibis plus DataFusion is very similar to writing code in DataFusion directly. We’ve run benchmarks with a lot of other frameworks as well. In this case, we also tried to benchmark pandas, and pandas ran out of memory.
On a side note, I’m actually really surprised that pandas is still so popular, given that there are so many alternatives that aren’t as memory hungry, are a lot more performant, and are equally easy to use. For people still using pandas, do consider one of the many other DataFrame frameworks. You can see more at the link here if you want to read about the performance benchmarks we’ve done.

Deferred Execution with IBIS

The third requirement of the unified API is that it has to be consistent across backends. At this point, we have done a lot of testing to make sure the results are consistent. Here’s a screenshot of some of the tests; you can find them in the repository under ibis/backends/tests. We have tests that run across frameworks, and we also have tests for each individual backend. We also run TPC-H tests to make sure the Ibis API can properly support workloads across different backends. It’s been a lot of fun maintaining and building Ibis, so I want to go into one feature of Ibis that people like: deferred execution. Without it, when you connect to a remote engine and grab a table, something like pandas would read the entire table in, even if you only need one or two columns, which makes pandas run out of memory very quickly. We take a different approach: when you connect to a remote table, we only fetch the table’s schema, and we only execute when you actually run a query, so we only read the data you need for the execution. It makes things a lot easier to run.
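The deferred-execution idea described above can be sketched in a few lines: connecting yields an object that knows only the schema, selecting columns builds up an expression without touching data, and rows are read only when execution is requested. This is an illustrative sketch of the pattern, not Ibis internals; all names are made up.

```python
# Sketch of deferred execution: no data is read until .execute() is called,
# and then only the columns the expression actually uses.

class LazyTable:
    def __init__(self, name, schema, reader):
        self.name = name
        self.schema = schema        # cheap to fetch up front
        self._reader = reader       # callable: list of columns -> data
        self._selected = list(schema)

    def select(self, *columns):
        out = LazyTable(self.name, self.schema, self._reader)
        out._selected = list(columns)
        return out                  # still no data read here

    def execute(self):
        return self._reader(self._selected)  # data is touched only now

reads = []  # record which columns were actually fetched
def fake_reader(columns):
    reads.append(columns)
    data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    return {c: data[c] for c in columns}

t = LazyTable("events", ["a", "b", "c"], fake_reader)
expr = t.select("a")   # builds the expression; reads is still []
result = expr.execute()
print(reads)           # [['a']] -- only one of three columns was read
```

The key design choice is that `select` returns a new expression rather than results, so the backend can see the whole query before deciding what to fetch.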

Standardization of API

So, let’s get back to the original question: is it possible to create a standardized API? Everyone tries to standardize things, and the risk is that you just end up with one extra API instead of a unified standard. I would argue that standardization is inevitable, and we’ve seen it with a lot of projects; the closest example is Apache Arrow. Arrow was created as a standardized in-memory data format, and today it has been adopted by pretty much all the major data engines out there, and it does make a lot of things easier when that standardization exists. For Ibis, we’re seeing something similar. And I do believe in incentives: we can’t force people to do anything unless they see a reason why they should be doing it, and we do see clear value propositions for users to move to a more standardized API.

So, here are some of the incentives people are using Ibis for. One is that it makes it a lot easier to write the same code for both experimentation and deployment: you can experiment with anything locally, and once you’re happy, you can deploy to Snowflake pretty easily. One use case that we did not plan for initially, but find pretty helpful, is that you can now run tests across different engines. I was catching up with a friend recently, and he was complaining that they have a lot of different workloads across different engines, and to create an end-to-end test he has to do a lot of manual glue work. With something like Ibis (there’s a tutorial on the Ibis blog on how to orchestrate an end-to-end test across DuckDB and Snowflake), we do a lot of that glue work for you, so you can test a pipeline end-to-end with minimal overhead. And of course, we also do a lot of work to unify batch and streaming. Streaming data has become pretty common: many companies today don’t just have tables, they also have data coming in from Kafka, Redpanda, and other streams. Streaming has a lot of syntax that batch doesn’t support, first of all windowing syntax. By the way, I joined Voltron Data after they acquired Claypot AI, and one of the first things our team did was contribute streaming support to Ibis, extending the syntax to support a lot of the windowing constructs streaming has.
I’m pretty proud to say that Ibis is one of the first, not the first, but one of the more popular frameworks, to support both streaming and batch pretty well. Another big use case we see for Ibis is for data platforms.
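The cross-engine testing idea above boils down to a simple pattern: run one logical query on every backend and assert that all results agree. Here is a self-contained sketch of that pattern using stdlib SQLite and plain Python as stand-ins for two real engines; the function names are made up for illustration.

```python
# Sketch of a cross-backend consistency check: one query, many engines,
# results must agree.
import sqlite3

def sqlite_count_over(rows, threshold):
    """'SQL engine' backend: count rows above a threshold via SQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(r,) for r in rows])
    q = "SELECT COUNT(*) FROM t WHERE a > ?"
    return con.execute(q, (threshold,)).fetchone()[0]

def python_count_over(rows, threshold):
    """'DataFrame engine' backend: same logical query, computed locally."""
    return sum(1 for r in rows if r > threshold)

def check_consistency(rows, threshold, backends):
    results = {name: fn(rows, threshold) for name, fn in backends.items()}
    assert len(set(results.values())) == 1, f"inconsistent results: {results}"
    return results

print(check_consistency(
    [1, 5, 10, 20], 4,
    {"sqlite": sqlite_count_over, "python": python_count_over},
))  # {'sqlite': 3, 'python': 3}
```

With a unified API, the two backend functions collapse into one query definition, and the harness only varies the connection, which is what makes end-to-end tests across engines cheap.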

So, imagine you are a data platform that supports a lot of users, and users may each have their own backend. Occasionally people will request, “Hey, can you support this backend? Can you support that backend?” It can take a lot of effort to support each individual backend, but by adopting Ibis, you can support all the backends Ibis supports out of the box. Another use case we get asked about a lot is using Ibis to benchmark different backends. Imagine you’re building a new data system and you don’t know which query engine to use. You can look at published benchmarks, but they might not cover the queries you care about. So instead, you can write the set of queries you want to test, use Ibis to run them on all the backends you care about, and see which one performs best. And because Ibis queries perform just about as well as native code, that’s a pretty good indication of which query engine is best for you.
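The engine-shootout workflow just described is essentially a small timing harness: run the same workload through each candidate and compare the best wall time. Here is a toy sketch with two plain-Python “engines” standing in for real backends; a real comparison would need warmup runs, realistic data sizes, and result verification.

```python
# Toy benchmarking harness: same workload, interchangeable engines.
import time

def bench(backends, data, repeat=3):
    """backends: name -> callable(data). Returns name -> best wall time (s)."""
    timings = {}
    for name, run in backends.items():
        best = float("inf")
        for _ in range(repeat):           # keep the best of several runs
            start = time.perf_counter()
            run(data)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings

def loop_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

data = list(range(100_000))
timings = bench({"builtin_sum": sum, "loop_sum": loop_sum}, data)
print(min(timings, key=timings.get), timings)
```

With a unified API, the dict of backends becomes a dict of connections, and the workload is written once, which is what makes this kind of shootout a one-line-per-engine exercise.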

Benchmarking Different Backends

So, here is an example of someone who used Ibis to benchmark different backends, and that person seemed pretty happy because you can do so by changing just one line of code. We also do a lot of benchmarking internally. Here’s another example: I’m not sure if you’re familiar with the recent one billion row challenge, but we did it in Ibis just for fun, and it was very easy to benchmark, say, Polars against DataFusion using Ibis. Because of all these cool use cases, we’re actually seeing a lot of adoption recently. Ibis is an unusual project because it was started a pretty long time ago, around 2015 or 2016, by Wes McKinney, the creator of pandas. Wes had already discovered that pandas didn’t scale very well, and he wanted to build the next generation of pandas; Ibis was that project. However, at the same time, Wes also started Apache Arrow, and you can see that for a long time Ibis didn’t see a lot of growth and development because no one was looking at it. Our company started picking up, maintaining, and developing Ibis around 2022, and from there we’ve seen consistent adoption from quite a few companies.

How About SQL?

So, one big question we also get asked a lot is: okay, this provides a DataFrame API in Python, but how about SQL? One person actually told me that he thinks the biggest obstacle to Ibis adoption is that people just don’t want to use a DataFrame API or Python, and I can certainly see some truth to it, because a lot of people simply like SQL, and SQL is pretty versatile. I come from the machine learning world, where a lot of data scientists and machine learning engineers tend to prefer Python and don’t know a lot about SQL, but even there we’re seeing a lot of adoption of SQL by machine learning engineers and data scientists. However, SQL also has a lot of challenges, and we could have a long discussion about it, because for one thing: what exactly does SQL mean? There are many different SQL flavors, and every backend can be slightly different; we know that because we support many different backends, and SQL is always a little bit different in each one. But regardless of which one is better, SQL or DataFrames, I do believe they are both great for different use cases, and Ibis actually supports both: you can write SQL within Ibis as well, if that’s something you really want to do.

Okay, that’s it from me. Thank you so much for attending my talk. Here is the GitHub link for Ibis if you want to check it out. I’m also on Twitter and LinkedIn if you want to follow up with any questions.