May 3, 2024

Flight as a Service (FaaS) for Data Pipelines: Combining fast Python functions with Dremio through Arrow

As the lakehouse architecture drives data access, new users need support for multiple use cases and runtimes to facilitate them. Leveraging Dremio for Flight streams and Arrow tables, we explain how we built a fast, serverless Python runtime for data pipelines mixing SQL and Python. The result is a developer experience that combines ease of use with unlimited use cases.

If multi-language pipelines live in our own cloud, we have the advantage of a high-trust, highly secure and cost-effective environment: through the virtues of open formats and the composable data stack, the entire organization can now access a consistent – language independent – data representation through a shared Nessie catalog.

Topics Covered

AI & Data Science

Governance and Management

Sign up to watch all Subsurface 2024 sessions

Speaker

Jacopo Tagliabue

Founder, Bauplan Labs

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Jacopo Tagliabue:

Thanks, everybody, for attending. Thanks, Daniel, for organizing this fantastic conference. Let’s go into the talk without further ado: I have a lot of ground to cover and not so much time. But thank God, you can watch this later on if I talk too fast. So, hi, everybody. I’m Jacopo. I’m a student entrepreneur now building Bauplan. I’m a professor of machine learning and I’ve been doing open source and open science for quite a while now. And I’m very passionate about the intersection of data preparation and downstream applications, like, for example, machine learning, data science, and AI, which are an integral part of the modern way of dealing with data. And it’s gonna be the main focus of today’s talk. Bauplan is a new company in the lakehouse space.

We’re very proud of our backers, including some of the best Silicon Valley venture capitalists and executives and founders at Voltron Data, Brady Stalker, Cloudera, Stanford, and so on and so forth. If you want to know more about the company and our roadmap aside from what we’re gonna present today, please reach out to us anytime.

So what are we gonna talk about today? We’re gonna make a few simple points for you to take away. They’re based on our decade of experience dealing with data in all shapes and forms, from garage startups to public companies. We do truly believe that the open lakehouse is the new paradigm for working with data. And we want to discuss what the lakehouse actually entails when you start understanding all the personas and all the use cases involved. So what we want to talk about today is, first, that the lakehouse is a no-language-fits-all situation. If you’re really passionate about democratizing data and giving your organization more access to actionable insight, there’s just no one language that is gonna be good at everything. So multi-language is not just nice to have: it is a necessity if you really want your investment in the lakehouse to pay off.

Then we’re gonna show you the challenges and opportunities in adding Python functions to your Dremio-based pipeline. If you’re attending this conference, you may already be using Dremio or you’re considering using Dremio. We use Dremio and it’s amazing. But of course, Dremio covers just one of the languages that we should support as part of the lakehouse. And today we’re gonna see what it takes to add Python functionality in what we call the naive way, and then what it means to do that in a full cloud platform that has been designed for that.

And finally, we’re gonna go back to the centrality of the catalog. This was also touched upon yesterday in the keynote. We want to stress the fact that multi-language and multi-persona doesn’t mean chaos, right? Actually, even if you use something that is not SQL, you can still leverage the same centrality, the same semantic layer that Nessie and Arctic actually provide.

The Lakehouse

So everybody here knows the lakehouse. Hopefully you know the original paper that spread the concept. To put it very simply, you have all your data in your object storage in an open format, let’s say Iceberg, for example, and you want to do a lot of things with this data. For example, you may want to build data pipelines. You need to produce dashboards and reports for executives, or embedded analytics for the product. You may want to do forecasts and insights, so we have data scientists that want to do that. And once you start understanding that this common foundation in object storage is mirrored by a plurality of use cases, you need to also recognize that multiple personas are actually working on this same data foundation. In particular, you may have a data engineering team that is the one tasked with building data pipelines, let’s say going from raw data to semi-aggregated data, refined data, transformed data, so that the downstream applications can be easier to actually run and develop.

Then you may have the typical business analyst, right? The person who knows a lot about the business, knows perhaps mostly SQL, and is in charge of doing the last mile of this kind of data workflow, right? For example, they are the one doing the final queries that then go on and produce a dashboard. And then, of course, you have the data scientists, right? Some forward-looking organizations don’t limit themselves to reporting on the past or describing the present. They also want to predict what the future entails. And to do that, you need to collect the data, clean the data, prepare the data, but then you need to train machine learning models and AI models to try to figure out what’s gonna happen next. And when you go and talk to these people, to data engineers, to business analysts, to data scientists, you’re gonna find out that they are profoundly different in their choice of languages, in the things they like to do, and in the paradigms they actually employ when creating business value. For example, if you start from the center, business analysts are typically very comfortable with SQL: creating aggregations, window functions, joins, and all of that. But if you go one step left, so if you shift left, data engineers are more concerned about plumbing together different intermediate tables, cleaning them up, enriching data through APIs, perhaps doing some text-heavy manipulation, especially now with Gen AI. And they may use SQL for sure, but also other languages. For example, JVM-based languages like Java and Scala, or of course Python, which is becoming the de facto standard for data work. And of course, this is the same if you go to the opposite side, to the right.

Data scientists, who cover the more advanced use cases for prediction, tend to use Python a lot, both by training and by convenience. So while we recognize the power of the lakehouse and why the lakehouse is better than the data lakes of 10 years ago and the data warehouse lock-in of five years ago, we also need to understand that the true power of the lakehouse lies in multi-language support. And if you’re familiar with SQL, of course, you know that SQL is great for aggregation, last-mile exploration, and some very simple transformations. But on the other side, you may know, especially if you’ve done this for a while, that it is actually very hard to use for other common use cases when you deal with data. It takes three minutes of Googling to find dozens of articles, like the one that I posted here, about the fact that Python is becoming the de facto language for data workloads, both before dashboards, as we said, in data engineering and data pipelines, and after dashboards, let’s say for training machine learning models, doing forecasts and so on and so forth.

In particular, Python is very good when you need to call an external service. Let’s say you’re processing some geolocation data and you want to call a weather API. Python is good for data science and everything involving linear algebra or similar tools for predictions. So if you need to do something that is a bit mathy or statistical, Python has fantastic support for those types of libraries, while SQL is much more limited in that sense. And of course, there are many, many cases in which working with data doesn’t take the form and the shape of a declarative query, right? SQL is a declarative language. That is what makes it amazing, because something can optimize it for us, like, for example, the Dremio engine, but it also makes it very, very hard to express other things that are more naturally expressed with imperative-style workloads. And for that, Python is really hard to beat.

So the question is never, what is the best language? There’s no such thing as the best language or the only language. There’s the best language for what you’re trying to do right now. And again, any effective lakehouse cannot be built without recognizing the importance of the multi-language paradigm.

A Simple Scenario

To quote Shakespeare, or maybe a slightly altered version of Shakespeare, there are actually more things in data than are dreamt of in your SQL. So if you want a complete end-to-end picture, a workflow that actually works for your company, you’ll need to find a solution for Python users, no matter what. And the question is, what is the best solution? If you already like the lakehouse, if you already like Dremio as a SQL engine, what’s the best solution out there? And to introduce what we think is a good solution, we’re gonna start exactly from a Dremio blog post of a year ago, something like that, that you can easily find on Google if you want to while I’m chatting.

So you can actually scroll down and follow that with me, or you can do that asynchronously after this talk. What is this scenario? What is this use case about? What are the Dremio people trying to convey here? What it describes is a very common end-to-end workflow with data. You start with some data in S3, and you want to ingest it into the lakehouse, and up until here, I guess, so far so good. Then you’re gonna use Dremio as a SQL engine to query the rows that we need to train some machine learning model, and then you’re gonna have some Python code with some machine learning library, like XGBoost in this case, that you’re gonna use to create some models and some trees describing the data and trying to predict what’s gonna happen in the future. It’s a very straightforward and standard data pipeline for those of you that have worked with data science or machine learning before. And the article is very well-written, so it’s very easy to follow.

So what does the pipeline look like at first glance? What are we gonna model here? There are a few steps, and I’m gonna go through them. Again, you can go through them yourself if you have the browser open, or you can just follow with me right now. The first step is loading the data into the lakehouse. It’s very easy to do with Dremio. You can do that through the Amazon S3 connector. And then there’s a curation step, right? The curation step involves saving the data set that we just loaded with some name, dropping some columns, just refining a bit, with the UI, the data we have. And then finally, we get into the meaty part, the business logic of this use case, right? So we have a first step, which is data retrieval. In this case, it’s a simple select star, but you can imagine this being as complex or as simple as the use case requires. And of course, this SQL query gets executed by Dremio. And then finally, there’s some Python workflow, right? I just copied here one of the examples in the blog post. It uses a library called lightgbm. And this is the Python code that actually trains a model. And you can see here the printout of the terminal. So the model gets trained. And it gets better and better the more you train, OK?

How to Connect and How to Run at Scale?

So we have two questions right now. First, how do we connect the Dremio step, the SQL step, to the ML workflow in Python? These two steps are obviously logically cohesive: they’re consistent, they come one after the other, and they cannot live in isolation. Python cannot live without the data from SQL. But SQL cannot do the use case that Python is solving here. So how do we connect these steps? And second, how do we run them at scale? By scale, I mean, how do we go from something that runs as an experiment on my computer, a prototype, to something that runs, I don’t know, every day at 3 PM and powers some really important use cases for my company? Because every day, we need those predictions to be ready.

So how do we solve these two problems? Let’s start with the easiest thing, the first thing that comes to mind, which is: I open my laptop. My laptop has some Python environment somewhere. I have my Dremio account. I have my Dremio software in the cloud. I did the ingestion from S3. And what I’m going to do is just connect my laptop locally to Dremio. And then I’m going to run Python on my laptop. That’s the naive way, but it works. And that’s, of course, the way in which the blog post is staged, right? And it’s all well and good. If you have a couple of hours to spare, I would actually encourage you to go and try to reproduce the blog post yourself, right? But what is the catch? Well, there are a few things that make this non-ideal, both from the perspective of the data practitioner who just gets thrown at this problem (imagine your boss telling you, hey, you need to go and create this machine learning model and make it work: what do you do?) and from the perspective of production: if this actually works, how do we go from something that works on my computer to something that actually works for everybody?

So problem number one: if you’ve never worked with Arrow Flight, there’s no standard way for you to query Dremio in Python as of today. So there’s a bit of learning there for the user to be able to actually do that. It’s not impossible. There’s actually another very good blog post by Alex that explains how to do it. But it’s still one more hurdle that you need to clear. Second, it’s a bit slow. The fact that you’re downloading data from Dremio, which queries Iceberg on S3, and then shipping it to your laptop makes it slow, because the bandwidth of your laptop is only so good: it’s going to be, I don’t know, 300 megabits per second or something like that. And it’s going to be expensive because you’re going to pay an egress cost. So especially when you try to iterate, like, you build a model, then you change the model, then you do it again, and then you try a different version, this may actually become a problem for your feedback loop.
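To make the first hurdle concrete, here is a minimal sketch of what querying Dremio over Arrow Flight from Python might look like with the pyarrow.flight client. It is not the code from either blog post: the helper names (dremio_flight_uri, query_dremio), the host, and the credentials are hypothetical, and it assumes a self-hosted Dremio that exposes Flight on its default port 32010 with username and password authentication.

```python
def dremio_flight_uri(host: str, port: int = 32010, tls: bool = False) -> str:
    """Build a Flight endpoint URI (32010 is Dremio's default Flight port)."""
    scheme = "grpc+tls" if tls else "grpc+tcp"
    return f"{scheme}://{host}:{port}"


def query_dremio(sql: str, host: str, user: str, password: str):
    """Run `sql` on Dremio and return the result as a pyarrow.Table."""
    # Imported lazily so the URI helper above works without pyarrow installed.
    from pyarrow import flight

    client = flight.FlightClient(dremio_flight_uri(host))
    # Basic-auth handshake; returns a (header, value) pair with a bearer token.
    token = client.authenticate_basic_token(user, password)
    options = flight.FlightCallOptions(headers=[token])

    # Ask the server to plan the query, then fetch the resulting stream.
    descriptor = flight.FlightDescriptor.for_command(sql)
    info = client.get_flight_info(descriptor, options)
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_all()  # Arrow table; .to_pandas() gives a DataFrame


# Hypothetical usage (needs a reachable Dremio instance):
# table = query_dremio("SELECT * FROM my_table", "localhost", "user", "secret")
```

Even in this short form, the user has to know about descriptors, tickets, endpoints, and auth headers before getting a single row back, which is the learning curve described above.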

Third, you have the “works on my computer” syndrome. What does it mean? It means that even if this works on your computer, even if it’s slow and not super scalable or whatever, once you need to make it work for everybody else in your company, there’s a plethora of tools and things that you either need to know or need somebody else in your company to help you with, to make it work on a schedule. You need to know what an orchestrator is. You need to check the code into GitHub, in the same place where you check in SQL and Python. You need to know how Kubernetes, Nomad, or any other Docker orchestrator works. And of course, you need to containerize your code and ship it to, I don’t know, ECR or any other available repository. And finally, it’s one-way only. What does it mean? If you run the blog post end-to-end, what you’re going to find at the very, very end is that whatever you do in Python stays in Python. So if you produce your predictions, nobody else who is connected to the same Dremio account and the same Arctic catalog will ever know that you ran those predictions. There’s no built-in way for you to basically go back to the same centralized place, the same place where the rest of the semantic layer lives, so that other people can benefit from whatever you did.

The Solution

So this is a good start. And it got us thinking. But obviously, this is not the best solution we can come up with. So what is the solution here? The way in which we think about this problem is that we should never leave the cloud to begin with. The entire premise of Bauplan is that people develop locally because the cloud is complex, slow, cumbersome, and expensive. But it doesn’t need to be. If the cloud were as easy to use as a simple command on your computer, it would be much, much better to do all of this in the cloud. And this is what we’re going to see in a minute.

So how does it actually work in Bauplan? This is a slightly simplified version of the code that you find in the blog post, just for pedagogical purposes. You’re going to have your SQL query, again, that you can run in Dremio. What you find here is a totally normal SQL query. And there’s a bit of a comment on top of the SQL file which tells Bauplan that we need to use Dremio to run this query. And you can imagine, just by looking at the syntax, that, of course, different engines are possible. Then you’re going to write your code in Python, in another file, in your IDE. And the only thing that you need to do is make sure that your Python functions are decorated with the special Bauplan decorator. You see here an example of the Bauplan decorator. Again, even if you’ve never seen this syntax before, it should be fairly straightforward to understand what it does. This decorator tells Bauplan that this function, which we called data science, kind of like a shortcut for “hey, put your data science code here”, uses Python 3.11. And then it needs one package, which is called pandas, a very famous package for data wrangling. So you write your code in your IDE, whatever you want: PyCharm, Vim, or Visual Studio Code, which I use myself. And then the only other thing that you need to do to run all of this is to type Bauplan run in your terminal. And we’re going to see what that looks like in a second.
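For readers who have not seen this pattern before, the following is a minimal sketch of how a decorator can attach runtime requirements (a Python version and a package list) to an otherwise normal function. It is an illustration only and not Bauplan’s actual decorator: the names runtime, RuntimeSpec, and runtime_spec, as well as the pinned pandas version, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RuntimeSpec:
    """Requirements a platform would need in order to build a container."""
    python_version: str
    packages: Dict[str, str] = field(default_factory=dict)


def runtime(python_version: str, pip: Optional[Dict[str, str]] = None) -> Callable:
    """Hypothetical decorator: record the requirements, leave the function as-is."""
    spec = RuntimeSpec(python_version, dict(pip or {}))

    def decorate(fn: Callable) -> Callable:
        fn.runtime_spec = spec  # metadata a platform could read before running
        return fn

    return decorate


@runtime("3.11", pip={"pandas": "2.2.0"})  # version pin is illustrative
def data_science(rows):
    # Placeholder for "put your data science code here".
    return len(rows)


print(data_science.runtime_spec)
```

The design point is that the function body stays plain Python, while the environment it needs is declared next to it rather than in a separate Dockerfile or requirements file.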

But before we go to the demonstration, I wanted to let you know what you’re going to see. The first thing you’re going to see when you start Bauplan run is that the system is going to package your code, the one that you just wrote. It’s going to send it to the cloud. It’s going to tell Dremio, the account that is connected with Bauplan, your Dremio, to execute the query. It’s going to open a Flight stream that’s super fast, with no serialization cost, and that will move data from Dremio to the Python runtime in the cloud. So no egress cost, and super fast. And Bauplan will build a container that contains the dependency that you just specified, in this case pandas, and continue to run your data science code without you having to do absolutely anything. And all of this will run in the cloud while streaming back all the information in real time, so that it looks like you’re running on your own laptop.

Live Demo

To see what it means, we recorded a short video. Let me start it here so you can see what the experience would look like. So this is my Visual Studio. This is the SQL query and the engine, Dremio. And these are Python functions that, again, if you look at them, are basically the same Python functions that you would find in the tutorial, in spirit. And everything you find in the Python file is just normal Python. Then you do Bauplan run. The system, you see, will print out the DAG. You see there are like six containerized functions that actually execute all the code. It’s going to go to Dremio. You see, it’s already done. You see that? That is the preview, the first five rows of the query that Dremio executed for us. And then Bauplan is executing all the other containerized functions, containing, training, data wrangling, and all of that, one by one for you in the cloud. But it’s so fast, and it is telling you every second what it’s doing, that if I didn’t tell you this was actually happening completely in the cloud, you probably would have no idea.
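The DAG printed in the demo can be pictured as a dependency graph in which the SQL step feeds the Python steps. The sketch below uses Python’s standard graphlib module to derive an execution order; the step names are hypothetical and do not correspond to the functions shown in the video, and this is not how Bauplan itself schedules work.

```python
from graphlib import TopologicalSorter

# Each key lists the steps it depends on. Step names are made up for this sketch.
dag = {
    "dremio_query": set(),                     # SQL step, executed by Dremio
    "clean_data": {"dremio_query"},            # Python step reading the query output
    "train_model": {"clean_data"},             # Python step training the model
    "predict": {"clean_data", "train_model"},  # Python step producing predictions
}


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(" -> ".join(order))
```

Because every step here has exactly one valid position, the order is fully determined: the query runs first and each Python function runs only once its inputs exist.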

So in this particular case, we trained, I think, a model on something like 20 million rows with XGBoost, including a SQL query and five containerized Python functions, end-to-end in something like 20 seconds. And most of the time, of course, is the training of the model, which has nothing to do with the platform. The rest of the work, including containerization, data movement, spinning resources up and down, and connecting to Dremio, is done automatically for you by the platform. You don’t need to know absolutely anything. And this is done in the most straightforward and streamlined way possible.

Why This is a Better Way

So now that you’ve seen it working, why do we believe this is a better way to do this? Well, if you remember what was bothering us with the local setup, none of those problems are here anymore. In particular, when running Bauplan, you don’t need to know anything about Flight, about connecting to Dremio, or about how different functions talk to each other. It’s super fast, because now, by running things next to Dremio and next to our data, we can exploit the bandwidth of AWS instead of being bottlenecked by my own bandwidth. It’s also cheap, because transfer between these two is actually free. It’s already a cloud, containerized Python environment. What does it mean? It means that if you want this to run on a schedule, you can just put the Bauplan function that you saw here on whatever orchestrator you want, and you don’t need anybody else in your organization to teach you how to use Docker, Kubernetes, or anything else. All of this will work in the cloud, because it was never running on your computer in the first place. So you’re never going to run into the problem of, oh, it worked on my computer, but now I have no idea how to share it with all my co-workers.
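The bandwidth argument is easy to check with back-of-the-envelope arithmetic. In the sketch below, the 300 megabits per second laptop figure comes from the talk, while the 10 GB dataset size and the 10 gigabits per second in-cloud link are assumptions chosen only for illustration.

```python
def transfer_seconds(size_gb: float, bandwidth_mbps: float) -> float:
    """Seconds needed to move size_gb gigabytes at bandwidth_mbps megabits/s."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / bandwidth_mbps


# Assumed dataset size: 10 GB.
laptop = transfer_seconds(10, 300)        # 300 Mbit/s, the figure from the talk
in_cloud = transfer_seconds(10, 10_000)   # assumed 10 Gbit/s link inside the cloud

print(f"laptop: {laptop:.0f}s, in cloud: {in_cloud:.0f}s")
```

Under these assumptions a single download takes several minutes on the laptop and a few seconds in the cloud, and the gap is paid again on every iteration of the model.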

And finally, and I don’t know, perhaps more importantly if you think about this as a centerpiece of a lakehouse, it is a two-way communication. In fact, everything you write in Bauplan, including Python, data science, predictions, whatever you want, is written back to the same catalog that Dremio was reading from when running the first SQL query. What does it mean? It means that everything that you do in Bauplan is available to the rest of the organization, in the same semantic layer, in the same Dremio, in the same UI where everything else was already available. So this is a full superset of what you were able to do in SQL alone, but you get the same governance, the same visibility, and the same semantics when moving to a completely different language, without losing absolutely anything. You’re gaining a new way to express yourself and way more functionality, especially in the data engineering and data science world, without losing anything of what makes the centralized catalog and the open lakehouse approach appealing.

Summing it Up

So summing it up, we do believe that Bauplan is, right now, the easiest way to run multi-language, catalog-centric data pipelines with Dremio. Multi-language because you can pick and choose SQL and Python. You can even intertwine them and do some steps in SQL, some steps in Python. And it’s catalog-centric because, very importantly, everything that happens in Bauplan only refers to a common shared understanding of what tables and views are available in the entire organization.

And, of course, you may wonder: well, what about the fact that the original blog post started with two manual processes that were point and click? Which, of course, is great for a blog post, and it’s great for people to dip their toes in. But if you’ve ever worked in data engineering or data science at scale, you know that manual processes are never very good because they don’t work in production. So the next question you would ask is, can we get rid of these two manual processes and make the same type of experience as easy as the one we just saw? Can we put this same problem in the CLI? And while I don’t have time to show you the full story today, I can definitely tell you that it’s possible.

In particular, Bauplan supports a CLI and a Python SDK for all the common use cases you may have in a lakehouse. You want to branch out your data lake? It’s one command. You want to run a SQL query? It’s one command. You want to import your data from S3? It’s one command as well. It takes less than 20 minutes to learn the entire lakehouse because the entire lakehouse has been kind of squeezed into four commands that you can easily learn. I am at the end of my time, and I hope this was entertaining and informative for you all. If you want to learn more, especially if you want to do Python with Dremio, please reach out to us. If you’re curious about the scholarly research that’s behind all of this innovation, please check out our paper. If you want to see some of our open source contributions, check out our GitHub. And if you want to connect, just connect on LinkedIn.