Subsurface Summer 2020
Functional Data Engineering - A Set of Best Practices
Batch data processing (also known as ETL) is time-consuming, brittle and often unrewarding. Not only that, it’s hard to operate, evolve and troubleshoot.
In this talk, we’ll discuss the functional programming paradigm and explore how applying it to data engineering can bring a lot of clarity to the process. It helps solve some of the inherent problems of ETL, leads to more manageable and maintainable workloads and helps to implement reproducible and scalable practices. It empowers data teams to tackle larger problems and push the boundaries of what’s possible.
Maxime Beauchemin, CEO and Founder, Preset
Data engineer extraordinaire and open source enthusiast. Creator of Apache Superset and Apache Airflow. Maxime Beauchemin is the CEO and founder of Preset, a company offering hosting and solutions around Apache Superset.
All right, let's go ahead and welcome everyone. Thank you for joining us today. We have Maxime Beauchemin, CEO and Founder of Preset. Welcome, Maxime. Thank you so much for joining us. For those of you who are just getting into our session, there will be a Q and A at the end. And if we are not able to get to your question, please make sure to join the Slack community, and we will be able to have Maxime for an hour, and he can answer your questions there. So with that, it is top of the hour. Take it away, dive in, Maxime.
All right. Thank you so much. So my name is Max, and today I'm talking about functional data engineering and a set of best practices that are related to this topic. I'm going to be drawing some parallels between functional programming and this approach to data engineering. Cool. So a little bit of context for the talk. So before I jump in and start talking about functional programming and how it relates to data engineering, I wanted to give some context. The first item is about me. So a tiny bit about me. At this point in my career, I think I'm best known for being the original creator of Apache Superset, which is a data visualization and exploration platform, and Apache Airflow, which is a workflow orchestrator for batch jobs.
Recently, I started a company called Preset. So it's already been a year and a half or so. And Preset is offering hosted, improved Apache Superset as a service. A little bit more on that later. I come from a bunch of data-driven companies. So last I was at Lyft. Before that, I was at Airbnb, Facebook, Ubisoft and Yahoo. So I've been working as part of modern data teams that have been using data lakes for quite some time now. So that's the little bit about me. A little bit about Superset, so it would be hard to talk about me without talking about Apache Superset.
Superset is a data visualization, exploration, and consumption platform that I started while I was at Airbnb back in 2015. And recently, with Preset, we're putting a lot of steam behind the project. And the project is really accelerating in all sorts of ways. So if you haven't checked out Apache Superset or you haven't checked it out in a while, I would really urge you to go and check out the project. I'll be on Slack later too. So if you have questions related to Superset or data visualization, I'll be happy to talk about that. But the topic today is not that. It's functional data engineering.
Cool. So more context on the talk. So, this talk fits kind of on top of a blog post that I wrote about two years ago, I think, called Functional Data Engineering: A Modern Paradigm for Batch Data Processing. So you can kind of think of that blog post as supporting material for this talk today. So, if you want to learn more or kind of digest it in a more blog post type format, you can revisit this later. That's the third of a little bit of a trilogy of blog posts that I wrote. One was The Rise of the Data Engineer. So that's already four or five years old, but I think it's still very relevant today. So it's really talking about the role of the data engineer, what they do, what this is all about, and where they fit in as part of modern data teams. And I followed it with The Downfall of the Data Engineer, which talks more about what's really hard and challenging about being a data engineer.
So, a few blog posts for more context. Also, context on this talk is that this is not all new. So first, this talk is a recycled talk from about two years ago, and I'm going to update it in all sorts of ways and really put more of the data lake centric aspect and perspective here today. But also, the methodology that I'm talking about today, which is applying some concepts from functional programming to data engineering, is not new at all. When I joined Facebook in 2012, they were already heavily using a data lake and the methodologies that I'll reference as part of this talk.
Cool. So functional programming. So before I get into data engineering, I just want to do a very short crash course, and a refresher for people that are already familiar with functional programming, to set the context. And in the second phase, I'm going to talk more directly about the parallels between the two, right. So let's start with functional programming. So without reading the Wikipedia definition here that I have, I want to just point out that functional programming is a different paradigm.
So it's a programming paradigm as opposed to, say, object-oriented programming. So it's a different way to organize and structure your code. And essentially, it's an approach that puts the function at the foundation of the way you structure and organize things. I'll talk about three core principles of functional programming. One is the idea of pure functions. So in functional programming, you author functions that are pure.
And what does that mean? That means that those functions don't have side effects. So that means that if you give the same input to the function, you're guaranteed to get the exact same output. This is nice because it makes it easier to reason about these functions. They can be easily unit tested, and it brings kind of clarity to the process in general. So here I have a little bit of an example. Probably the simplest code I could write to demonstrate a pure function versus an impure function. So you can tell very easily that the pure one, the pure add one function, is always going to give the same result. And in the case of the second one, the impure add one, you know that you're going to get different results every time you call it, regardless of the parameters that you pass, or you don't pass to it. When you think about it, object-oriented programming, as part of its core, the idea is to write [inaudible 00:07:19] functions that will mutate your objects, right. So clearly, it's a completely different approach.
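The slide itself is not part of this transcript, so the following is only a minimal sketch of the contrast being described. The function names `pure_add_one` and `impure_add_one` and the module-level counter are assumptions made for illustration, not the speaker's code.

```python
# Hypothetical reconstruction of the pure vs. impure contrast; names are illustrative.

def pure_add_one(x: int) -> int:
    """Pure: the result depends only on the argument, and nothing else is touched."""
    return x + 1


_calls = 0  # hidden state that lives outside the function


def impure_add_one() -> int:
    """Impure: reads and mutates hidden state, so every call returns something new."""
    global _calls
    _calls += 1
    return _calls


same_a = pure_add_one(1)
same_b = pure_add_one(1)       # same input, same output, every time
first = impure_add_one()
second = impure_add_one()      # no input changed, yet the result differs
```

The pure version can be unit tested with a single assertion, while the impure one can only be tested by also controlling the hidden counter.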
Now, talking about immutability. So another fundamental concept in functional programming is that your objects are thought of as immutable. So once you assign an object or a piece of data to a variable, this piece of data won't change unless you assign it to another variable. So this can create... changes the paradigm in all sorts of ways and changes the patterns that you're going to use on top of it. But it's a nice guarantee that you have as a programmer. You'll see that once something has been assigned in a certain scope, it won't change. And that provides a set of guarantees that you can build upon. Now, there's the concept of idempotency, which is very relevant to data engineering. I'll draw some parallels later. But idempotency, to read this quick definition or part of it, is "the property of certain operations" such that "they can be applied multiple times without changing the result beyond the initial application."
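Both ideas can be sketched in a few lines. This is an illustrative example under assumed names (`User`, `move_user`, `set_status`, `append_event` are all hypothetical), using a frozen dataclass to stand in for an immutable object.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class User:
    """Immutable value: attributes cannot be reassigned after construction."""
    name: str
    country: str


def move_user(user: User, country: str) -> User:
    # Returns a new object; the original is left untouched.
    return replace(user, country=country)


def set_status(state: dict, status: str) -> dict:
    # Idempotent: applying it once or ten times yields the same state.
    return {**state, "status": status}


def append_event(state: dict, event: str) -> dict:
    # Not idempotent: every application grows the list of events.
    return {**state, "events": state.get("events", []) + [event]}


original = User("max", "CA")
moved = move_user(original, "US")

once = set_status({}, "done")
twice = set_status(once, "done")

one_event = append_event({}, "click")
two_events = append_event(one_event, "click")
```

Here `once == twice` holds, while `one_event` and `two_events` differ, which is the property that matters when a job gets rerun.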
So it's this idea that if you rerun a job or a function given a certain state, if you do this idempotent operation, it will bring you consistently to that same state, right. So it provides some really nice guarantees that are super nice to have for different reasons from the data engineering perspective. Now, before jumping to functional data engineering, I want to say I exposed here some really core principles or some core ideas behind functional programming. I think what's really interesting about functional programming that I didn't talk about is the patterns that emerge based on these constraints and guarantees. And I won't talk about that today because I could be talking about that all day. And then, we're pivoting to talking more about data engineering and how some of these concepts can be brought into this world and create some clarity. By the way, I think I realized later on that... so I think my history around this topic is, I was doing data engineering for a very long time.
I was kind of applying the functional paradigm to data engineering without necessarily realizing it, until I started learning about functional programming down the line in my career and kind of put the two together and was like, "Oh, well, these two things, that's kind of... I've been applying a lot of these principles for a while." So I kind of built this idea of the link between the two as I was progressing on both sides, as a software engineer and as a data engineer. So I want to talk briefly about reproducibility as something that's really important in data engineering and in data processing in general, right. Reproducibility is foundational to the scientific method, right. If you can't reproduce results consistently, you haven't made progress, or science has not made progress.
Reproducibility is critical from a legal standpoint, right. If you have an audit and you're publishing some numbers, whether you're a public or private company, and you put some data forward and you make important decisions based on this data, it's important to be able to explain how you got there, and the only way to do that is by being able to reproduce the same results as you got before. More fundamentally too, I think reproducibility is critical from a sanity standpoint, right. If you're a data engineer working all day, running jobs, authoring jobs, troubleshooting jobs, and you cannot have this guarantee that by re-running a job you'll get to the same results, you're just going to go crazy. So I think it's important from that perspective. And the idea here is that with the functional approach that I'm going to be talking about today, generally, if you play by those rules, you can guarantee that reproducibility.
So I talked about immutability and variables on the functional programming side. Now I'm going to talk about immutable partitions as the atomic block of computation or of data in a data lake, right. So if you use HDFS or S3 and something like Hive, Presto, or Dremio, these databases operate on a lake and operate very much at the block level. And by block, I'm thinking about something like a partition, right. A partition is the atomic unit of mutation of a table in one of these modern databases that work on a lake, usually with Parquet files or ORC files. So, one recommendation around how to implement functional data engineering is to systematically partition all tables. So you don't want to mutate your tables. You want to append new partitions to your table, right.
You think of your table as something that is made out of blocks. And consistently, as you process your data, you would be adding on new partitions. If you need to change the source data or the nature of a computation, you will have to reprocess your data and then mutate the partitions that are affected by the change. But in general, it brings a lot of clarity to partition the tables and operate at that level. There's this idea too of having one task or one job in your batch processing framework that lines up with one partition. And that brings a lot of clarity.
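As a toy model of "a table made out of blocks", the sketch below represents a table as a mapping from partition key to rows, with one task run building exactly one partition. The names `load_partition` and `daily_signups` and the sample numbers are assumptions for illustration; a real lake would hold files per partition rather than in-memory lists.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = Dict[str, List[Row]]  # partition key such as "2020-07-01" -> rows


def load_partition(table: Table, ds: str, compute: Callable[[str], List[Row]]) -> None:
    """One task instance builds exactly one partition and sets it as a whole."""
    table[ds] = compute(ds)


def daily_signups(ds: str) -> List[Row]:
    # Stand-in for a real query restricted to the partition date.
    fake_source = {"2020-07-01": 3, "2020-07-02": 5}
    return [{"ds": ds, "signups": fake_source.get(ds, 0)}]


signups: Table = {}
for ds in ["2020-07-01", "2020-07-02"]:
    load_partition(signups, ds, daily_signups)  # new days add new partitions

before = dict(signups)
load_partition(signups, "2020-07-01", daily_signups)  # a rerun touches one block only
```

After the rerun, the table content is unchanged and the `2020-07-02` partition was never touched.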
There's this idea, too, that if you are writing once and reading multiple times, which is typical of data warehousing and analytics type workloads, then you can pay that price in a more expensive way where you create something like a Parquet file or an ORC file, something that is intended to be fast on read. Maybe a little bit more expensive to write, but cheaper to read as a result. And when you think about your ETL, conceptually, well, you might think of your ETL and your batch processing workflows as a lineage of tables, like tables kind of flowing into other tables. Then you can start thinking about your ETL as a DAG of partitions, right.
So each partition has its own lineage pointing to other partitions as kind of demonstrated in the diagram here. So I'm not going to go too deep in this diagram here. But having this idea of having a low complexity score for each partition brings a lot of clarity because you know where the data is coming from, and you know that you can only reprocess the chunks that you need to reprocess without having to mutate entire tables.
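The diagram is not reproduced in the transcript, but partition-level lineage can be sketched as a small graph. In this hypothetical example, the table names and the `lineage` mapping are made up; `downstream_of` walks the graph to find which partitions would need reprocessing after one source partition changes.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Partition = Tuple[str, str]  # (table name, partition date)

# Hypothetical lineage: each partition lists the upstream partitions it is built from.
lineage: Dict[Partition, List[Partition]] = {
    ("fct_bookings", "2020-07-01"): [("raw_events", "2020-07-01")],
    ("fct_bookings", "2020-07-02"): [("raw_events", "2020-07-02")],
    ("agg_daily", "2020-07-01"): [("fct_bookings", "2020-07-01")],
    ("agg_daily", "2020-07-02"): [("fct_bookings", "2020-07-02")],
}


def downstream_of(changed: Partition, graph: Dict[Partition, List[Partition]]) -> Set[Partition]:
    """Return every partition that transitively depends on `changed`."""
    children: Dict[Partition, Set[Partition]] = {}
    for part, parents in graph.items():
        for parent in parents:
            children.setdefault(parent, set()).add(part)

    seen: Set[Partition] = set()
    queue = deque([changed])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


to_reprocess = downstream_of(("raw_events", "2020-07-02"), lineage)
```

Only the two `2020-07-02` partitions come back, so the `2020-07-01` blocks stay as they are.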
Great. So now pure ETL tasks, right. So this is akin to the pure functions in functional programming. It's this idea that when you write an ETL job, whether it's an Airflow job or a Luigi job, or a Dremio kind of transformation package, you want these jobs to be idempotent. So that means if they fail halfway or if you need to rerun them for one reason or another, you can rerun them. And that's a great guarantee to have for the operators of the distributed processing system, right. Or even for the data engineers that will say, "I didn't get quite the result that I wanted. I need to change the source data or the computation. Now I know I can rerun the task and get to the same state."
They're deterministic, right. So that means given the same input, in this case partitions, these tasks will output the same partitions. They have no side effects. So that means you're not adding into counters. You're not appending. You're not deleting. They usually target a single partition. So I think that's a really core principle that makes reasoning about your batch data processes easier. Having a clear "this task loads into this table, and this task instance loads into this partition" brings a lot of clarity to the process.
And when I said we don't do mutations: when you think about DML, Data Manipulation Language, statements like UPDATE, UPSERT, APPEND, DELETE, these operations are mutational by definition, by nature. So what we generally do, and what we recommend doing with this approach, is to always do an INSERT OVERWRITE on the partition. So you're always inserting a new partition, or you're rewriting the whole partition. You won't go and DELETE, APPEND, or insert just a new row into the partition, right. And that fits nicely with the idea of a data lake where the unit of atomicity of mutation is not a row, like in a traditional OLTP database, but much more a block of data or a partition. And then, generally, it's a good practice too, to have tasks that limit the number of source partitions that they scan.
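To make the overwrite-the-whole-partition idea concrete without a query engine, here is a rough filesystem sketch. It assumes a Hive-style `ds=<date>` directory per partition and a single CSV file inside it; the function name `overwrite_partition` and the file name `part-0.csv` are hypothetical, and the swap is not fully atomic the way a real engine's overwrite would be.

```python
import csv
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def overwrite_partition(table_dir: Path, ds: str, rows: List[Dict[str, object]]) -> Path:
    """Replace the whole partition ds=<ds>; individual rows are never edited in place."""
    table_dir.mkdir(parents=True, exist_ok=True)
    target = table_dir / f"ds={ds}"

    # Build the new partition off to the side first.
    staging = Path(tempfile.mkdtemp(dir=table_dir, prefix=".staging-"))
    fieldnames = sorted(rows[0]) if rows else []
    with open(staging / "part-0.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Swap it in: drop the old block, then move the new one into place.
    # (Simplification: a crash between these two steps leaves no partition.)
    if target.exists():
        shutil.rmtree(target)
    staging.rename(target)
    return target
```

Running the same task twice with the same input leaves exactly one partition directory with the same content, whereas an append-style write would have doubled the rows.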
I could get a lot deeper into this. Maybe we'll save some of this for the Q and A if people want to ask questions on this topic. But it goes to the complexity score. If you have a partition, but to compute this partition you need to scan a wide number of partitions, that means that if you change any of the source partitions, in theory, you have to recompute it. So you really want to limit the complexity score of your partitions and how many blocks they depend on.
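The talk does not give a formula for the complexity score, so the sketch below assumes the simplest reading: the number of distinct source partitions a target partition scans. The helper names and the trailing-window example are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterable, List


def source_partitions(ds: date, window_days: int) -> List[str]:
    """Source partitions scanned to build the target partition for `ds`."""
    return [(ds - timedelta(days=i)).isoformat() for i in range(window_days)]


def complexity_score(sources: Iterable[str]) -> int:
    # Assumed definition: count of distinct upstream partitions.
    return len(set(sources))


def targets_to_recompute(changed: date, window_days: int) -> List[str]:
    """Target partitions whose window includes the changed source partition."""
    return [(changed + timedelta(days=i)).isoformat() for i in range(window_days)]


daily = complexity_score(source_partitions(date(2020, 7, 28), window_days=1))
trailing = complexity_score(source_partitions(date(2020, 7, 28), window_days=28))
affected = targets_to_recompute(date(2020, 7, 1), window_days=3)
```

A task that reads one source partition per run scores 1, while a 28-day trailing window scores 28, and a fix to a single source day then forces a recompute of every target partition whose window covers that day.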
Maybe a parallel to functional programming here would be that you don't want to have a function that receives too many parameters, because it's harder to reason about, and it's more likely that you have to rerun it if any of the parameters change. Here's another idea that comes from more or less this idea of functional data engineering. So, assuming that you have all of your raw data, and this is what I mean by a persistent staging area. The staging area being the place in the data warehouse where you bring your raw ingredients and your raw data from your external systems, right. So that might be database scrapes or your event pipelines landing in your data lake. So the idea here is to create a persistent staging area where your data lands consistently, and where you can trust that the data there will never change, unless maybe something went wrong and you need to change it. But you have this persistent staging area that doesn't change.
And what's nice is if you have all of your computation and all of your raw data, you know that you can rebuild the data warehouse at will, right. When I was talking about reproducibility before: you know your computation, you know your raw data, you can get to the target state that you want. And that's a great thing for reproducibility. And it's generally a nice piece of foundation to have and to build upon. Also, in data warehousing 10, 15 years ago, people might argue, do you want a persistent or a transient staging area? Now that storage is so cheap, there's no reason why you shouldn't have a persistent staging area. All right. So here, this is the section of the talk. And this is a condensed version of a talk that was longer.
I believe it was a 45-minute talk. So, I'm going to have to zip through this fairly quickly if we want to have time for Q and A. And I invite people to direct me to some of these topics during Q and A. So I will brush over some of these sections quickly, and let's revisit them on demand. There's supporting material in the blog posts that I mentioned earlier. So I encourage you to look at the blog post if you want to dig deeper into these challenges. So the idea here is to talk about some core data engineering problems and how to solve them using, essentially, partitions and functional data engineering principles. So the first one is this idea of slowly changing dimensions. This is a term that data engineers may or may not be familiar with, but there has been a fair amount of literature written on this.
And there's tooling, as shown in the little diagrams there. I think we see SSIS, Informatica, and DataStage here, kind of showing how to create a workflow that deals with capturing history in a slowly changing dimension. Here, I'm not going to get too deep into this because I want to save time for Q and A. But, essentially, what you'll find in the blog post is that I describe what a slowly changing dimension is, what the traditional approaches for these slowly changing dimensions have been over time, and how to approach this from a functional standpoint. And the short story here is to snapshot the data, right. So when you load a dimension table, the idea is to create a full snapshot of the dimension. In this case, let's take an example like the user table or the supplier table.
So every day as you load up and new customers come in, or some customers change state or location, instead of mutating this table and adding multiple records to reflect the different changes in a customer's history of attributes, you would create a full snapshot every time. And what I'm showing with the SQL here is that using this very simple principle of snapshotting the data, you can easily get to: what is the latest attribute for this customer? Or what was the attribute of that entity at the time of the event? Sorry to kind of blaze through this. Another topic that is typically challenging around immutability for data engineering is late arriving facts. So, here I talk about an approach of partitioning based on event processing time. Instead of partitioning on event time, I'm partitioning on event processing time and keeping this other dimension of the event time as independent from it.
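The SQL from the slide is not in the transcript, so this is a hedged sketch of both ideas in this passage, run through Python's built-in `sqlite3` module. The table and column names (`dim_user`, `fct_purchase`, `ds`, `event_date`) are assumptions: `ds` is the daily snapshot date on the dimension and the processing-date partition on the fact table, while `event_date` records when the event actually happened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (ds TEXT, user_id INTEGER, country TEXT);
CREATE TABLE fct_purchase (ds TEXT, event_date TEXT, user_id INTEGER, amount REAL);

-- One full snapshot of the dimension per day; the user moved on 2020-07-02.
INSERT INTO dim_user VALUES
  ('2020-07-01', 1, 'CA'),
  ('2020-07-02', 1, 'US');

-- Facts are partitioned by processing date (ds). The last row is a late
-- arriving fact: it happened on 07-01 but landed in the 07-02 partition,
-- so the 07-01 partition did not have to be mutated.
INSERT INTO fct_purchase VALUES
  ('2020-07-01', '2020-07-01', 1, 10.0),
  ('2020-07-02', '2020-07-02', 1, 20.0),
  ('2020-07-02', '2020-07-01', 1, 5.0);
""")

# Latest attributes: read only the most recent snapshot.
latest = conn.execute("""
    SELECT user_id, country
    FROM dim_user
    WHERE ds = (SELECT MAX(ds) FROM dim_user)
""").fetchall()

# Attributes at the time of the event: join each fact to that day's snapshot.
as_of_event = conn.execute("""
    SELECT f.event_date, f.amount, d.country
    FROM fct_purchase f
    JOIN dim_user d
      ON d.user_id = f.user_id AND d.ds = f.event_date
    ORDER BY f.event_date, f.amount
""").fetchall()
```

The late purchase is still attributed to the country the user had on the day of the event, even though it was processed a day later.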
There's much more to talk about here. I invite you to ask questions in Q and A or on Slack, or to visit the blog post to learn a little bit more about this. Here I'm talking about this idea of self-dependency or past dependency. So that's the general idea that if you were to build your user dimension based on yesterday's partition of that same dimension, then you have a much higher complexity score, and that's not desirable, in general, because any change in history would be very prohibitive. It's really hard to recompute things from scratch because you would have to compute a series of things sequentially. You can't do parallel processing for backfills and things like that. So generally, here I'm making the point that we encourage people to have an approach that does not use past partitions to create current or future partitions. And there are ways around that. I'm happy to get deeper into this.
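The cost of a past dependency shows up most clearly during a backfill. In this illustrative sketch (sample data and function names are made up), partitions that depend only on their own source partition can be rebuilt in parallel and in any order, while a self-dependent chain has to be replayed one day at a time.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical raw partitions: date -> values landed that day.
raw: Dict[str, List[int]] = {
    "2020-07-01": [3, 4],
    "2020-07-02": [5],
    "2020-07-03": [1, 1],
}
days = sorted(raw)


def build_daily(ds: str) -> int:
    """No past dependency: reads only the source partition for its own date."""
    return sum(raw[ds])


def build_running_total(ds: str, previous_total: int) -> int:
    """Self-dependent: needs yesterday's output before today's can be built."""
    return previous_total + sum(raw[ds])


# Independent partitions can be backfilled concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_backfill = dict(zip(days, pool.map(build_daily, days)))

# The chained version has to walk history in order, one partition at a time.
chained: List[int] = []
total = 0
for ds in days:
    total = build_running_total(ds, total)
    chained.append(total)
```

A correction to the first day's raw data means rebuilding one partition in the independent design, but every later partition in the chained one.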
Here I wanted to mention file explosion as a by-product of this. So when I was at Airbnb, we used HDFS as the backend for the data lake. And the NameNode was quite an issue there because we used partitioning so heavily with this approach. The good news is that if you use things like S3, GCS or Dremio, that's typically not a problem because they deal very well with the fact that you're going to have a lot of partitions and a lot of files in your lake. So something to mention, but not necessarily a real problem anymore, depending on your context.
But to conclude and leave a little bit of time for Q and A, I wanted to make the point that times have changed quite a bit. And before I was a data engineer, before the term data engineer existed, I was a data warehouse architect. And I read the bibles of this, which were written by the two grandfathers of data warehousing, Ralph Kimball and Bill Inmon. Those books are still relevant in a lot of ways, but I think times have changed. And some of the design principles and patterns that they put forward are not as relevant anymore. Things have changed. We have limitless, cheap storage.
We have distributed databases. We have decoupled compute and storage. We've seen the rise of read-optimized stores that use immutable file formats, right. And instead of having one big [inaudible 00:00:26:05], that has changed. Instead of having a few data specialists really kind of doing the data strategy and processing for the whole company, now we have very large data teams. Everyone is a data professional, a data worker, inside companies now. So I think these books need to be rewritten in some ways, or a new generation of books needs to come out, and I'm hoping functional data engineering comes to mind in future books, future patterns.
So, last comment. These are... I think rules are good to learn. But they're meant to be broken. And that is true of the core principles of data warehousing, and it also applies to functional data engineering as I present it here. So, learn the rules, but please go and break them when it makes sense. Those are just guiding principles. And that's it. I would like to open up for Q and A. We have four minutes left. I apologize. I wanted to have more time for Q and A. I'll be on Slack answering questions for the next hour, at least. So please hit me up, and thank you, everyone.
Thank you, Max. Appreciate... A great presentation. Lots of kudos here in the announcements. For our viewers, we are open for open Q and A. You can just click the button in your upper right-hand corner to share your audio and video, and you'll, automatically, be put in a queue. So if you have questions, please go ahead and Max is about... "Please pass it to Max. What's his view of Apache Iceberg?" We've got that written in there.
And I think that the Metastore was a great contract that allowed tools and systems to [inaudible 00:28:46], but now it's hurting us because it's not as advanced as it needs to be to build the next generation of tools. I think we've seen Iceberg come up and solve some of the problems that we have around the Hive Metastore. So, I'm really excited for this project, and this approach, and this idea of keeping track of which partitions are getting new data in time and keeping the ones that are getting mutated in the background so that you can query different points in time of a table.
All right. We have a couple of questions here related to partitions. Let's see here. Oh, it keeps going up. "With Delta and Iceberg, do you still need to worry about partitions and having to always overwrite the whole partition? Is that view a bit outdated given the new technologies?"
Yeah. There's... I don't know. So there's Hudi too. There's the idea of, hey, how do you enable kind of a transaction log inside of a lake, which is, I think, a really interesting idea as well. To me, I'm in the school of thought that the clarity and the simplicity of the model is really important, right. To have this guarantee that generally a task mutates one partition is a strong foundation to build upon. I know this is a very batch-centric way of thinking, but generally, I think those guarantees are great. And I don't think those should be hard rules. Where it does make sense to use something like Hudi, say for a fast-changing dimension as opposed to just slowly changing dimensions, I think it makes sense to break out of that pattern when the solution requires it. But it's nice to have that foundation and that pattern in most cases.
Thanks, Max. And that is it. Sorry, folks. We are out of time, but Max will be in the Subsurface Slack community afterwards. So, please continue the conversation there. Thanks again, Max, for a wonderful presentation. This was awesome. Thanks again.
Thank you, everyone. See you on Slack.
Bye. All right. Thanks. Thanks again, Max. That was awesome. Lots of great comments. People are wondering if you're going to write a book.