Subsurface LIVE Sessions

Session Abstract

Airflow has continued to evolve over the past six months as Airflow 2.0 and 2.1 have been released. Astronomer founder/Airflow Committer Ry Walker will walk you through the most important new innovations in Airflow, and preview what the team is working on.

Video Transcript

Ry:    Hi everyone. I’m Ry. I’m an Apache Foundation Member, an Airflow Committer, and one of the founders of Astronomer. Airflow 2.0 was released about seven months ago in December. And you might be wondering what the next big things are that the project and the community are going to focus on. So in this talk, we’ll cover some [00:00:30] of these recent developments in the Airflow project and a few of the exciting ideas we have for the future of Airflow. I’d like to go through everything fairly quickly so we have some time for Q and A at the end. So before we get into it, I just want to provide a super quick overview of what Airflow is at its core, for those of you who might not be super familiar with it. Airflow is a platform for programmatically [00:01:00] authoring, scheduling and monitoring workflows. So this involves using Python code to create a DAG object and then using operators to form chains of tasks.
So operators can be custom, or they can be off the shelf. The Airflow community has contributed about 445 pre-built operators to date. And those are all made available for your [00:01:30] consumption, but there’s nothing stopping you from creating your own constructs to wrap these tasks in. So to make that a little bit more concrete, here’s an example DAG (DAG stands for Directed Acyclic Graph) that uses Python, Ray and Pandas to build a data frame in Pandas, do some calculations in a distributed manner, fanning it out to multiple workers, and then pull all those results [00:02:00] together and write them to some destination. So again, DAG stands for Directed Acyclic Graph. Directed means that all the tasks must be connected together. And acyclic means that the tasks can’t loop backwards. So there’s only one direction for the arrows.
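The "directed" and "acyclic" properties described above can be illustrated without Airflow itself. The sketch below is not Airflow code and not the DAG from the slide: it uses Python's standard-library graphlib, and the task names are hypothetical stand-ins for the build, fan-out, combine and write steps. It shows that a graph without loops has a valid run order, while a graph that loops backwards is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical, chosen to mirror the steps in the talk.
pipeline = {
    "build_dataframe": set(),
    "compute_part_a": {"build_dataframe"},
    "compute_part_b": {"build_dataframe"},
    "combine_results": {"compute_part_a", "compute_part_b"},
    "write_output": {"combine_results"},
}


def execution_order(graph):
    """Return one valid run order; raises CycleError if the graph loops back."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """Report whether the graph contains a loop, which a DAG must not."""
    try:
        execution_order(graph)
    except CycleError:
        return True
    return False


order = execution_order(pipeline)

# Make the first task depend on the last one: the arrows now loop backwards.
looped = dict(pipeline, build_dataframe={"write_output"})
```

Here `order` always starts with `build_dataframe` and ends with `write_output`, while `has_cycle(looped)` reports the loop.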
And since Airflow can orchestrate tasks that interact with really any API or command line tool, the potential use cases are very broad.
The most popular use case for Airflow [00:02:30] is in data pipelines, usually moving data and prepping it for analytics or for data science. So what’s the current state of the Airflow project? By all measures, it’s a very successful OSS project. Airflow was started back in 2014 by an enterprising engineer, Max Beauchemin, just before he started working at Airbnb. In 2015, they open-sourced it through Airbnb. [00:03:00] And it started incubating as an Apache project back in 2016. Incubation in Apache is a pretty challenging process. You have to demonstrate that the product… that the project is very much alive, that it can grow, that it has a diversity of contributors and that it has good practices and processes to run it as an open-source project. And through the hard work of the community, Airflow was able to graduate as a top-level project about two years later, December of 2018. [00:03:30] Since then Airflow has changed a lot.
It has a very active, vibrant, growing community. And we’ve had many, many releases. A lot of companies have adopted Airflow and a few companies have started building product offerings around Airflow, including my company. In December of 2020, we hit a pretty important milestone, Airflow 2.0, which was several years in the making, but we got that out. And since then we’ve [00:04:00] really been working to iterate. So Airflow is one of the top five most active Apache projects by number of commits in 2020. It’s a big deal because there are lots of large projects in there like Spark, Kafka, Hadoop, Flink, and Airflow’s on par with them. We’ve had more than 1500 code authors contributing code to Airflow, which is on par with Spark. We have a very active community [00:04:30] driven release cycle, and this is really great for our users.
So as a result, Airflow has become very popular.
It’s used by thousands of companies around the world and by our measure, it seems to have become the most popular open source workflow orchestrator, really with no signs of slowing down. And here we can see a little testament to the popularity of Airflow. This chart shows the number of downloads per week of the Airflow packages from PyPI. The red section is Airflow 2.0, and [00:05:00] it’s exciting for us to see that the community is really beginning to adopt Airflow 2. One of the challenges with adopting the latest version is there were a lot of changes that require some form of a migration, and a lot of times once people get data pipelines running, they don’t like to mess with them. So we’re excited to see the uptake, but it’ll take some time as users migrate. We can’t track [00:05:30] exactly where everyone is that’s running Airflow, but we can see a little bit of this by looking at visits to our websites. The community’s global, obviously: Bay Area, New York, London, Moscow, Beijing, São Paulo.
We have big user groups in all those geographies. And yeah, the main thing we would say is we really do welcome community input to the project; any sort of contributions are greatly valued. [00:06:00] Airflow is a great technology, but we believe that the community is actually our greatest strength. Community over code is a motto from the Apache Software Foundation that’s really embedded in Airflow’s DNA. So we really strive for transparency and communication and decision-making. We’re also very focused on making sure that Airflow’s management committee, known as the PMC, is diverse and has many views represented, [00:06:30] not only from a vendor diversity standpoint, but also the diversity of individuals, backgrounds, and experiences. Finally, I want to just mention, we do recognize non-code contributions.
We value contributions to documentation, our website, events and processes as much as we value contributions to our code. So finally, one of the projects that I’m involved with is the Airflow Summit. And just to mention, we just closed that out last week, [00:07:00] and, incredibly, we had nearly 10,000 people register to attend, which is a 67% increase.
Yeah. And finally, this is my last slide on the state of Airflow: code contributions to Airflow have continued to grow even beyond becoming a top-level project, eclipsing the early days pretty dramatically as we move into the Airflow 2 era. So I’ll start with what’s changed in Airflow [00:07:30] since the Airflow 2 era began. And I’ll just walk you through a few of the features that I find pretty cool and interesting. So we’ll start with the TaskFlow API. The TaskFlow API is a… Prior to Airflow 2, you had to set inputs and outputs of tasks if you wanted to do data sharing across tasks. So if you wanted to do an extract, transform, load, you had to basically [00:08:00] make the decision where you wanted to stick your output and explicitly call that output as an input of the next task, which was great because it gives you lots of flexibility.
But if, generally, you want the output of one task to be the input to the next, it felt like a little extra typing and also left some room for error. So, in Airflow 2 you can nest the functions [00:08:30] and Airflow automatically connects inputs and outputs when you do this. So, you can basically chain these Python functions together, and it magically will provide the output of one task as the input of the next task. So not only does it save code, or I should say, save an abstraction, but it’s also just a shorter [00:09:00] syntax, especially for simple DAGs. So you can just define the three methods and then chain them together by nesting them. In Airflow 2, I should say in Airflow 1, we had a very sparsely populated REST API.
It was marked experimental, and in Airflow 2, we now have a full API across all the objects with really amazing documentation.
This is a little screenshot from our docs website. [00:09:30] Anything you can do in the Airflow CLI or UI, you can now do via this very robust Airflow API. Another big one was… A lot of the big companies considering Airflow early on were worried that Airflow 1 had a single point of failure in the scheduler. The scheduler’s job is to basically manage task state across all of the active running DAGs. So if it died, [00:10:00] no work would get scheduled. So that’s not a pretty thing. In Airflow 2, we modified the architecture to allow you to spin up multiple instances of the scheduler. They communicate with each other and work together to handle the duties of managing task states across all the active DAGs. So, yeah, this is the HA setup and… Oh, sorry, let’s go on here.
No. Okay. And not only did we make [00:10:30] the scheduler highly available in Airflow 2, we also made other performance improvements so that the scheduler now is pretty lightning fast. So a single scheduler is now 10X faster in terms of task lag. We measured that as the time between the first task ending and the next task starting. We cut that down, in some cases [00:11:00] more than 10% faster, or I’m sorry, 10X faster. And so combining those two ideas, really the scheduler is no longer a real or potential bottleneck for Airflow. So, it’s always nice to remove a bottleneck like that. Another nice feature that we rolled out in 2.0 is something we call task groups.
In this case, you can decorate [00:11:30] your tasks with this grouping mechanism, which allows you to fold and unfold groups of tasks in the Airflow DAG viewer UI, which is what we’re seeing a little animation of here. So, it’s just a nice way to provide a UI hint of sorts. It doesn’t change how the tasks execute.
It’s just a UI improvement and it makes it much easier to reason over a larger, complicated DAG [00:12:00] that has many, many tasks. Sometimes DAGs have upwards of hundreds of tasks. So, the community really was looking for a way to be able to fold and unfold parts of the tree.
We also refreshed the UI in Airflow 2. Again, the project was born back in 20… yeah, 2014, 2015, and didn’t have a whole lot of [00:12:30] UX talent applied to the project through many of its years, but we’ve made a lot of UX improvements and visual cleanups as we had true front-end developers join the project. So, all these little improvements really add up to a much more polished product. I don’t have a screenshot of it, but I’ll just say, download it and give it a try. You’ll see, it looks nice. Airflow 2 also brought along a concept called Smart [00:13:00] Sensors. It helps organizations who fire off lots of third-party API calls and poll for results. So [inaudible 00:13:10] use the sensors where maybe you send a job to Spark or Databricks, and you’re waiting for the results to come back.
By default, if you have a task that’s just waiting for the results, it’s taking up a valuable slot in the array of [00:13:30] workers that Airflow’s controlling, and that’s not very efficient. So, Smart Sensors is basically a way to roll up all of these waiting sensors into a single process that continually loops through all of the waiting sensors. And if it finds a positive result, in other words something’s completed, it hands control back to the originating DAG. So this feature basically saves [00:14:00] a lot of compute resources and it’s actually pretty meta in the sense that it actually uses Airflow inside of Airflow. The feature actually uses an Airflow DAG to accomplish its goals. So you register these waiting sensors in essentially a database table, and then Airflow is running another DAG to run through all those and keep checking them until it finds a hit.
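The idea of rolling many waiting sensors into one loop can be sketched in plain Python. This is only an illustration of the pattern as described in the talk, not Airflow's Smart Sensor implementation: the `SensorRegistry` class, the `completes_after` helper and the job names are all hypothetical, and an in-memory dictionary stands in for the database table.

```python
from typing import Callable, Dict, List


class SensorRegistry:
    """Hypothetical stand-in for the table where waiting sensors are registered."""

    def __init__(self) -> None:
        self._waiting: Dict[str, Callable[[], bool]] = {}

    def register(self, name: str, poke: Callable[[], bool]) -> None:
        """Record a waiting sensor and the check that says whether it is done."""
        self._waiting[name] = poke

    def poll_once(self) -> List[str]:
        """Check every waiting sensor once; return and drop the completed ones."""
        done = [name for name, poke in self._waiting.items() if poke()]
        for name in done:
            del self._waiting[name]
        return done

    def pending(self) -> List[str]:
        return sorted(self._waiting)


def completes_after(polls: int) -> Callable[[], bool]:
    """Build a fake external job that reports success on the given poll."""
    remaining = {"n": polls}

    def poke() -> bool:
        remaining["n"] -= 1
        return remaining["n"] <= 0

    return poke


registry = SensorRegistry()
registry.register("spark_job", completes_after(1))
registry.register("databricks_job", completes_after(3))

# One loop services every waiting sensor, instead of one worker slot each.
finished_per_pass = []
while registry.pending():
    finished_per_pass.append(registry.poll_once())
```

Each pass through the loop checks all registered sensors, so two waiting jobs cost one process rather than two occupied worker slots.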
So it’s a pretty cool use of Airflow inside of its own product.
[00:14:30] Airflow 2.1 actually just came out more recently. It doesn’t have quite as many features because it came out fairly close to Airflow 2, and Airflow 2 was waiting a few years to get released. It includes a lot of fixes. A lot of the feedback we got from Airflow 2.0 is incorporated in Airflow 2.1, obviously, and it makes Airflow 2 much more [00:15:00] ready for prime time. Now we’re pretty confident that it’s pretty solid at this point, but we did introduce a few new features in 2.1 as well. One improvement is the DAG Calendar View. The DAG Calendar View provides visibility over the full state of the DAG by displaying the aggregated DAG run states in a calendar. So this makes it possible to monitor the state [00:15:30] of thousands of DAG runs in a single view that’s concise and really easy to understand. It’s particularly useful if you’re running a large backfill; let’s say you’re trying to rerun a certain DAG over a year’s worth of input.
The Calendar View is a very useful tool. Another cool thing we’ve done in Airflow 2.1: we brought along some great security features, [00:16:00] such as auto-masking passwords and sensitive information in both the Airflow UI and the event logs. And yeah, so that brings us up to date on a lot of the things we’ve accomplished in Airflow 2, but we’re actually in the process right now of working on Airflow 2.2 and beyond. So I thought I would talk a little bit about where we’re going with that. What are some of our objectives and [00:16:30] projects? So a key objective, I would say, is that we believe that Airflow is being used on a very wide set of use cases, but it’s not clear that Airflow welcomes all those use cases.
So one of the things that the project wants to do is clarify our position that Airflow should be the go-to orchestrator for every data workflow.
And that’s saying a lot, but we’ll talk a little bit more about what we mean by that in a few minutes here. [00:17:00] We’re also super committed to making DAGs a joy to write. And I know that sounds challenging, but we’re very committed to improving the DAG authoring experience. And this can come in many forms; we’re considering all ideas. At Astronomer, for example, we recently created this searchable registry website, which makes all the pre-built integrations and sample DAGs that the community has authored more discoverable, available via a search [00:17:30] interface. So, that’s one. This one’s an interesting one: AIP, which stands for Airflow Improvement Proposal, AIP-39. We’re working on expanding DAG scheduling beyond the crontab format. So if you’re familiar with crontab, you basically have a row of six numbers that represent a schedule, [00:18:00] and it has a little bit of syntax there that lets you get creative with it, but it doesn’t quite meet the needs of every DAG. For example, if you wanted to create something that ran every Sunday, except for a certain Sunday in July, for some reason, you would have to put the logic for that inside the DAG, and that’s not ideal. So here we’re basically exposing the [00:18:30] ability to create a Python class that implements this timetable interface and contains your custom timetable logic.
Airflow Improvement Proposal-40. This one’s pretty cool. It builds on the smart sensors. It brings Airflow more into the async world. [00:19:00] So the idea here is that the smart sensors feature in Airflow 2.0 was a great step forward, but it only applies to the Airflow sensor operator.
So it’s a pretty limited use case. In AIP-40, which is… we’re basically generalizing this so that any sensor or operator can choose to defer its execution based on an asynchronous trigger, and where [00:19:30] all the triggers then run in one or more processes for efficiency. So, it’s really an opportunity for us to continue the progress we’ve made with smart sensors, but make it even simpler to reason about. So, we suspect that if this is implemented in the most popular operators, we’d probably be able to reduce the amount of resources some Airflow deployments use [00:20:00] by over 50%, if all of their idling operators were using this mechanism. The Airflow CLI doesn’t use the new API yet.
We were able to get it done, but yeah, we’re working now to make sure that our CLI uses the API. So that’s a prep work and project. Dynamic DAGs is another really interesting topic. Airflow doesn’t currently support some of the [00:20:30] most dynamic forms of dynamic DAGs and we aim to change that. And so what do I mean by that? We say Airflow can create dynamic DAGs because you literally construct a DAG via Python code, but there are some use cases, such as, for example, let’s say you wanted to retrieve a list of files and then kick off an Airflow task for each file. This is like a fan-out strategy. You can’t really do [00:21:00] that right now with Airflow. You can fan out within the context of an individual task, but you can’t create a number of tasks, task objects, today with Airflow; the DAG structure has to stay the same between DAG runs or executions of the DAG.
And so, you also don’t get the benefit of parallelism unless you can split that work out as separate tasks. So it’s on our list [00:21:30] to support this. The task group feature that I mentioned earlier shows us some hints on how we could implement this in the UI. So yeah, we’re excited about this dynamic use case.
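The workaround mentioned above, fanning out within the context of a single task, can be sketched in plain Python. This is an illustration under stated assumptions rather than Airflow code: `list_files`, `process_file` and `fan_out_inside_one_task` are hypothetical names, and the file list and the "work" are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def list_files() -> List[str]:
    # Hypothetical listing; in practice the count is only known at run time.
    return ["a.csv", "b.csv", "c.csv"]


def process_file(name: str) -> int:
    # Placeholder work: pretend the length of the file name is a row count.
    return len(name)


def fan_out_inside_one_task(max_workers: int = 4) -> Dict[str, int]:
    """Body of a single task that handles however many files it finds."""
    files = list_files()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with file names.
        counts = list(pool.map(process_file, files))
    return dict(zip(files, counts))


results = fan_out_inside_one_task()
```

From the orchestrator's point of view this is still one task: the per-file work is invisible in the DAG structure, which is the limitation the talk describes.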
A second kind of dynamic DAG, I call it dynamic, a dynamic DAG that we don’t support yet, is parameterization. So if you want to run a certain template DAG, let’s say, but with more than one or [00:22:00] many configurations, you can’t do that right now in Airflow. So you can achieve this by using a DAG factory, have a Python file that generates many DAGs, but it doesn’t…
Yeah, yeah. Each of those is a separate construct, so it’d be nice if this could be managed in the UI and, of course, still be able to do this via files so you can track it [00:22:30] in Git, but we aim to make it possible, without any workaround, to run the same DAG multiple times with different parameters inside of Airflow.
Another important type of DAG is an event-triggered DAG. So this is the case where you want to use Airflow tooling to build a DAG, but you really are only looking to run [00:23:00] it as a response to some external event. So currently Airflow expects a schedule with the DAG, hearkening back to its origin as a periodic batch processor. And you can say None in terms of the schedule, but the fact that it expects that makes you feel like perhaps you’re not using Airflow the right way, if you’re ignoring the scheduler. So, while you can trigger [00:23:30] DAGs today outside of their schedule, there’s an API endpoint for that, it feels like maybe it’s an anti-pattern or a code smell. So we plan to make event-triggered DAGs really be a first-class supported use case in Airflow.
And lastly, we are working on making Airflow aware of DAG versions. So today, when you deploy a new revision to a DAG, the old version [00:24:00] is effectively deleted or forgotten. So you can’t really go back and see which version of the DAG ran three months ago, if you were looking at a certain run. So in this new scenario, Airflow will become aware of all the previous versions of DAGs and act accordingly. So more to come on this as we get into it.
And then to close out my presentation, I’ll just do a really short commercial for Astronomer so you guys understand what we do a little [00:24:30] bit, really quickly. At Astronomer, we’re working to make Airflow itself better. Twenty of the top 50 Airflow contributors are working at our company.
We’re also working to make Airflow run better, as evidenced by the scheduler improvements that our team contributed, but also we’ve built out Helm charts and other things to make it easy to get Airflow up and running inside your company. And we’re making it easier [00:25:00] for organizations to adopt Airflow. So, we have two core products. The first is called Astronomer Cloud, and it’s a fully managed Airflow as a service, a SaaS offering that we’re putting a lot of energy into. We also have commercially supported software called Astronomer Platform. You can install it on your Kubernetes; it provides Airflow as a service to your internal users and teams, and it aims to reduce your Airflow DevOps time by [00:25:30] up to 90% while providing a safety net through our team of Airflow PMC members, committers, contributors, and experts of all kinds. So, with that, I’m happy to take any questions.
Speaker 2:    So, let’s open it up for Q and A. We have a few questions in the Q and A bar. We also have Arvish Kumar, [00:26:00] who wants to join the audio video. Should we take his question?
Ry:    Yeah, I’d say so. I’d say so.
Speaker 2:    Okay.
Ry:    That’s scary. I appreciate that he’s doing that.
Speaker 2:    That is odd. It didn’t show up. Okay. While that’s happening, here’s a question from Ernst. Do you know some large traditional enterprises converting from commercial job scheduling solutions towards Airflow?
Ry:    Yeah, definitely. In fact, we’re building a team on our [00:26:30] side to help companies with migrations from traditional tools like Control-M, and Oozie is another good example. Yeah. There are a lot of tools that these jobs have been written in that people are converting.
We’ve got a lot of use cases we can share on a one-on-one basis. But I think that there’s a lot of that going on.
Speaker 2:    [00:27:00] Anonymous is asking, would it be crazy to use Airflow as an alternative to Make?
Ry:    I mean, people are doing some crazy things with Airflow. I’d say it probably is crazy. Yeah, sure. Like, CI/CD is another really interesting thing where the worlds of CI/CD and Airflow are overlapping, as CI/CD tools become more advanced. Sometimes you can run data pipelines in a CI process. And so, [00:27:30] yeah, I would say, at the end of the day though, we have… I just did a call, or I’m sorry, I talked with Jason from PayPal last week, where they use Airflow to basically monitor their fleet of 9000 servers and basically detect when drives are bad and file tickets with Jira and all this kind of stuff. I would say, not traditionally a use of Airflow, but they find it super nice to basically automate [00:28:00] the job that otherwise a human would have to do.
Speaker 2:    Another question from Ernst, I’m just going up the list in chronological order: which IAM systems does Astronomer integrate with?
Ry:    Yeah. I mean, it depends on… by Astronomer, there’s our software and, yeah, it runs… Obviously we do a lot with Amazon, the primary clouds. And so, [00:28:30] there are different levels of IAM integration too. There’s, like, at the processes-running standpoint, and then also the tasks being able to use IAM to do jobs, for example. So we have integrations at both of those layers, but I would say our support for Amazon is… it really just goes down the curve of the most popular clouds. We’re also very good with Google and Azure as well.
Speaker 2:    [00:29:00] I think we have time for a couple more. How is Astronomer different than Google Cloud Composer?
Ry:    Yeah, that’s a great question. I would say that the primary differentiator at this moment is that we do all the releases for Airflow.
We’re driving the roadmap and, as time goes on, in our cloud we’re looking to add additional features on top of Airflow, and some of those will be coming out here in [00:29:30] the beginning of next year. But at the end of the day, both of us run Airflow reliably for our customers. I’d say the biggest difference… The most important difference between Composer and Astronomer is that we’ll run in Azure and Amazon for you as well. Now, Amazon has a managed Airflow service as well. I think the user experience in our product is superior. We really focus on that. Whereas, [00:30:00] the other ones, I feel like it’s a little bit… You have to be more of a DevOps person to use those tools, whereas we give teams the ability to create a new cluster inside of a nice, simple UI; you don’t have to go into a cloud console to do the work.
Speaker 2:    Thanks, Ry. We’re at the 15 minute mark. Let’s have one more question. What are some tentative timelines for 2.2 and beyond, [00:30:30] if there are some great things being planned?
Ry:    Thank you. Yeah. 2.2. I believe the goal here is to get releases out very frequently. I think we want to do monthly patch releases and then quarterly point releases. Another question is when is 3.0 going to happen? And it really just depends on how much we break things in the process, but yeah, you’ll see these things coming out fairly quickly. [00:31:00] And obviously if you like some of those features, let me know and we’ll make sure we prioritize them accordingly, because all the planning here is done openly in the Airflow GitHub.
Speaker 2:    Cool. Thank you.
Ry:    Yeah.
Speaker 2:    So that’s all the questions we have time for in the session. If we didn’t get to your question, you can have an opportunity to ask it in the Subsurface Slack Group. Okay? So, Ry will be there.
A couple of quick reminders: please [00:31:30] fill out the short Slido session survey on the right-hand side of the page and be sure to visit the Expo Hall to check out the booths, get some cool demos on the latest tech and win some great giveaways. Thanks so much. Enjoy the rest of the conference.
Ry:    Thanks everyone.
Speaker 2:    Thank you, Ry. See you later.
Ry:    All right. Thanks man. See you.