May 3, 2024
Automating Reflections Management with the Dremio-Airflow Plugin
Dremio is becoming an integral part of the OLAP layer in modern data platforms due to its high-performance and cost-effective architecture. Leveraging Dremio’s Reflections query acceleration enables us to achieve sub-second query latency. At Prevalent AI, we use Dremio as the query execution engine and Apache Airflow as the orchestration engine for our data platform.
We wanted to automate Reflections management in Dremio, including creating, updating, and refreshing Reflections. With no native plugin available for Dremio in Apache Airflow, we developed a provider package to automate these processes. This allows for seamless orchestration of Dremio with Airflow, benefiting data engineers planning to integrate Dremio into their data pipelines.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Muhammed Irshad:
Hello everyone, welcome to the session, and first of all, thanks for joining. This session is all about a plugin we have developed for Airflow to automate Reflections management in Dremio. If you are not familiar with these terms, do not worry, I will quickly explain them. Reflections are one of the hallmarks of Dremio and are what make queries run faster. When you create a Reflection in Dremio, it builds a highly optimized relational cache of the data source. When you fire queries against that data source, Dremio intelligently delegates them to the cache layer so the queries run faster; with the proper Reflections in place, queries can be 10 to 100 times faster.
If you want to know more about Reflections, there was a session from Benny Chow that explained them in detail, so do check it out. Apache Airflow is a very popular open-source orchestration engine. The beauty of Airflow is that you can define data workflows as pure Python code, which brings a lot of flexibility. Airflow also has a highly pluggable architecture, where you can plug in anything you want to bring into Airflow. That is what we have built on with this plugin.
So with that, let me introduce myself. I am Muhammed Irshad, Big Data Architect at Prevalent AI. I have 12-plus years of experience working on different big data projects. Currently, I am part of the architecture team here at Prevalent AI, which manages our data platform. Over to my co-speaker.
Sanchit Sreekanth:
Hey everyone, my name is Sanchit, and just like Irshad, I work at Prevalent AI. I am an active developer in the platform engineering team, where we have helped build a cybersecurity solution platform. As part of this, I have almost three years of experience in the big data world, and one of the projects I have enjoyed developing is the plugin we are just about to demonstrate.
Muhammed Irshad:
Okay, so let's look at the agenda. In the first couple of slides, I want to explain why we need to automate Reflections management in Dremio by walking through some of the use cases we have for Dremio. Then Sanchit will do a nice demo of some of the cool features we have built into this plugin. After that, I will continue with the features and roadmap we have in place for the plugin, and I will explain how you can try it out on your own.
About Prevalent AI
Okay, so before I get into the details of the topic, let me briefly introduce Prevalent AI. We build cybersecurity analytics powered by AI; that is what we do. These days, organizations have huge volumes of data, generated both by the applications they build and by the logs produced by the different tools they use. What we do is fuse together whatever data is available within the organization to build meaningful insights, so that organizations can better manage their cybersecurity risk posture. How do we do it? We have a platform we call SDS, the Security Data Science Platform. In summary, we mainly have expertise in three things: cybersecurity domain knowledge, big data technologies, and advanced analytics.
SDS High Level Architecture
With that, let me explain the high-level architecture of the platform. I know there is a lot here, but I am not going to explain everything; I just want to focus on two key things. The first is the role of Airflow in managing our data pipelines, and the second is the role of Dremio, shown here in the presentation layer section, where it acts as the data accessibility layer. As I mentioned, we have the capability to ingest data from various cybersecurity tools. Once the data is ingested, the section explained here under pipelines holds the processing logic we have built, which is the heart of what we do. Airflow comes in here and orchestrates all of these data pipelines. We have 800-plus data workflows, covering parsing the data, cleaning the data, and running the processing logic we have built. And where we need to automate data quality checks, we also make use of Airflow.
Once that is done, our gold layer is populated, which is very complex and highly dynamic in nature, because we are dealing with a large volume of data and the relationships between these data sources are highly complex. So we need a highly scalable, fast query engine in our presentation layer, and that is why we chose Dremio. The key consumers of this gold layer are primarily the solutions we build, shown under the solutions section here; self-service dashboards we may need to build using BI tools; and the section explained under target applications here, which mainly covers cases where we need to publish the data into third-party applications.
In addition to all of this, of course, our data scientists and data analysts need access to this data layer for doing EDA, exploratory data analysis, and so on. So that is a high-level overview of the platform. With that, let us look at some of the use cases we solve with Dremio. First and foremost is analytics and visualization on huge volumes of data without compromising on concurrency. Then there are ad hoc queries, especially during exploratory data analysis, which need to be fast; we use Dremio for those too, and we leverage Dremio's self-service analytics capability to quickly build solutions without much engineering overhead. Another cool thing we have built on top of the platform is a conversational capability: we can ask questions in plain text, our data science models convert them into the proper SQL queries and delegate them to Dremio, Dremio returns the results, and the answers get visualized in the dashboards. That is one thing we are implementing with Dremio. Finally, there is enhanced accessibility for our data analysts and data scientists; this mainly leverages Dremio's Apache Arrow-based Flight SQL endpoint, so that our data scientists can quickly make use of the data.
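To give a feel for that last point, here is a minimal sketch of how a data scientist might query Dremio over its Arrow Flight endpoint with pyarrow. This is not taken from the session itself; the host, port, credentials, and table name are placeholders.

```python
# Minimal sketch: querying Dremio over its Arrow Flight endpoint with pyarrow.
# Host, port, credentials, and table name are placeholders.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")

# Exchange username/password for a bearer token (Dremio Software-style auth).
token = client.authenticate_basic_token("my_user", "my_password")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT * FROM hivemetastore.iceberg.daily_product_sales LIMIT 10"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Read the result stream as an Arrow table and hand it to pandas for EDA.
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_all().to_pandas()
print(df.head())
```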
Need for Automating Reflections Management
So let us look at why we need to automate Reflections management. First and foremost, we need a programmatic way of defining Reflections. Dremio has the capability to define Reflections through its nice UI, where you can configure everything for a Reflection and click save, but that is not okay for us because it needs manual intervention. So we need a programmatic way of defining Reflections, and we also need Reflection refreshes managed as part of our data pipelines. Dremio does come with an option to schedule Reflection refreshes, again configurable from the Dremio UI, but since Airflow manages the whole pipeline, we need the refresh to be triggered from Airflow.
Solution Design
Another thing is, of course, that when you do application development you have to deploy the application into different environments, maybe dev, prod, and so on. In that case, perhaps as part of your CI/CD pipeline, you have to move the Reflections defined in one environment to another. We also need the Reflections under version control, so that in case of any failure we can quickly revert. Those are a few of the requirements behind automating Reflections management. Now let us look at how we implemented it.
At the top we have Airflow, which manages the whole pipeline, and then Dremio comes in at our presentation layer. Dremio has a nice REST API for interacting with all of the services it provides; be it running queries, creating Reflections, or refreshing them, everything has a REST API. So we needed a bridge between Airflow and Dremio's REST API. We looked at various options and found provider packages, a mechanism introduced in Airflow 2.0 that helps Airflow interact with external systems and lets us define the logic as an independently deployable package. If you look at the community, there are more than a hundred provider packages available for various technologies; the ones listed here are just a few examples. Databricks, Trino, Impala, Snowflake, they all have Airflow provider packages. But unfortunately, there was no provider package available for Dremio. Do not worry, though, we have you covered: we have developed one within Prevalent AI and are happy to open source it, so that the community can leverage the capabilities of Airflow and Dremio together.
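As a rough illustration of the bridge being described, the sketch below shows a minimal custom Airflow hook that authenticates against Dremio Software's REST API and posts a catalog entity. The class name, connection id, and structure are hypothetical and are not the plugin's actual code; the endpoint paths follow Dremio's documented v3 REST API.

```python
# Illustrative sketch only: a minimal Airflow hook wrapping Dremio's REST API.
# The real provider package is more complete; names here are hypothetical.
import requests
from airflow.hooks.base import BaseHook


class DremioRestHook(BaseHook):
    """Authenticates against Dremio Software and exposes a simple catalog call."""

    def __init__(self, dremio_conn_id: str = "dremio_default"):
        super().__init__()
        self.conn = self.get_connection(dremio_conn_id)
        self.base_url = f"http://{self.conn.host}:{self.conn.port}"
        self.token = self._login()

    def _login(self) -> str:
        # Dremio Software's login endpoint returns a token used as "_dremio<token>".
        resp = requests.post(
            f"{self.base_url}/apiv3/login",
            json={"userName": self.conn.login, "password": self.conn.password},
        )
        resp.raise_for_status()
        return resp.json()["token"]

    def create_catalog_entity(self, spec: dict) -> dict:
        # POST /api/v3/catalog creates catalog entities such as sources and spaces.
        resp = requests.post(
            f"{self.base_url}/api/v3/catalog",
            json=spec,
            headers={"Authorization": f"_dremio{self.token}"},
        )
        resp.raise_for_status()
        return resp.json()
```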
Advantages of Dremio-Airflow Provider Package
Okay, with that, let us talk about some of the key advantages we get with the Dremio-Airflow provider package. Everything mentioned here was already discussed on the previous slide: a programmatic way of defining Reflections, refreshes, deployment, version control; all of it is satisfied because we can define workflows within Airflow as pure Python code. In addition, we have the advantage of being available as open source, so we can make use of the power of the community: the community can add support and features to the provider package, and as Dremio releases new versions, the community can keep the plugin up to date. Those are a few of the advantages we get with this plugin. Now let us see a cool demo from Sanchit, showcasing some of the nice features we have developed. Over to you, Sanchit.
Sanchit Sreekanth:
Dremio-Airflow Provider in Action
Thank you, Irshad. I hope you are all able to see my screen; it looks like it. So I am going to get started with this demo. Let me lay some groundwork for what I am about to do: I am going to try to replicate, on my local system, a data pipeline run that you would normally have in your organization on terabytes of data. Let us say that all of your data has been processed, analytics have been run on it, and you have stored it in your data lake. What are you going to do next? You are going to use an OLAP layer like Dremio to power your dashboards. First, you would come to Dremio. I have a local instance of Dremio over here, and as you can see, I have not configured any sources yet; I am going to be doing that shortly using the plugin. Normally you would come here, click add source, and add whatever source you want. I am going to be using Hive 3.x because my data lake is in Iceberg and I have written it using the Hive catalog of Iceberg. I could do this from the UI, but like Irshad mentioned, we are going to use the power of Dremio and Airflow together via the plugin. Over here I have a local instance of Airflow, and I have two DAGs written. Let us go into the first one, called create source. Again, do not worry, I am not going to go too deep into Airflow technicalities, but all you need to know is that this is what a DAG looks like: you have certain tasks, the rectangular shapes you can see here, and each task is basically an instance of an operator in Airflow.
An operator is basically a building block Airflow provides to perform a task. Airflow has a lot of built-in operators, but this one, as you can see from the name, is the Dremio create source operator, and it comes from our open-source plugin. Like Irshad mentioned, this is configuration as code; Airflow itself is basically configuration as code. If you look at the code, it is simple Python; you just need to write a few lines like this to create the DAG I mentioned. Again, do not get bogged down in all of it; I just want to bring your attention to this little piece over here. It is a class called DremioCreateSourceOperator, which I have imported on line number four from my provider package, and one of its parameters is the source spec. What is the source spec? It is the specification you give this operator for the source to create in Dremio, and it corresponds exactly to the Dremio API standards for defining a source. So if you have sources like Snowflake or anything else that you want to configure, you can just refer to the official Dremio documentation and give the source spec over here.
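For illustration, the sketch below shows roughly what such a create-source DAG could look like. The import path, connection parameter, and config values are assumptions rather than the plugin's exact API; the source_spec structure follows Dremio's catalog API format for defining a source.

```python
# Sketch of the kind of DAG being described; import path, connection id, and
# config values are assumptions rather than the plugin's exact API.
from datetime import datetime

from airflow import DAG
from dremio_provider.operators.dremio import DremioCreateSourceOperator  # hypothetical path

with DAG(
    dag_id="create_source",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually, as in the demo
    catchup=False,
):
    # source_spec mirrors Dremio's catalog API payload for defining a source.
    DremioCreateSourceOperator(
        task_id="create_hive_source",
        dremio_conn_id="dremio_default",  # assumed connection parameter
        source_spec={
            "entityType": "source",
            "type": "HIVE3",              # Hive 3.x source backing the Iceberg data lake
            "name": "hivemetastore",
            "config": {"hostname": "hive-metastore-host", "port": 9083},
        },
    )
```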
In my case, I am giving the type as Hive 3, like I just mentioned; I have given it a name, a hostname, and a port. Okay, so I have written this code and Airflow has given me this DAG, so let us go ahead and trigger it. You can see the status of the tasks right here in Airflow itself. This one is currently in the running state, and whether it succeeds or fails, you can see it right here. It has succeeded, so let us go back to Dremio and see what has happened. All right, my source has been added, and as you can see, we have automated all of this using the plugin. Let us go one step further and see what is inside. We have two schemas here, Iceberg and Parquet. I am going to go into the Iceberg schema, where I have my Iceberg table called daily product sales. Let us run a query on it just to see what this table contains. I am going to run a sample query like this, where I take the sum of price per unit and group it by date. Okay, so there are three dates here, and if you go into the jobs tab, you will see that this query was not accelerated because, of course, there are no Reflections on this table yet; I have only just created it. So I am going to go ahead and create some Reflections, but not from the UI, where you would normally go over here to the Reflections tab and create them. I am going to once again rely on my provider package. I have another DAG for this called CreateDremioReflection, and if you go into it, let us see what this DAG looks like. We have two operators this time, both of the type DremioCreateReflectionOperator, as you can see right here; as the name suggests, it is used to create Reflections.
Again, I want to show you why this is so useful by taking you through the code I have written. Do not focus on all the technicalities; I just want to bring your attention to this class over here and a few of the parameters we have given. I have given the source, which is hivemetastore.iceberg.dailyproductsales, exactly how Dremio mandates you to specify it. I have my reflection spec, where I give the spec for the Reflection I want; similar to the source, it corresponds to the Dremio API standards. I have the refresh settings, where I tell Dremio how I want the dataset refreshed. And there is another task like this for the aggregate Reflection, where I have given an aggregate reflection spec as well.
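A rough sketch of the two reflection tasks being described follows, including the auto-inference flag discussed a little later in the demo. The parameter names (source, reflection_spec, refresh_settings, auto_inference) are taken from the talk, the import path is assumed, and the spec dictionaries follow Dremio's reflection API format; exact signatures may differ in the plugin.

```python
# Sketch of the two reflection tasks; these sit inside a `with DAG(...)` block
# like the one shown earlier. Names and signatures are illustrative.
from dremio_provider.operators.dremio import DremioCreateReflectionOperator  # hypothetical path

raw_reflection = DremioCreateReflectionOperator(
    task_id="create_raw_reflection",
    dremio_conn_id="dremio_default",
    source="hivemetastore.iceberg.daily_product_sales",
    # reflection_spec mirrors Dremio's reflection API payload.
    reflection_spec={
        "type": "RAW",
        "name": "raw_daily_product_sales",
        "displayFields": [{"name": "date"}, {"name": "price_per_unit"}],
        "enabled": True,
    },
    refresh_settings={"method": "FULL"},  # illustrative refresh policy
    auto_inference=True,   # pick up new columns from the dataset automatically
)

agg_reflection = DremioCreateReflectionOperator(
    task_id="create_agg_reflection",
    dremio_conn_id="dremio_default",
    source="hivemetastore.iceberg.daily_product_sales",
    reflection_spec={
        "type": "AGGREGATION",
        "name": "agg_daily_product_sales",
        "dimensionFields": [{"name": "date"}],
        "measureFields": [{"name": "price_per_unit", "measureTypeList": ["SUM"]}],
        "enabled": True,
    },
    auto_inference=False,  # keep the aggregate spec fixed
)
```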
Okay, so now let us go ahead and trigger this and see what happens. The good thing about this operator is that you can leverage Airflow's built-in functionality to track the status of your Reflections. Here you can see the state coming in: it goes to running, then to deferred. Whether it ends up as success, failed, skipped, or anything else, you can see it right here. As you can see, the task has succeeded. One cool thing is that if your Reflections take a long time to build, you can watch the status stay in running right here; since mine is a very small table, it finished pretty fast.
Okay, back to Dremio. Let us see what has been automated here. If you look at the Reflections, I have the two Reflections I just described: one is my raw Reflection with all the fields available, and the other is the aggregate Reflection with date and price per unit as its fields, which is exactly what I gave in my reflection spec. So this, again, is how you can automate your Reflections without touching Dremio at all. Okay, I want to take it one step further: I want to add a new column. I am just going to run it in Spark; I will add a new column called product type and load some more data into the table. How would you normally handle this in Dremio? You would come over here, click on settings, go to Reflections, edit it, click refresh, trigger the refresh now, and so on; you know how it works. Instead, I am just going to go over here and clear this task, which basically reruns it. That is another thing this operator does: it does not just create a Reflection, it updates the Reflection and triggers the refresh for you automatically; it does all of the heavy lifting. And again, you can go over here and see the logs for everything that has happened; if your task fails, you can see the logs for why it failed, and you can leverage all of the Airflow functionality on top of this. Okay, let me go back to Dremio.
Let us see what has happened. If you go to the Reflections, for the raw Reflection you can see that the new field has automatically been added. You might be wondering how it got added when I did not change the reflection spec. That brings me to another feature we have provided. Going back to the code, you may have noticed this auto-inference flag. If your dataset has a constantly evolving schema and you do not want to keep coming back to update the reflection spec whenever the schema changes, you can simply set auto-inference to true. If you do, the operator we have defined will automatically pick up the updates available in the dataset and put them into the display fields.
I have set auto-inference to true for the raw Reflection but false for the aggregate Reflection, which is why the raw Reflection was automatically updated without me even touching Dremio, while the aggregate Reflection stayed pretty much the same. Okay, so let us run the query, and you can see that the results have been updated; I have got the new field I added over here. Again, you can go into the jobs tab and see that the query has been accelerated and that it used the aggregate Reflection I created.
Okay, before I wind this up, I just want to show you why this matters. Like Irshad mentioned, all of this is just code. So once I have written my Python code like this, I can use my organization's existing CI/CD pipelines to promote it between environments, say from dev to QA or QA to prod; you do not have to build anything new. And of course you get version control for your Reflections. For this demo's sake I have put the reflection spec directly in the code, but if you maintain it in well-structured JSON files and keep those under version control, you can use the version control you already have to track the history.
And as you noticed, I did not use the Dremio UI at all; I controlled everything, from creating the source to creating and updating the Reflections to triggering the refresh, right from my Airflow instance here. Since all of this is programmatically defined, it makes your workflows easier, and all of the heavy lifting is delegated to the operator, bundled up neatly inside this task. You can also use the power of Airflow here: if a task fails, you can use the Airflow notification system to tell you that your Reflection refresh has failed. You may have noticed that I triggered this manually just now, but you can use Airflow functionality to trigger it automatically as soon as your pipeline runs. The possibilities are endless; you can customize this however you like for your organization. With that, I am going to hand it back to Irshad; I hope this demo was helpful. Back to you, Irshad.
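As a sketch of the integration and alerting possibilities mentioned here, the snippet below chains a reflection task after an upstream processing task and attaches a failure callback. The operator import path, parameter names, and callback body are illustrative assumptions; EmptyOperator stands in for a real processing task.

```python
# Sketch: wiring the reflection refresh into the pipeline and alerting on failure.
# These tasks sit inside a `with DAG(...)` block like the one shown earlier.
from airflow.operators.empty import EmptyOperator
from dremio_provider.operators.dremio import DremioCreateReflectionOperator  # hypothetical path


def notify_on_failure(context):
    # Plug in Slack, email, or any other alerting here; context carries task metadata.
    print(f"Reflection task failed: {context['task_instance'].task_id}")


process_daily_sales = EmptyOperator(task_id="process_daily_sales")

refresh_reflections = DremioCreateReflectionOperator(
    task_id="refresh_reflections",
    dremio_conn_id="dremio_default",
    source="hivemetastore.iceberg.daily_product_sales",
    reflection_spec={"type": "RAW", "name": "raw_daily_product_sales", "enabled": True},
    on_failure_callback=notify_on_failure,  # standard Airflow alerting hook
)

# The reflection refresh runs automatically whenever upstream processing succeeds.
process_daily_sales >> refresh_reflections
```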
Muhammed Irshad:
Features & Roadmap
Thank you, Sanchit, that was really nice. Okay, so let us look at the features we have developed and the roadmap. As Sanchit showcased, Reflections management and data source management are currently supported. We are also thinking about a feature to query via the Arrow Flight endpoint; today you can already query Dremio directly over the JDBC endpoint, because Airflow already has an operator for that, so that part works out of the box. Another thing we are planning is workload management: for example, if you want to create a new engine and delegate Reflection refreshes to that engine, you would be able to make use of this feature.
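Since Airflow's generic SQL tooling already covers the JDBC path mentioned here, a query against Dremio could be expressed roughly as below, assuming a connection named dremio_jdbc has been configured with Dremio's JDBC driver and the common-sql provider is installed; the table name is a placeholder.

```python
# Sketch: querying Dremio from Airflow over JDBC with the generic SQL operator.
# Assumes a "dremio_jdbc" connection configured with Dremio's JDBC driver.
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

query_dremio = SQLExecuteQueryOperator(
    task_id="query_dremio_over_jdbc",
    conn_id="dremio_jdbc",
    sql="""
        SELECT "date", SUM(price_per_unit)
        FROM hivemetastore.iceberg.daily_product_sales
        GROUP BY "date"
    """,
)
```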
Let us see how you can try it out. We have everything available in our public source code repository, which is linked from this slide. We have also uploaded an initial version of the package to the Python Package Index, so you can install it in your environment with pip install. There is also a Dremio playground, which Sanchit used to showcase a few things, that you can use as well. And we are planning to submit a PR to the official Airflow repository so this can become an official Airflow provider. That is all. Thank you, everyone.
Sanchit Sreekanth:
Thank you.