Subsurface LIVE Winter 2021
Analytics Engineering in Data Lakes with dbt
Within the past few years, a new persona has emerged on the modern data team: the analytics engineer. On platforms that seek to enable the intuitive workflows of data warehousing in the cloud data lake—powered by engines like Dremio, Spark and Presto—the analytics engineering toolset, including dbt, is a natural fit. By writing all transformation logic in SQL, critical business rules are accessible to the greatest number of people; by templating that SQL with Jinja, storing it in version control, wrapping it with automated tests and documentation, and persisting valuable metadata, the analytics workflow gains the rigor of software engineering principles.
Fabrice Etanchaud, Lead Dev, Maif-vie
Fabrice Etanchaud is a lead dev at Maif-vie, a French life insurance company. He has developed data services for fifteen years, mainly in the environmental remote sensing and intellectual property realms. Fabrice originally fell in love with analytics through his work in enterprise information systems.
Jeremy Cohen, Product Manager, Fishtown Analytics
Jeremy Cohen is the Associate Product Manager at Fishtown Analytics, committed to the development of open source dbt and the sustainable growth of its community. Other hats he has worn include analytics engineer, consultant, course instructor and package maintainer.
All right. Hello everybody, and thank you for joining us for this session. I wanted to go through a little housekeeping before we start. First, we will have a live Q&A after the presentation. We do recommend activating your microphone and camera for the Q&A portion of the session so our speakers can see you just like you can see them. Just simply use the button at the upper right of the screen to [00:00:30] share your audio and video. You’ll be automatically put in a queue and I will queue you up to ask your question. With that I’m delighted to welcome our next speakers Fabrice Etanchaud, Lead Developer at MAIF-VIE and Jeremy Cohen, Product Manager at Fishtown Analytics. So with that, Jeremy, please go ahead.
Thank you so much, Louise. Let me just confirm everyone’s able to see and hear us okay. Hi. [00:01:00] All right. We’re in business.
Fabrice and I are so excited to be speaking with you all here today to talk a little bit about analytics engineering, bringing it to the data lake, the data lake house even, and doing all of this with the tool set that’s made available to us with dbt. So please, let’s start off with just a little bit of introduction. Do you mind if [00:01:30] I ask you to go first? Fabrice, I think you might be muted if that’s all right.
Sorry. My fault.
Thank you, Jeremy, for inviting me. I’m Fabrice. I’m working at MAIF-VIE. It’s a French life insurance company. [00:02:00] And I started working with data three years ago. And as a former software engineer, I had always been concerned with quality. So, how can we bring quality to data transformation? It means, in fact… Oh, sorry. [inaudible 00:02:29] [00:02:30] and modularity, and dbt can help you bring the best qualities to your data transformations.
Thank you so much, Fabrice. And yeah, this is good. We’re doing it live here at Subsurface LIVE. So, maybe some wrinkles along the way, but you’ll get to ride along with all of us as we go. And Fabrice will be sharing with us a demo of some real data, a real [00:03:00] Dremio account, a real dbt workflow a little bit later.
But before we get there, I just want to provide some of the context, some of the background for what’s going on here, both in the ecosystem, in the tool sets we’re using, and in the kinds of workflows we’re trying to enable. So, my name’s Jeremy Cohen, a product manager at Fishtown Analytics. We’re the makers of dbt. I, in particular, get to work on dbt Core. It’s the open source, [00:03:30] critical core element of all the work that we do, at the heart of the modern data stack, at the heart of the extract-load-transform (ELT) stack. That stack is really made possible by the advent of the modern data warehouse, which is what you’ll see on this slide here, but also by the modern data lake house, the modern query engines, of which Dremio is very much a part.
And it’s the ability, [00:04:00] unlike what existed five or ten years ago, to extract and load raw data, bring it into a centralized place, or a centrally accessible place, where it can be accessed with SQL; having raw data that’s then transformed in an opinionated, modular, repeatable way in SQL; and then using that transformed data to power BI, data science, and other pipelines, whatever downstream use cases may exist. [00:04:30] If all we did was hook that raw data up directly to downstream use cases, we’d end up with very, very brittle pipelines. If all we do is have hundreds or thousands of lines of SQL, or even worse, not SQL, inaccessible to our business users, sitting in files to be run on unpredictable, unobservable schedules, all we’ve done is create very brittle pipelines. So, [00:05:00] along with the modern data stack, along with the ELT paradigm, comes the need for things like being able to snapshot slowly changing dimensions in-warehouse, define those transformations in a way that’s modular and reusable, and test those transformations and the source data that’s fed into them.
So that all the assumptions underlying the opinionated business logic we have in our project actually bear out empirically. Being able to deploy it on a schedule, [00:05:30] no more waking up early in the morning to run these pipelines, babysit them, make sure that the report looks okay before it arrives in the CEO’s email inbox Monday morning. And all along the way, being able to document and describe those data assets and processes of transformation so that they live in a place more resilient than the top of our heads. And of course, doing the entire process in a way that is in code, [00:06:00] backed up by code, version controlling that code, with [inaudible 00:06:07] and logging to go along with the automated deployment and testing. These are things that we believe need to happen to have antifragile pipelines, to have resilient and future-proofed data transformation.
These are things that we’ve baked into this open source tool, dbt. You don’t have to use dbt to do these things; it’s just the way that we [00:06:30] believe one could, one should. This larger workflow, though, and this need for central, opinionated data transformation in a way that is resilient, this is the work that we call analytics engineering. And there’s a totally new person out in the world called an analytics engineer who’s responsible for making it happen. This person does not replace or even really displace the work of more traditional data engineering and [00:07:00] data analysis, the work of a data analyst.
I’ve worn two of these hats in my career at Fishtown Analytics. Prior to being a product manager, I was a data analyst and then an analytics engineer. We always believe in moving up the stack, taking the skills that we initially had to make ad hoc reports, and then putting them into something a little more modular and a little more reusable. Core data assets defined in SQL, undergirded by software engineering practices, [00:07:30] and backed up and reinforced by auto-generated data documentation. That’s the work of analytics engineering, and it complements and serves as a multiplying force for the work of more traditional members of the data team.
So, I’ve talked about dbt already a little bit. You’ll get to see it in practice with Dremio in Fabrice’s demo. What is it really at the most literal level? It’s an open source Python program that’s [00:08:00] executed from the command line. At the much more metaphysical level, it’s the conversations that we have in the community of folks who use this software every single day, analytics engineers at the top of their game. This is the tool that undergirds all the work that we do.
What is that work? In a very tactical and tactile sense, it’s writing SQL [00:08:30] and some YAML to back up and configure that SQL. Defining, testing, documenting, and reusing those defined data assets, those defined code snippets. Simply by writing SQL that selects one model from another, dbt is able to infer a DAG, a directed acyclic graph of transformations, and run those models in order. All of this is to say, it’s the power of a coding framework [00:09:00] where we can cut out a lot of the boilerplate SQL, a lot of the DDL and DML that trips up much more brittle pipelines, but without the limitations of a drag-and-drop graphical user interface, without the limitations that attend other, more traditional enterprise data transformation pipelines.
You see in the animation here kind of what I’m talking about, where just by writing the business logic, just by writing the SQL that defines fct_orders [00:09:30] here, by using this magical Jinja ref function, which dbt knows to compile based on the environment we’re running in and uses to infer the dependency graph, stg_orders and stg_customers are now established as upstream parent nodes that power fct_orders. At no point do I need to explicitly describe the relationships in this DAG, simply by using the ref function. Along with all the other powerful Jinja functions that dbt [00:10:00] has at its disposal, I’m able to benefit from these core abstractions, these concepts that scaffold the work I’m doing, and then focus much more intently on the actual SQL, on the transformation I want to write.
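As a minimal sketch of what that looks like in practice (the model and column names here are hypothetical, in the spirit of the fact orders example on the slide, not taken from an actual project):

```sql
-- models/fct_orders.sql
-- Illustrative dbt model. {{ ref('...') }} is compiled by dbt into the
-- fully qualified name for the current target (a dev sandbox or
-- production), and is also how dbt infers the DAG: stg_orders and
-- stg_customers become upstream parents of fct_orders automatically.
select
    orders.order_id,
    orders.ordered_at,
    customers.customer_name,
    orders.amount
from {{ ref('stg_orders') }} as orders
join {{ ref('stg_customers') }} as customers
    on orders.customer_id = customers.customer_id
```

Running `dbt run` would then build `stg_orders` and `stg_customers` before `fct_orders`, with no dependency declared anywhere other than the refs themselves.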
I mentioned Jinja a little bit in the last slide; it goes beyond just the ability to use these kinds of one-off magic functions like ref. It goes toward our ability as template writers, as SQL writers, [00:10:30] knowing that it is an oft-repeated and very declarative language, to repeat ourselves a little less, to write DRYer code, code where we’re not repeating ourselves over and over. That goes all the way to being able to use macros, almost like real functional programming, that we can pull in and call in various models, so that we’re never repeating code even across different files. And it goes all the way to using packages that are open-[00:11:00]sourced and maintained by members of the community.
So, if [inaudible 00:11:08] written, I can benefit from and import into my project, and in that way stand on the shoulders of giants, as so many software engineers are used to doing.
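As a hedged sketch of what such a macro might look like (the macro name and logic are invented here purely for illustration):

```sql
-- macros/cents_to_euros.sql
-- A hypothetical reusable macro: call it from any model instead of
-- repeating the same rounding expression file after file.
{% macro cents_to_euros(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}

-- Usage inside any model:
--   select {{ cents_to_euros('amount_in_cents') }} as amount_in_euros
--   from {{ ref('stg_orders') }}
```

If the expression ever needs to change, it changes in one place, and every model that calls the macro picks up the fix on the next run.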
I’ve talked around it a little bit, but it’s worth stressing this point. I believe one of the strongest pieces about dbt is the people around it, the people using it and making it better every single day. [00:11:30] The community offers not just advice, not just the ability to help answer and debug common errors and problems, but also opinionated best practices for how we do this work. We see the same problems over and over again. Folks, out of the generosity of their spirit and also from having benefited before, publish and share the ways that they’ve been able to do this. Sometimes, that looks like code. Sometimes, it’s packages and playbooks that you can copy-paste or install [00:12:00] as dependencies in your own project. But regardless, it’s benefiting from being part of a class of people who recognize all the work that we do in common.
Today, that looks like over 3,500 companies who run dbt in production and more every day. It looks like over 10,000 individuals, data professionals, in our Slack Channel, in our community. And I’m so excited to see what the next year is going to [00:12:30] bring us and see us in terms of the continued growth of the community.
We can also acknowledge that there have been some aspects of the community that I hope will change going forward. Historically, most of our users have been concentrated on conventional data warehouses like Redshift, Snowflake, and BigQuery, all good pieces of technology, all really exciting stuff, but there are so many folks out there working on data lakes, working [00:13:00] with lake house and query engine technologies like Apache Spark, like Presto, like Dremio. And we think that the principles underlying analytics engineering work, underlying this change to the ELT stack powered by dbt, are just as possible there. As long as you can write the SQL for it, you can do it. So, the advent of analytics engineering is in no way limited to the confines of the conventional data warehouse. It works in the data lake as well.
[00:13:30] So, let’s talk a little bit about dbt and Dremio together. First of all, it’s only possible because Fabrice has created, written, open-sourced, and published a plugin for using the two together, following dbt’s adapter plugin framework. dbt has access to everything Dremio can do in its capacity as a SQL engine, a very powerful SQL engine, and that gives dbt models access to all the data that’s addressable in your data lake. [00:14:00] At the same time, Dremio’s metamorphosis into a cloud data lake house, with Apache Iceberg, with the ability to update data in place… This is only going to make the integration more compelling and give dbt more powerful materialization strategies to leverage.
This is something Fabrice and I talked about, and I quite like it: using dbt as a way, above all, to [00:14:30] separate environments, to be able to work in development in a really opportunistic way, in a user’s home sandbox, and then take the exact same transformations, the exact same logic, and run it as SQL with automatically templated environment namespaces in production as well. In either case it’s the same tool, it’s the same logic, no changes needed.
So, really the question [00:15:00] is how can we do these things in Dremio without using something like dbt? How can we manage environments? Propagate changes? Automate testing of our queries? Develop collaboratively? Automate updates of these tables with these more complicated materialization strategies? Have all of this logic stored in version control? So, I want you to think about that as I hand it over to Fabrice for a dbt Dremio demo. Show us the work in 3D.
Thank you, Jeremy. So, I’m going [00:15:30] to share my screen. Here we are.
[00:16:00] So, we are going to run dbt on Dremio. Here you can see Dremio, yeah? Here is my user space, and this is a shared space. We can consider this the production space. Here’s the data, the sources we will use during the project. [00:16:30] So, there’s the data: it’s French open data about real estate transactions and land use at transaction time, since 2015. The project will be, in fact, to start from the sources and end up with a star schema. So, the end users [00:17:00] can use it in Dremio, connect their favorite BI tool, and make some reporting on top. As Jeremy already said, there is a simple dbt project behind this. You will find SQL files and YAML files. SQL is there to express the transformations, and [00:17:30] YAML is there to configure and document the models. We can quickly look at the data.
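As a rough sketch of the YAML side of such a project (the source and table names below are illustrative, not the actual files from Fabrice’s repository), declaring and documenting raw sources looks something like this:

```yaml
# models/sources.yml -- hypothetical source declaration
version: 2

sources:
  - name: open_data
    description: "Raw French open data sets addressable in the Dremio space."
    tables:
      - name: real_estate_transactions
        description: "One row per real estate transaction since 2015."
      - name: municipalities
        description: "List of French municipalities (communes)."
```

Staging models would then select from these via `{{ source('open_data', 'real_estate_transactions') }}`, which lets dbt track source freshness and draw the green source nodes in the DAG.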
So, each line is a real estate transaction with a date, [00:18:00] the price that was paid, the place, and [inaudible 00:18:07], as well as the type of premises: whether it’s a flat, a house, or a building plot.
Alongside, we have another [00:18:30] data set. Sorry, I lost…
A data set with the list of municipalities in France, with [inaudible 00:18:49] on top. So, for each commune, you can get the department and the region in France. So, with this data, we are going to build [00:19:00] a star schema.
Here are the models that we will use in dbt. In green, at left, you can see the two source data sets. At the end, in orange, we see what’s called an exposure. It’s in fact a group of dbt models that [00:19:30] are needed, for example, to be used in a dashboard. Here you have four tables: a fact table and three dimensions. And here, two staging models. Staging here is used to clean data and to rename columns from the source data sets. [00:20:00] Here you will find three more models. Those models are what are called seeds in dbt. Seeds contain reference data that will be versioned alongside the code in git with the dbt project. Where did I find [inaudible 00:20:26] data? I found [inaudible 00:20:27] in the documentation.
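For context, an exposure is declared in YAML alongside the models. A minimal sketch, with hypothetical names and an invented owner, might look like:

```yaml
# models/exposures.yml -- illustrative exposure tying the star schema
# to a downstream dashboard; names and owner are placeholders
version: 2

exposures:
  - name: real_estate_dashboard
    type: dashboard
    description: "BI reporting built on top of the star schema in Dremio."
    owner:
      name: Analytics team
      email: analytics@example.com
    depends_on:
      - ref('fct_transactions')
      - ref('dim_municipality')
```

Listing the models under `depends_on` is what makes the orange exposure node, and everything upstream of it, appear grouped together in the dbt docs DAG.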
[00:20:30] Sorry, I have a problem with my connection. Yeah. Right.
You have some reference tables in the documentation, [00:21:00] you can see here. I just took the values and put them in CSV files.
[00:21:30] So, now that you have seen our project, you have seen the DAG Jeremy already mentioned. We are going to run dbt. First, we’ll populate the seed tables.
All right. [00:22:00] Let’s go back to Dremio. We are in the so-called unmanaged mode. So, now you can see in my user space, I have three data sets generated by dbt, which contain the reference data taken from the CSVs. [00:22:30] Now, we’re going to run the real models.
I’m [inaudible 00:22:51] staging. And finally, the star schema tables. We go back to my own space and you can see [00:23:00] the remaining datasets. So, here is the fact table, with data cleaned and with column names changed.
The big thing is that we are just now in so-called development mode. When you have finished [00:23:30] your development and everything is good and tested, and I will show you a little bit later how to test your models, you can switch to production just by changing the environment in which dbt is going to generate the models. We switch…
Sorry, let’s switch to the managed environment. [00:24:00] And now the seeds are going to be created in the shared space. Yeah, in the reference subspace. We’ll do the same thing for the models.
And [00:24:30] you can see the two staging tables and the star schema.
The last thing: [00:25:00] at the beginning of our presentation, I talked about correctness and reliability. And in fact, dbt is very strong on that, because in the description of a model, you can embed conditions that must hold on the columns or on the rows [00:25:30] of the tables. And just by running dbt test, dbt will, just like JUnit in Java, play all the tests. Each test is converted into a SQL statement executed against the [inaudible 00:25:47]. And you can see the results here.
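As an illustrative sketch of what these embedded conditions look like (model and column names are hypothetical, in the spirit of the star schema in the demo, not copied from the actual project):

```yaml
# models/schema.yml -- hypothetical test declarations
version: 2

models:
  - name: fct_transactions
    description: "One row per real estate transaction."
    columns:
      - name: transaction_id
        tests:
          - unique      # no duplicate transactions
          - not_null    # every row must have an id
      - name: municipality_code
        tests:
          # every code must resolve to a row in the dimension table
          - relationships:
              to: ref('dim_municipality')
              field: municipality_code
```

Running `dbt test` turns each declaration into a SQL query that counts violating rows; any nonzero count fails the test, which is what surfaces the assumptions that, as Jeremy puts it later, “weren’t up to snuff.”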
Jeremy, I think a…
[00:26:00] Very good.
Très bien. Merci. [Inaudible 00:26:08]. I would just like to call out that the DAG you were showing at the very beginning, the dbt docs site, was also automatically generated, using the dbt docs generate command. And so all the information that’s here, both the visuals and also any descriptions that you’ve added to a specific model, all of that is actually defined in the SQL and YAML [00:26:30] code inside of your dbt project. No front-end development necessary.
[foreign language 00:26:39]. Thank you.
I’ll share my screen once again, just to close us out. And then, we’d be delighted to take your questions. I see there are some really good ones. What we got to see Fabrice do was develop a really simple [00:27:00] transformation chain, building up from source data and some static data that you grabbed from a PDF. I didn’t know that; that’s very impressive. I just found that out. Then, we took that transformation chain and put it into production: same DAG, same code. Refs compile differently just based on the target. We were able to test some crucial assumptions and find some that weren’t up to snuff. And we were able to view auto-generated documentation.
If you want more, I really highly recommend [00:27:30] that you read our docs at docs.getdbt.com. Not to be confused with the documentation of Fabrice’s project, though; you could also check that out by installing the dbt-Dremio plugin from PyPI and cloning Fabrice’s demo project, which I also linked in the Hopin chat. And if you are someone with some Python chops who wants to help see this project to success, you can contribute. And I’m sure Fabrice would be happy to review [00:28:00] your contributions to the dbt-Dremio plugin, which is also open source. I hope I’m not offering anything, Fabrice, that’s too much. Thank you all so much. Merci.
All right. Thanks Fabrice and Jeremy. A great demo. Great presentation.
We’re going to go ahead and open it up for Q&A. I think we have time for just a few questions. A quick reminder. If you have a question, use the button in the upper right. [00:28:30] If you’re having problems with that, do go ahead and ask your question to the chat. We’ll try to get to as many as we can. All right.
So, let’s see some of the folks who are queued up for the chat… Click their cameras here, see if anybody’s still available. Well, it looks like some of them have dropped off. So, while we’re waiting for some folks to queue up here, I’m going to ask a chat question.
So, the first one’s from Bruno. Bruno’s asking so a pure SQL pipeline could be built and orchestrated only with dbt, [00:29:00] right?
Absolutely. In fact, dbt wants you to write all of your business logic in SQL. We think it’s important; it’s an opinionated stance we have. SQL is the lingua franca of data. It’s what’s most accessible to the greatest number of people at your organization. dbt is agnostic about its own deployment. It could be dbt run as a cron job. It can be an Airflow task. It can be scheduled through our commercial offering, which is dbt [00:29:30] Cloud, though not yet for some of the open-source plugins, like dbt and Dremio together. So, it really does give you total flexibility in how you want to ultimately deploy it. But yes, dbt can absolutely define an entire pure SQL pipeline. In fact, we do that dozens of times ourselves, and thousands of times per day those kinds of pipelines run in production for companies all over the world.
[00:30:00] Great. All right. It looks like we have a question from [Shinevasan Manchuuran 00:30:09]. Shinevasan, if you want to go ahead and unmute yourself, you can ask your question.
Well, it looks like he’s there, but quiet. All right, we’ll take one more from the video chat. [00:30:30] Nope.
In the meantime, I just want to say thank you to all of Fabrice’s fan club for being here. It was really great to see.
Yeah, that is fantastic.
In the chat, I didn’t know you came with an entourage, Fabrice.
Louise: All right. So, just a couple more questions, then.
Fabrice Etancha…: [inaudible 00:30:48]
Louise: Oh, go ahead, Fabrice.
I’ve been speaking of Dremio and dbt for more than three years to these people. And I think [00:31:00] it was very long for them.
All right. So, just a couple more questions before we wrap up here. A question about why Dagster instead of Airflow? Not sure of the familiarity you have with Dagster.
Louise: Not sure that’s… [crosstalk 00:31:21] answer.
Again, as I said… Yeah, as I said before, the open source tool, dbt, is really agnostic as to its own [00:31:30] deployment. It’s just something you can invoke from any command line. So, there are some pretty compelling integrations with Airflow as a deployment mechanism. And I know Astronomer, some of the folks who work on managed Airflow instances, have been working on a real first-class dbt integration. Also, the folks at Dagster have a real first-class dbt integration, and they’ve been publishing a bunch of blog posts in that vein. The appeal of using one of those pieces [00:32:00] of software with dbt and Dremio together in particular is that, because the plugin is open source and under active development, you really have total control over which versions you’re using, how you want to invoke it, and so on.
Okay, great. And we’ll take just one more question before we wrap up. Is it possible..? Let’s see, actually there was a good one here.
Can we download and use dbt Dremio in production?
Jeremy: [foreign language 00:32:29]? [00:32:30] Fabrice?
Yes. For uses like this, I think you can. There’s still a lot to be done on the table and reflection materializations. Then, there’s a lot of work on Dremio’s side related to [inaudible 00:32:57]. And I think we should [00:33:00] start from that solution instead of keeping with a standard bucket file format.
Yeah. So, as Dremio adds really exciting new features and functionality around being able to update tables in place and so on, I think that there’s more and more complex materializations that dbt can also then leverage. But yeah, I think that’s where… Go ahead.
Yeah. It’s [00:33:30] opening up a new field, to implement snapshots and incremental materializations also.
[foreign language 00:33:39]
I’m just seeing, Louise, a couple of questions that feel related in the chat that I could quickly address.
Louise: Yep, go ahead.
To Prakash’s question about what the cost is: free. It’s the cost of the infrastructure you need to deploy it. dbt is totally open source and Apache 2.0-licensed, as are all of the plugins. [00:34:00] What is available in the commercial version, dbt Cloud, is the most reliable and differentiated way to deploy dbt in production. So, it’s an interactive development environment, like a browser-based code editor, with the ability to run dbt models interactively, get full access to logs and artifacts, and just all the features that are being built by the same people who are building the open-source tool. The only limitation [00:34:30] is that as of today, only the core database plugins, so Redshift, Postgres, BigQuery, and Snowflake, plus Apache Spark and Databricks in preview, are supported by dbt Cloud.
That said we’re very, very excited to be adding many, many more this year. And so I think some are on the horizon. And if lots and lots of folks use dbt and Dremio together, it may very well be before the end of this year that you can run Dremio pipelines from within dbt [00:35:00] cloud.
All right, great. I think that’s all the time we have for questions today. Just a reminder, the speakers will be in Slack after this. So, if you still have questions, that’s a great opportunity to look for their channel. I believe it’s under Jeremy Cohen’s and Fabrice’s names in the Slack channel, so go ahead and search for them in there. In the meantime, the next sessions are about to start here. We have a session on data-.