Gnarly Data Waves
Episode 45
|
February 20, 2024
Next-Gen Data Pipelines are Virtual: Simplify Data Pipelines with dbt, Dremio, and Iceberg
Learn how to streamline, simplify, and fortify your data pipelines with Dremio's next-gen DataOps, saving time and reducing costs. Gain valuable insights into managing virtual data pipelines, mastering data ingestion, optimizing orchestration with dbt, and elevating data quality.
Traditional ETL processes are notorious for their complexity and cost inefficiencies. Join us as we introduce a game-changing virtual data pipeline approach with Dremio’s next-gen DataOps, aimed at streamlining, simplifying, and fortifying your data pipelines to save time and reduce cost.
In this webinar, you’ll gain insights into:
- Simplified Data Pipeline Management: How to use Dremio for data source branching, merging, and pipeline automation.
- Mastering Data Ingestion and Access: Learn how to curate data using virtual data marts accessed through a universal semantic layer.
- Better Orchestration with dbt: Discover the benefits of orchestrating DML and view logic, optimizing data workflows.
- Elevating Data Quality: Learn techniques to automate lakehouse maintenance and improve data integrity.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening: Traditional Chain of Data Movement Pipelines
Alex Merced:
With no further ado, let's begin our feature presentation: “Next-Gen Data Pipelines are Virtual: Simplify Data Pipelines with dbt, Dremio, and Iceberg.” The big theme here––and I will be your presenter today––is how difficult things are, and how easy they can be. I'm going to paint that picture for you. First, let's just talk about how things are. Right now, you generally have your data sources flowing through complex layers of data pipelines. You land that data in the data lake, and it's generally not just single pipelines––you might have chains of pipelines, because one data movement or one data transformation is dependent on another––so you end up having these large layers of changes. But then, still, a lot of your data ends up in the data warehouse for a variety of reasons and many different use cases. So you have an additional number of layers of pipelines. And that means more code you have to write, test, and deploy; that means more cost and compute, which means you're using up time, you're spending money, and you're creating duplicates of your data, all to serve that data to your clients. This can get expensive. And it can be better.
What’s the Problem?: Broken Pipelines that Require Tedious Backfilling, Angry Consumers from Late and Inconsistent Data
Alex Merced:
So what's the problem? The problem is, when you have more and more pipelines, all these layers of pipelines are more likely to break, and that's going to require tedious backfilling––having to figure out what the error is in that code, retest it, redeploy it. It can get tedious, and cause huge delays in getting the data to the people who need it. So people end up getting data that's late, or that's inconsistent because those problems weren't caught. And, well, that just leads to some bad insights.
Cost of these Movements
Alex Merced:
So on top of the lack of value from the time that's lost, and the lack of data quality that may result from too many complex data movements, you have the storage costs; you have the compute processing costs; you have the network and egress costs as you move data across different regions; lost productivity in all that time that is wasted; regulatory fees when you have PII that should not be there or is visible to the wrong people; data model drift, when you have 20 copies of data and things move away from the original data model; and the cost of those bad insights. What if you make bad decisions that have negative impacts on company value because you were looking at the data wrong?
The Dremio Approach (ZeroETL & Virtual Data Marts)
Alex Merced:
So Dremio tries to simplify this, because the idea is that if you have fewer pipelines, if you move the data less, you can curate the data faster and also have lower costs. So Dremio brings you this shift-left paradigm, where essentially we're going to focus everything more on the data lakehouse. So we're going to focus on operationalizing your data lake and then your long tail of other data sources, which may be some databases here and there, and connect all that data to Dremio. Generally, you would just land your data in a table format in your data lake––Iceberg and Delta Lake can both be read by Dremio––you land it there, you connect to Dremio, and then you're able to serve that data efficiently to your use cases without the additional movement into a data warehouse; it's delivered directly to your clients. And you get the benefit of federated data, so you can connect data sources directly to Dremio, such as databases. Some data you may not want to move directly into your data lake, or you may be getting it from a third-party share, using third-party sources of data that you can't move; you can connect those sources directly to Dremio. So that gives you a more zero-ETL/low-ETL approach. You get to curate a semantic layer that allows you to govern that data, control access to that data, and do column and row masking on that data, so that whoever has access to particular views of that data will only see the data they should have access to. And none of this requires you to do any additional movements or additional copies of the data.
Then, to optimize performance where optimizations may be needed, there are Dremio's data reflections. Figuring out when you need these is easier than ever: Dremio now has a reflection recommender that will help identify query patterns that could use a reflection, and what the specific reflection needed is. That creates a very powerful ability to manage your data lakehouse, because you don't have to move data much, and when there are bottlenecks, Dremio is going to help you identify those bottlenecks and recommend solutions to fix them.
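As a quick illustration of what a manually defined reflection can look like, here is a minimal SQL sketch. The table and column names are hypothetical, and the exact reflection DDL varies by Dremio version, so treat this as a shape to verify against the Dremio documentation rather than copy-paste syntax; the recommender can generate the equivalent for you.

```sql
-- Illustrative only: hypothetical table/columns, and the reflection DDL
-- should be verified against the Dremio docs for your version.
-- An aggregate reflection pre-computes rollups so BI-style queries
-- (e.g., total sales by region and day) can be accelerated.
ALTER TABLE sales
  CREATE AGGREGATE REFLECTION agg_sales_by_region
  USING DIMENSIONS (region, sale_date)
  MEASURES (amount (SUM, COUNT));
```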
Fewer Pipelines, Less Backfilling, Freshest Data Sooner, Data Consumers Getting Fresh, Correct, and Fast Data
Alex Merced:
So the benefits of this Dremio approach are fewer pipelines and less backfilling; fresher data sooner. Your data consumers are going to be much happier because they're getting that fresh data faster. It's going to be correct because you're going to have much more room to test and validate that data. And then you can also save the company money––fewer copies, less storage cost; fewer pipelines, less compute, less data movement, and less network cost. With less movement, the data is more likely to stay fresh. You're going to have fewer copies of the data overall. You're just going to have a better story.
Cost Benefits of this Approach
Alex Merced:
And if you want to see a really good example of this, a couple of episodes ago, Standard & Poor's discussed implementing a data lakehouse in their practice. They saw all these benefits––lower costs and an easier ability to deliver the data overall. There were a lot of positives. And again, you can hear it straight from them by going a few episodes back––actually, the episode previous to this one. From here, we're going to start showing you how some of these things work. But nothing beats getting hands-on. So if you scan this QR code here, it will walk you through a tutorial where you'll actually deploy Dremio right from your laptop using Docker, create your object storage with MinIO, and see a lakehouse firsthand. And if you want to get started with Dremio, you can always scan this QR code here, which will take you to our Get Started page, so you can do a production deployment, whether it's on Kubernetes or through Dremio Cloud with a cloud-managed deployment.
So with that, the first thing I'd like to demonstrate is the benefits of Dremio's integrated catalog, which is part of its lakehouse management features that allow you to do branching and merging. This is going to allow you to isolate ingestion, give you space to validate data before you publish it, let you roll back your catalog when disasters happen, and lots more. So, let's cue it up, and I'll see you right after this demonstration.
Demo 1: Data Lakehouse Management Features
Alex Merced:
So what I'd like to show you now is the epitome of our data lakehouse management features: how powerful the Dremio catalog is. Not only does it offer you automatic table optimization and automatic table cleanup––so that you just set it and forget it, and your tables are just going to work and be fast and nice and crisp––but you also get git semantics powered by the open-source project Nessie, allowing you to do branching, merging, and tagging of changes in your data catalog. That enables you to isolate ingestion work, create zero-copy cloned environments, and easily roll back when there is a disaster, and it makes your life a lot easier.
So we're going to walk through this query right here. But first, let's get it running. Just to give you an idea here, notice that I have these single table names. Well, I want it to be run from this particular folder right over here. So I'm going to go open that folder up in my catalog: february-get-started-gw. Now, what I would normally have to do is drag this over, which is still pretty convenient, and then I would get that whole fully qualified name there. But I don't want to do that––it's just a lot of typing––I would rather make it simpler and keep it like this. So how can I do that? Well, there is an option right here in the SQL editor that allows me to set the context for the query. So I can say I want this to be done in this folder. Now I can hit run on the query, and every query will be run in the context of that folder. So it is assuming that fully qualified path before the name of the particular table that I am working with, which is super convenient. Keep in mind, notice that I am running multiple queries in the same window, and how convenient it is that I don't have to run a query, wait for the run, then do another query and type it out––I can put them all in the same window. I can also create multiple tabs to work on multiple sets of queries at the same time. And I can always save those as scripts or views, giving you a robust, easy-to-use SQL editing experience in the Dremio user interface.
I'll click run so this begins running. But it's not just the Dremio interface that you can work from––Dremio has several different ways you can connect to the data, whether it's JDBC, ODBC, Apache Arrow Flight, or Dremio's REST API––so you can always connect to Dremio and have access to all your data for whatever that use case is. So this is going to be just about wrapping up. What we are doing in this query is creating two tables. These first two queries create a table and a staging table; the idea is that I want to ingest new data that's going to be in that staging table, but I don't want the data in that staging table exposed to my data consumers. So what I'm going to do is create a branch––well, first we're going to insert some data into that staging table so that we can ingest it––but then I'm going to create a branch. So here I see that I created a branch successfully, and then I'm going to switch over to that branch so that any future transactions occur in that branch, which means those changes are not going to be visible to queries that query the main production branch. So all these changes that I'm making to my data are not visible to people who are sending in their normal day-to-day queries, which means any data that isn't validated or isn't clean isn't going to be exposed to them. So you're not worried about sending them inconsistent data.
So now that I have done that, I can run my MERGE INTO statement, which is query number six––that's this query right here. Notice that as I click on the query tab, it highlights the query in the text, so I know exactly which query that represents. There are all these nice little quality-of-life touches. I can see that the MERGE INTO statement was successful: it merged three records, where each record was either inserted or updated. Then, after that, we're going to run some validation. First, we're checking whether we have any negative sales counts––in this case, we do have one. We're going to check whether any of the sales have incorrect dates, meaning dates in the future that wouldn't make sense––none had that, fortunately. So let's assume that we passed all of our validation checks and all of that was successful. If I didn't pass my validation checks, at this point I would go back, clean up my data, and do what I need to remediate it. But assuming it's successful, then what I can do is a merge. Before we merge, I want to show you that the data is not available in the main branch. So this first query, query nine, is a query on the main branch, and you see that the table is empty because we haven't integrated the new data yet. But if I click on query ten, which queries the branch, I see the three records here. So you see, I'm querying the same table; the only difference is I'm doing it on a different branch. Then, in query eleven, I'm going to go back to my main branch and run a merge statement, merging the commits from the branch we've been working off of back into the main branch. And now, when I query the table, whether I query it from the main branch or the branch that we created, the data is the same, because we pulled over those commits. So I've now published my data. The point is, it's much easier to do a write, audit, publish pattern––you can almost call it a branch, audit, merge pattern, or BAM––and basically do your ingestion work much more easily. This is all powered by the Dremio catalog, which is powered by project Nessie, which means it's not just Dremio that gets this advantage––any tool that can connect to a Nessie catalog can connect to your Dremio catalog. So you can take advantage of this merging and branching in Apache Spark, Apache Flink, and other tools that support Nessie catalogs. Basically, wherever you are doing your ingestion work, you can take advantage of this and still be able to see all those branches here in Dremio. So over here, when I take a look at this Demos catalog, I can see all the branches I created, and I can browse the data based on the particular branch and the particular commit, to see the data as it was in the past.
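For readers following along in text, here is a condensed sketch of that branch, audit, merge flow. The catalog name, branch name, and table columns are placeholders, and the branching statements should be checked against the Dremio/Nessie SQL reference for your version.

```sql
-- Condensed sketch of the branch-audit-merge (write-audit-publish) flow.
-- Catalog ("demos"), branch, and table names are placeholders; verify the
-- branching syntax against the Dremio / Nessie docs for your version.

-- 1. Work on an isolated branch so consumers on main never see raw data.
CREATE BRANCH ingest_feb IN demos;
USE BRANCH ingest_feb IN demos;

-- 2. Ingest: upsert staged rows into the target table on the branch.
MERGE INTO sales AS t
USING sales_staging AS s
  ON t.sale_id = s.sale_id
WHEN MATCHED THEN UPDATE SET sale_count = s.sale_count, sale_date = s.sale_date
WHEN NOT MATCHED THEN INSERT (sale_id, sale_count, sale_date)
  VALUES (s.sale_id, s.sale_count, s.sale_date);

-- 3. Audit: validation queries should come back empty before publishing.
SELECT * FROM sales WHERE sale_count < 0;              -- negative counts
SELECT * FROM sales WHERE sale_date > CURRENT_DATE;    -- future-dated sales

-- 4. Publish: merge the branch's commits back into main.
USE BRANCH main IN demos;
MERGE BRANCH ingest_feb INTO main IN demos;
```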
And Dremio provides a nice UI for auditing your catalog. So if I go here to my Arctic catalog––that is, these Dremio catalogs––I can go to any particular catalog, in this case Demos, and I can go to a particular branch. So I'll go, hey, I want to take a look at––this is the branch we just created. Let's say I want to look at the data here; I can choose which branch I want to look at the data in––I just choose this branch right here, and now I'm browsing the catalog as it was on that branch. I can see what commits have been made. So I can say, hey, I want to see the commits on this branch, because I switched over to that branch, and I can see those commits. I can see who made each commit. I can get the commit ID so that I can roll back to previous commits. So I get this nice UI that gives me visibility and observability into what's going on in my data catalog. It's pretty cool. So now I have that observability at the Dremio catalog level, which lets me know who made changes, and then I have the individual transactions that you get from an Apache Iceberg table's history, which gives you additional levels of flexibility with those tables. So overall, very, very cool.
So that's a bit of a tour of the Dremio Cloud lakehouse platform. It is very robust. There is so much more you can do, but the best way to see all of that in action is to get hands-on with it. So whether through a Docker container, through Kubernetes, or by creating a Dremio Cloud account, I recommend that you get hands-on with Dremio, try it out for yourself, apply it to one of your use cases with data that you want to work with, and see it in action––and I assure you, you will like what you find.
Demo 2: Learn How to Curate Data Using Virtual Data Marts Accessed Through a Universal Semantic Layer
Alex Merced:
Now, the next demonstration I'd like to talk about is our semantic layer. With the semantic layer, what you can do is curate virtual data marts. So instead of copying all this data into a data warehouse and creating more copies of the data within disparate data marts for your different business lines, you can do all that modeling directly on the data lake. You can create your snowflake schemas and your star schemas the same way you would in the data warehouse, but on your data lake, using Dremio's semantic layer. This can also be used to execute medallion architectures and to execute data mesh. So whatever architectural patterns you want to follow, you can do so in an easy, accessible way that's going to be self-service for your end users, and also easy to govern. So let's take a look at how that works.
Now, when you're looking at the Dremio UI, a cool thing you can do is take the queries that you plan on running multiple times and save them. I have two choices––I can save them as a script that I can run again (as you can see here, I have many different demo scripts saved so I can conduct demos on the fly), or I can save them as a view, so they show up like a dataset. This allows us to construct a robust semantic layer.
And just to show you an example, let's pretend that right here I have a tax collections data product. The idea is, here's where we're going to put all our tax collection data, and we're going to do a medallion-type architecture where we have our raw data in bronze. So if I go in, you see the purple icons; this means it's a physical dataset. These are the actual physical Iceberg tables with the actual data. Now, if I query one of these tables…I see there are a few records. And I notice that there's a null there. So there's some cleanup work that I need to do, because this is the raw data. If I go back, what happened was, in my silver folder, I did this cleanup. And notice, these are green icons, which means these are views––so these are not copies of the data, these are just logical SQL views on the data that I created, and I'll show you the SQL in a moment. If I run this, I'm going to notice that I had taken care of those nulls––and see, the nulls are gone. Then, as another layer of changes, I wanted to join the two datasets I had in there. So if I go back to my tax collections data, in silver I have business taxes and individual taxes, and I want to create a single view of all the tax data. So in our gold folder, I combined both tables. Now we have all the tax records, already cleaned up, because they come from the silver-quality records––basically, no nulls, all the nulls have been cleaned up. Now I have a unified view, but none of this created a copy of the data; these are all logical views. So I'm creating layers of logical views without duplicating the data, without increasing my file storage footprint. And if this were a really large dataset and I wanted to create a BI dashboard off of it, I could always enable reflections to get that sub-second performance on a BI dashboard. But just to show you the SQL behind this––and this makes crafting the data model very easy––if I go to the SQL editor, I have it saved as a script. The beauty of scripts is that I can easily search them: I can just type in "demo semantic layer," and I can see here how I created the data. All I had to do was create a view. I cleaned up the nulls using a couple of the SQL functions available in Dremio, and then I created the union view, the view in the gold folder, right afterward. I can curate our data model very easily using SQL. And the cool thing is that none of this is accessible to an average user: when you make a new user account, they don't see any of this––they don't see any of the sources, they don't see any of the datasets. If I want a particular user to be able to see the tax collections product, I have to go over here and give them access. What I would do is use SQL to grant access to individual datasets, to individual objects in my storage. So I can go over here, as an example, and change the settings, and from the settings we can handle privileges and give specific roles, or specific users, access to this dataset. And then also, we can do row- and column-based rules using user-defined functions. So you can allow users to see the data but not see the data they shouldn't be seeing, and mask certain rows and columns based on the rules that you create. So you can create a robust, secure semantic layer to access your data across all your sources.
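As a rough sketch of the SQL behind a layering like this (folder, table, and column names are invented for illustration, and the commented-out GRANT is only a pointer to the access-control side, so check the exact privilege syntax in the Dremio docs):

```sql
-- Illustrative sketch of the bronze -> silver -> gold layering.
-- Folder, table, and column names are hypothetical.

-- Silver: a view over the raw (bronze) Iceberg table that cleans up nulls.
CREATE VIEW tax_collections.silver.individual_taxes AS
SELECT
  taxpayer_id,
  COALESCE(amount_due, 0)     AS amount_due,
  COALESCE(region, 'UNKNOWN') AS region
FROM tax_collections.bronze.individual_taxes_raw;

-- Gold: a unified view over the cleaned silver views (still no data copies).
CREATE VIEW tax_collections.gold.all_taxes AS
SELECT taxpayer_id, amount_due, region, 'individual' AS tax_type
FROM tax_collections.silver.individual_taxes
UNION ALL
SELECT taxpayer_id, amount_due, region, 'business' AS tax_type
FROM tax_collections.silver.business_taxes;

-- Access control (sketch only; verify the GRANT syntax for your version):
-- GRANT SELECT ON VIEW tax_collections.gold.all_taxes TO ROLE analysts;
```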
So I can curate data across all my sources here, and make it easily visible and easily organized.
And not only am I able to create visual organization, but I'm also able to create documentation. For example, if I go over here to the gold folder and I go to that tax record set, let's say I want to create documentation. Right here is where you can create the wiki, in the details tab. You see this little pen; I can add some labels. But actually, what I want to do is edit the wiki. So I'm going to go over to the details section, and I can see that there's a wiki entry for this particular dataset. I can generate the wiki using Dremio's generative AI features. So right now, what it's doing is reading the dataset, reading its schema, and reading the data, to generate something informative about the information in this particular dataset. And once it's done generating, I can literally click the copy button, paste it right over, and save it. And now other users who are looking at this dataset can get at least a cursory idea––a starting place––for how the dataset is set up.
Demo 3: Better Orchestration with dbt: Discover the Benefits of Orchestrating DML and View Logic, Optimizing Data Workflows
Alex Merced:
And the thing is that orchestrating that semantic layer doesn't have to be done directly in the Dremio UI; you can do it pretty easily using tools that you're probably already familiar with, like dbt. So with dbt, you can run queries in Dremio to create datasets using the DML that Dremio supports on Iceberg tables, or to create views on your datasets, and take advantage of the benefits of using dbt, such as git version control, because it's all written as code, and having a declarative way of describing your modeling at your disposal––all working with Dremio, with all the other benefits that Dremio gives you as part of its lakehouse platform. So let's do a quick demonstration of that.
We're going to ignore that one because I already created one called Test Run, and I'll show you what's going on there––this one's already all configured and set up. The main thing you need here is really just your models. For the most part, unless you want to get more advanced with dbt, you can operate straight out of this models folder. Essentially, what's going to happen is you're going to have several different SQL files, and you can create more. By default, you'll have my first dbt SQL file and my second dbt SQL file. The way each of these works is that they have some bit of SQL in them, and when you run dbt, it runs that SQL, and the output of that SQL command is created as a view in Dremio. So if I go back to Dremio for a moment, you'll see it here: if I go to the dbt practice folder, on January 9, 2024, you'll see the output of my first dbt model came out there––there it is again––it's just the number one, so that's just going to come out right there…oh, whoop, back over here…dbt, January ninth, my first model, run that again…and that's just one column, one row, with the number one.
And then this one, same thing––it's just going to create another view called my second dbt model, because that's what I named the SQL file, but see, this one has that ref. So you see, it's referring to the previous model. In this case, it knows it should not run until that other model has been run first. So theoretically––here, see, I have my third model, which refers to this weather data. So, for example, I can go here. If I go to my dbt practice folder, I have some weather data in here. So let's just run a query on that data first, to show you…I'm just closing all these tabs…I'm just clicking it this way…weather.
Now, if I run right here on the weather data, you'll see this weather data show up. And when I run that third model, it's essentially just creating another view of the same data. But let me create another view that depends on that one. In that one, we just want the date and the precipitation––that's it. So what I'm going to do is create a new file––we'll call this my fourth model.sql––and essentially the two things we want to select are date and precipitation (prcp). So: date, prcp from––but I don't want it from the weather data per se. In my third model, I used ref, and I can reference anything in my Dremio account, so I was able to just reference that weather dataset and say, hey, I just want this view to be a select all from that dataset, and you saw it was created. But now, with this fourth model, I'm going to use ref and reference my third model, because that's what we called it there. And just to compare syntax with the existing ref…I see that it uses single quotes, so just to be safe, I will use single quotes as well…ref, my third model…and then that'll be what that model comes out to. I'm just doing a quick double-check on everything; otherwise, it all looks good. The idea is that I would create all my models, and then––well, first let's run it. So let's do dbt run. I'm going to do that in the right folder, so right here in the test run folder. I do dbt run, and it's going to run all the models that are in the folder. So right now, it's detecting all the models, so it's going to take a moment to run…Just a clarification: I made a quick typo. I forgot that since date is a reserved word, I should have put double quotes around it. So I'm going to run that again. Here we go, let's try that again: dbt run. Awesome. Now it ran successfully, and I can see that it detected the four models, ran each model, and created the right output. So now, if I go back to Dremio, I should be able to go back to my data view, go back to dbt practice, go back to January 9, 2024, and see, there's my fourth model––and my fourth model should only have those two fields, date and precipitation. And yep, it only has date and precipitation. So how cool is that?
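To make that concrete, here is roughly what such a model file looks like; the file name and the referenced model name are assumptions based on the walkthrough above, not the exact files from the demo.

```sql
-- models/my_fourth_dbt_model.sql  (file and model names are illustrative)
-- Selects just the date and precipitation columns from the third model.
-- The ref() call below declares the dependency, so dbt builds the third
-- model first; "date" is double-quoted because it is a reserved word.
select
  "date",
  prcp
from {{ ref('my_third_dbt_model') }}
```

Running dbt run from the project folder then builds the models in dependency order, and adding the snow column later is just one more line in this file.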
Now, what happens if in the future I have a data consumer who is using this view, and they're like, I need another column, Alex––they could use another one of those columns. Well, let's choose one of the other columns in the weather data––let's choose snow. Maybe they want that snow column. Well, all we have to do now is head back to our dbt model and add snow. It's as simple as that to fulfill this data request, to add a column that wasn't there before. And by reading this query, I know that it refers to a different model, so I can go check to make sure that the snow column is available in that model, making it easier for me to catch where I need to make these changes. And all of this I can version control with git, because it's code. Then I can just run dbt again, and it adds that column. So see, that has now run successfully. Now all I have to do is head back––let's go back to our data explorer, back to dbt practice on January 29th––and let's see: does my fourth model now have the extra snow column? And it does! So it becomes as easy as that. As the data consumer, I'm just like, hey, cool, the column was added to the view that I've been working with. As the data engineer, all you had to do was add one word for one column, and everything stays easily trackable because you can track it through code version control.
Demo 4: Elevating Data Quality: Learn Techniques to Automate Lakehouse Maintenance and Improve Data Integrity
Alex Merced:
And last, but not least, there are all sorts of different ways for you to do data quality management––you could take advantage of some of the tests that are in dbt, you could use libraries like Great Expectations, or you could just use the Dremio platform to create things like constraint tables that capture your business logic, so you can test your tables against them. We'll do a quick demonstration of how something like that would look for managing your data quality.
Constraint tables––the idea is that you have a table that lists different business constraints. So first, let's walk through the SQL. We have a table of product data, and we may have some rules about things that should be true of that product data that we want to validate the data against. We track these truths in a business constraints table. The idea is that these are rules that should apply to the data, and because they're in this other table, we can use them to check our data. So in this case, we're going to insert some prices for our products here––we insert some products there––and now, there is one price that's potentially out of range: you see, this one has a price of 105. Then we insert some rules. For our first rule, our business constraints table has a key––that says what the rule is––then potentially two values that can be used to validate the rule, and then maybe a note with some explanation of the rule. The first rule we're inserting is a price range, saying none of our products should be outside of the price range of 10 to 100––that's the valid price range for our products. The second rule we have is days before sale. The idea is that if a product has been on the shelf for more than 90 days, it should be on sale; if it's been on the shelf for longer and it isn't, we'll ask ourselves why it's not on sale. So those are some of the business rules we have in there, and then we're going to validate against those rules. In this first query, what we do is select our products and check: hey, give us back any products that are outside of our price range, so that we identify products that may need some remediation, some fixing. Then we do the same thing here, where we check: let's take a look at the number of days between when the product was added to the shelf and now––basically the date added and the current date––and if that exceeds the number from our business constraints, the 90 days, and on sale is false––so it's over 90 days but also not on sale––well, we want to see that, because those are products we want to revisit. By using this business constraints table, I can test this and other tables against rules that I want to make sure I apply. So it's one way to help with––not your typical types of validation, null checking and whatnot––but checking things that are more about your business-specific rules than just natural data quality issues. So in this case, let's run the query, hit run…oop…I forgot to set the context, let me just cancel that. Let's set the context to our January folder, and let's run that again. And it's running, so it's going to be creating our tables again.
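Here is a rough sketch of what that constraints pattern can look like in SQL; the table and column names are made up, and the date function may need adjusting for your engine.

```sql
-- Illustrative sketch of a business-constraints check; names are made up
-- and the date arithmetic may need adjusting for your SQL dialect.
CREATE TABLE products (
  product_id VARCHAR, price DOUBLE, date_added DATE, on_sale BOOLEAN
);
CREATE TABLE business_constraints (
  rule_key VARCHAR, value1 DOUBLE, value2 DOUBLE, note VARCHAR
);

INSERT INTO products VALUES
  ('product_a', 25.0,  DATE '2023-10-01', FALSE),
  ('product_c', 105.0, DATE '2023-09-15', FALSE);

INSERT INTO business_constraints VALUES
  ('price_range', 10, 100, 'Valid product price range'),
  ('days_before_sale', 90, NULL, 'Items older than 90 days should be on sale');

-- Products priced outside the allowed range.
SELECT p.*
FROM products p
JOIN business_constraints c ON c.rule_key = 'price_range'
WHERE p.price < c.value1 OR p.price > c.value2;

-- Products on the shelf longer than the threshold but not on sale.
SELECT p.*
FROM products p
JOIN business_constraints c ON c.rule_key = 'days_before_sale'
WHERE DATEDIFF(CURRENT_DATE, p.date_added) > c.value1
  AND p.on_sale = FALSE;
```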
You can always find the code for all of this in the link in the video description, along with a repository that has a lot of other great resources, like snippets to help you with using Apache Arrow in Python, and exercises so you can try out stuff on your laptop. Lots of really great resources in that repository. But with that, it looks like the query is wrapping up. So let's go here: we created the table. Then we created another table––that's our table for business constraints. We inserted our product data. We inserted our first constraint and our second constraint. And then we checked to see, hey, what is outside of the range, and we got that product C is outside of our price range. Awesome––we can go ahead and address that or fix it. And then we have here that product A and product C have both been on the shelf for longer than 90 days but, for some reason, are not on sale. So now we've been able to check these things that are part of our business rules to help monitor business activity better. So I'll see you guys in the next one, have a great one, and ciao.
Closing
Alex Merced:
Now, hopefully you enjoyed these demonstrations showing you how you may be working with Dremio, and the different ways you can do so. Again, the best way to see this firsthand is to get hands-on with it. We have plenty of videos on the dremio.com YouTube channel showing you how to do different things, like connect to Superset, connect to Tableau, pull Dremio data into a notebook, et cetera. But the first thing you need to do is build your lakehouse––so if you want to build your prototype lakehouse, scan the QR code there on the left to get started building a lakehouse on your laptop that you can try out and experiment with. And when you're ready to build that production lakehouse and get all those business outcomes that we've discussed in this presentation, scan the QR code on the right, and you can go there. Thank you for listening to this presentation. My name is Alex Merced, I'm a developer advocate here at Dremio, and our next step is to do some Q&A. So I'll see you there.
Q&A
Alex Merced:
Hey, everybody, welcome back! So again, now it's Q&A time, so put any questions that you have about anything throughout the presentation in the Q&A box; I'll be keeping an eye on it. We do have one question already ready to go, so we'll be answering that question in just a moment. Just some quick housekeeping: again, you can find full versions of that dbt video and more on our YouTube channel, youtube.com/dremio. This presentation will be posted––the HD recording, the slides, the whole deal––over at dremio.com/gnarly-data-waves within 24 to 48 hours. You should be getting an email with a lot of this information after the presentation, but I just want to make sure you're all aware of that.
But our first question is from Kyle Broadbeck––thank you: how does this compare in concept to Microsoft Fabric, which tries to achieve the same unified approach with OneLake? The bottom line is, we're not all going to have our data in one place. So basically, everyone's starting to move in that unified direction. Dremio has been doing this for a long time, having this unified lake approach where you connect all these data sources. But one of the biggest high-level differences is that, one, Dremio is multi-cloud. Dremio makes it easy to work across multiple clouds and on-prem environments––so when you have environments in different clouds, or cloud and on-prem environments, and you need to unite them, Dremio is going to be an effective solution. Two, Dremio is built primarily around Apache Iceberg, while right now Fabric and OneLake are built around Delta Lake. Dremio supports reading Delta Lake tables, so you can have Delta Lake as part of your Dremio lakehouse as well. But if you're leaning more in the Iceberg direction, Dremio is going to provide you with a much more robust solution at the moment, as far as an Iceberg lakehouse goes.
Now, other things to consider: being agnostic––the idea that you're independent of your cloud provider––so that one day, when you decide to switch cloud providers, Dremio goes where you go, which is nice. And again, it always keeps everything in open formats and tries to do everything the standard way. Using that Nessie catalog allows you to take your catalog across different tools, which is nice. That way, the semantic layer and all that data you curate are not just in Dremio––it's all accessible in Spark, it's accessible in Flink; it becomes much more open-ended when you're using a Dremio lakehouse. So those would be the main distinctions, the top high-level distinctions––I mean, there are also differences in functionality. But there are places where they can play together very well, because Dremio can connect to Azure data sources, so you can have a OneLake data lake and use the OneLake services, but still have Dremio connect and see all the data sources that you're curating on OneLake, because it can connect to your ADLS storage and so forth. So they can share and work together––oftentimes things are better together.
The next question is: does Dremio support programming languages other than SQL? At the moment, the SQL query engine natively accepts SQL commands––so what does it natively understand? SQL. But you do have a REST API, so theoretically you can use any programming language to automate what you're sending over to Dremio through REST API calls. You would inevitably send a SQL command to Dremio, but the scripts and abstractions that you build around it can be written in any language. In the future that might expand, but one day at a time. Right now, Dremio has a REST API that lets you interact with your Dremio cluster from any language, because it's just basic HTTP calls. You also have a JDBC/ODBC interface, which can generally be used in any language, and an Apache Arrow Flight interface, which can be used in any language. The actual instructions you send to the engine will always be SQL, but the scripts that you use to automate, and maybe curate the data ahead of time before generating that SQL, can be in any language. For example, in Python there's a library that I made called Dremio SimpleQuery that makes working with the Apache Arrow Flight endpoint much, much easier––being able to connect and send SQL pretty easily. But yeah, that's the state of things––there may be advancements in that area in the future, but we'll have to see what the future holds.
So, to summarize: the biggest thing about Dremio versus any cloud-vendor-specific lakehouse solution is that it is cloud-vendor-agnostic, so it can play well with your particular cloud vendor's tools, but it doesn't lock you into those tools either. And when you have multiple environments––some people might have AWS and Azure; some people might have AWS, Azure, on-prem, and Hadoop––Dremio is going to allow you to pull that together and deliver all your data from one place, with all those tools that we talked about in this video. Again, put any questions you have in the Q&A box, and I will be here to answer them. I'll give us a few more moments for any remaining questions.
And next week we will be doing a getting-started seminar, so if you haven't gotten started with Dremio and want to see an additional demo, next week's demo is going to be neat. I'll be taking the seat of a data engineer fulfilling data requests: I'll have a Trello board where I'll take a look at different data requests that have been coming in and work my way through them, using the Dremio platform, to see how it makes the data engineer's and the data analyst's lives a lot easier. Many of the things that we've seen today, but we'll be doing that during next Tuesday's episode of Gnarly Data Waves. We have so many other episodes of Gnarly Data Waves coming up, and also a lot of new content on the Dremio YouTube channel. We've been putting out a video almost every day, whether they're little educational bits or demonstration bits, so if you haven't been checking out the Dremio YouTube channel and the Dremio blog, there's a lot of new content that's come out over the last few weeks on both that can be very, very valuable. I highly recommend taking a look at it, and not just for evaluating whether you want to use Dremio, but for a better understanding of the lakehouse space and the things you can do when you're taking advantage of lakehouses. So with that, again, my name is Alex Merced. I see no further questions, so I will see you all next week for another episode of Gnarly Data Waves. Do make sure to subscribe to Gnarly Data Waves on iTunes or Spotify so you never miss an episode, and subscribe to the Dremio YouTube channel at youtube.com/dremio. Otherwise, see you next week! Have a great one.