As a BI engineer, you are tasked with developing a new dashboard for the revenue unit of your organization. You have defined all the wireframes, discussed the key metrics with the business stakeholders, and all you need now is access to the data to start working.
Traditionally, your organization’s data flow architecture looks something like this:
Your company’s data first lands in a cloud data lake such as Amazon S3 (shown at the bottom of the diagram). You then submit a ticket for the data engineering team to help move the data for your reporting. Depending on the team’s workload, at some point they will move the data via ETL pipelines into a data warehouse. You need access to multiple databases within the warehouse, so you make another request for that data. And finally, yet another copy of the data is requested for the business unit specific to your analysis.
Imagine you have to run your analyses on a dataset with millions of records. You also have to consider the performance aspect of the dashboard. So, you might create some cubes or extracts depending on the BI tool you plan to use, resulting in additional data copies. This is not an unusual way of working. In fact, a lot of BI engineers can relate to this scenario.
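To make the copy proliferation concrete, here is a tiny, hypothetical sketch. The dataset size and hop names are made up for illustration; the point is simply that every hop in the traditional flow stores another physical copy of the same data:

```python
# Hypothetical traditional flow: each stage holds another physical copy
# of the same dataset. Size and stage names are illustrative only.
dataset_gb = 500
stages = ["data lake (S3)", "data warehouse", "business-unit copy", "BI extract/cube"]

total_gb = dataset_gb * len(stages)
print(f"{len(stages)} physical copies, ~{total_gb} GB stored for one {dataset_gb} GB dataset")
```

Four copies of a 500 GB dataset means roughly 2 TB of storage for a single source of truth, before counting the pipelines that keep those copies in sync.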
So what are the problems with this approach? The non-trivial ones relevant to this blog are the long wait times before an analyst can even touch the data, the proliferation of data copies at every hop, and the extra performance tuning those copies demand.
While the traditional way of working with ETL pipelines and data copies is the norm, there is a newer and better approach that offers significant advantages to BI engineers and to organizations as a whole.
It’s likely your organization stores most of its data in a cloud data lake, so why not point a BI tool like Tableau directly at the data lake? You may be asking how that is feasible.
This is where Dremio comes in. It is an engine for open lakehouse platforms that provides data warehouse-level performance and capabilities directly on the data lake. It sits between the data stored in the lake and the client engine (as shown in the image above). Dremio also connects natively with BI tools such as Tableau, so they can operate directly on the data lake in live query mode. And importantly, it is fast and comes with a free standard cloud edition.
Let’s walk through a quick tutorial to demonstrate how fast it is to build a dashboard in Tableau using the data stored in our data lake.
As seen in the snippet above, there is a preview of our dataset and all the available options.
This action downloads a small Tableau data source (TDS) file. Note that this file doesn’t contain any data; it holds only the information needed to connect to the actual data. It lets Tableau authenticate with Dremio and provides a seamless experience.
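To make that concrete, here is a minimal sketch of what a TDS-style file looks like and how little it actually contains. A TDS file is XML describing the connection, not the rows themselves; the element and attribute values below are simplified stand-ins, not the exact schema Tableau emits:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TDS-style XML. The real file Tableau downloads
# has more attributes, but the point is the same: it describes *how* to
# connect (server, table), not the data itself.
tds = """<datasource version="18.1">
  <connection class="dremio" server="sql.dremio.cloud" port="443">
    <relation name="migrated_data" table="[Samples].[migrated_data]" type="table"/>
  </connection>
</datasource>"""

root = ET.fromstring(tds)
conn = root.find("connection")
print(conn.get("server"))                      # connection metadata only
print(conn.find("relation").get("table"))      # which table to query live
```

Because the file carries only connection metadata, it stays tiny regardless of how large the underlying dataset is.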
To get an idea of the total number of records in this dataset, let’s drop the “Migrated data (Count)” field onto the Rows shelf in Tableau. The figure above shows that we have close to 32 million records.
The critical thing to note here is that every analysis run in Tableau is actually a live query back to Dremio. Recall that Dremio sits between the data lake and the client. It takes the query against this 32-million-row dataset, processes it, and returns the result to Tableau in under a second. The GIF above illustrates how fast this is directly on the data lake. Dremio achieves this sub-second performance using a feature called data reflections.
After adding some more analysis, our final dashboard looks like the following:
Dremio provides a new way to deal with some of the challenges of traditional data architectures. It reduces the wait time for analysts and data scientists, helps cut down on data copies where possible, and delivers strong performance while storing data in open formats so it can be leveraged by multiple engines depending on the use case. Learn more about connecting Dremio Cloud with Tableau here.