Migrate a BI Dashboard to Run Directly on Your Cloud Data Lake

Jason Hughes

To jump straight to the tutorial steps click here.

Introduction

Every organization strives to be data-driven and enable their data consumers to make evidence-based decisions in order to answer important business questions. Further, as the amount of data, questions and data consumers grow, the only way for an organization to become and/or stay data-driven is to provide these capabilities in a self-service manner.

One common method that allows data consumers of all kinds to get answers to their business questions in a self-service way is through business intelligence (BI) dashboards. BI dashboards are widely used throughout organizations to provide users of all technical skill levels with insights into the business and answers to their business questions in an intuitive and effective way.

However, while dashboards help democratize analytics across the organization, benefiting individual users as well as the entire business, there are limitations that arise when they’re actually implemented across organizations today, which have serious impacts on the business.

The backend architecture that provides users with a dashboard traditionally looks like this:

There are several issues related to this approach, which are exacerbated by the tens, hundreds or even thousands of dashboards used in an organization:

  1. Slow turnaround time for changes
    1. With each dashboard change often requiring updates to the BI extract, data warehouse structures, and the ETL processes, multiplied by the countless number of dashboards used throughout an organization, data engineering’s backlog grows tremendously and causes long delays in fulfilling answers to business questions.
    2. Some organizations we work with say that even minor dashboard changes can take three weeks or longer to address using this architecture.
  2. Data size limitation
    1. The user has access to much less data than they often need, both in terms of the granularity of the data (e.g., month versus day) as well as the history of the dataset (e.g., 1 month versus 3 years).
  3. Dashboard response time slows as data size and complexity grows
    1. This frustrates users and lowers the number of questions data consumers are able and/or willing to ask.
  4. Data drift and regulatory compliance issues
    1. The extra data copies in the data warehouse and BI extract (and potentially data marts in between) cause issues with data drift and regulatory compliance.
  5. Operational cost
    1. Data engineers have to create and manage the ETL pipelines, data warehouse structures and BI extracts, increasing operational costs.
  6. Infrastructure cost
    1. You have to pay extra for ETL compute resources, data warehouse compute, data warehouse data storage, and BI extract resources.
  7. The BI extract is exclusive to that BI tool
    1. Users who use other BI, data science and SQL client tools in the organization aren’t able to leverage the optimization, and instead have to create their own optimizations, further exacerbating data drift, regulatory compliance and infrastructure cost problems.
    2. Even if the organization uses a single BI tool today, you’re locked in to that dashboarding tool, prohibiting the option to allow users the freedom to choose other BI tools or migrate to another BI tool in the future.

With all of the downsides to this approach, why is it such a common architecture for dashboards? The primary reason is because we were limited by the capabilities of the tools at our disposal. Not anymore.

Organizations use Dremio every day to provide dashboards to data consumers without the above limitations. With Dremio, the architecture to provide a dashboard looks like this:

This blog post demonstrates how simple it is to migrate a dashboard that’s backed by a traditional architecture with all of its downsides to that same dashboard supported by an open data lake architecture powered by Dremio. We’ll go step-by-step through how to take an extract-based dashboard and modify it to run off live queries directly against your data lake storage.

These are the steps that will be covered to perform the migration:

  1. Launch Dremio
  2. Link Dremio to the dataset
  3. Add Dremio as a source for the dashboard in the BI tool
  4. Replace the dashboard’s extract data source with Dremio
  5. Optimize dashboard performance

All the content and resources in this blog post is publicly accessible, so you can follow along step by step.

Alternatively, you can use one of your organization’s dashboards and get it running on Dremio using the same instructions.

This tutorial uses Tableau, Dremio’s AWS Edition, and S3, however, the process is generally the same when using different BI tools (e.g., Power BI) and different data lakes (e.g., ADLS), although the exact steps may vary slightly depending on the BI tool.

This tutorial uses the following dashboard that analyzes a dataset of taxi rides in NYC to provide business KPIs:

From a business perspective, week-level granularity is okay for the bottom and right-hand side charts, however, it would be much more valuable to inform our decisions to use data at a day-level granularity.

Further, because this dashboard is based on an aggregation table, the specific computations used are hard-coded in the physical structure of the table. For example, the dashboard uses the sum and average of the Total Amount field. If business requirements were to change, as they inevitably do, to need the minimum and the maximum of the Total Amount field, many updates would be needed.

In the traditional architecture currently backing the dashboard, these types of changes generally entail the following steps:

  1. End user raises a ticket with the request.
  2. When data engineering is able to pick up the request, they modify the ETL jobs to aggregate the data at the day level instead of week, and add new max and min fields for Total Amount.
  3. If the existing dashboard process can’t experience downtime, a new aggregation table is created in addition to the existing one.
  4. The new ETL jobs to populate the new aggregation table are tested. Once they are tested and validated, the jobs are scheduled.
  5. The dashboard is then modified locally to use the new aggregation table, changing the data source as well as the field references in the dashboard to look at the new day-level granularity field and the new max and min fields. This step is often handled by a different team.
  6. Once the dashboard change is complete, the dashboard is published along with the creation and scheduling of the new extract on the new aggregation table.
  7. After QA and some level of UAT is complete, anything pointing to the existing dashboard is changed to instead point to the new dashboard.
  8. After sufficient production usage, the old dashboard, extract, aggregation table, and ETL jobs are retired.

For many organizations, this process can take three weeks or more. Below, we’ll demonstrate how a dashboard backed by Dremio is much simpler and faster to change.

Now, let’s dive in and walk through the steps to get a dashboard running on a modern data lake architecture powered by Dremio.

Prerequisites

If you don’t already have a Dremio deployment running, the easiest way to get one is via Dremio’s AWS Edition, which will have you up and running with Dremio in five minutes. Here is a link to the marketplace listing to launch Dremio in AWS, a video walkthrough of the steps, and step-by-step documentation.

Dremio can also be deployed via Amazon’s EKS, on Microsoft Azure via AKS or ARM template, on Hadoop via YARN, or even as standalone processes.

While Dremio AWS Edition is recommended for this walkthrough, any of these can be used. If you’re using the packaged workbook linked above and the backing data, it’s recommended that you deploy it in or near AWS’s us-west-2 (Oregon) region because the sample data used is in S3 in us-west-2, and deploying closest to the data gravity is best practice from an architectural and performance perspective.

First, you want to point Dremio to the dataset in the cloud data lake that will be used for the dashboard. In this instance, the Samples source in Dremio is an S3 bucket.

  1. Add a new data lake source and select the Sample Source option.
  2. Select the samples.dremio.com bucket.
  3. Click the promote icon on the right side of the page for the NYC-taxi-trips folder.

    Add Dremio as a Source for the Dashboard in the BI Tool

  1. Open the dashboard in Tableau Desktop.
  2. Select New Data Source and choose Dremio Connector by Dremio when using the Tableau SDK connector, or Other Databases (ODBC) when using the Dremio ODBC driver.
  3. Fill out the connection information:
    1. If using the Dremio ODBC driver, select Dremio Connector as the driver and enter 31010 as the port.
    2. For server, enter the IP that you’re accessing the Dremio UI from.
    3. Enter the username and password you created when launching Dremio.
  4. On the data source page in Tableau, select DREMIO as the database and Samples.samples.dremio.com as the schema. Drag the NYC-taxi-trips dataset into the canvas.

Replace the Dashboard’s Extract Data Source with Dremio

  1. Go to the sheet Number of Rides by Month.
  2. Right click on the NYC_TAXI_TRIPS_WEEKLY_AGG data source and select Replace Data Source. The default options will replace the dashboard’s existing data source (an extract from the cloud data warehouse) to instead fire live queries to Dremio on data in the data lake. Click OK.

    When you do this, you’ll see red exclamation marks on some old fields in the left sidebar as well as references to those fields in the canvas shelves in red. This is Tableau’s way of reporting that those fields don’t exist in the dataset in Dremio, which is expected. The logical NYC trips dataset doesn’t really have an Avg Total Amount field, it just has a Total Amount field, from which you want to take an average.

    This is a good example of the limitations of the aggregation table. We want access to the Total Amount data itself, then we can do average, sum, count and any other calculation we want to perform. We just need to tell Tableau that Avg Total Amount and Sum Total Amount are really one field in the data (Total Amount) with average and sum aggregations applied to it, respectively.

  3. For each field with a red exclamation mark in the left sidebar, right click on it, select Replace References, and choose the non-aggregation-table field provided by Dremio.
    1. For example, replace Avg Total Amount with Total Amount, Avg Fare Amount with Fare Amount, and Pickup Week with Pickup Datetime.
    2. Ensure all six of the fields that had red exclamation marks have their references replaced (i.e., there are no more red exclamation marks).
    3. Note that because Tableau doesn’t allow you to replace references to a data field with references to a dynamic field, you can’t use Replace References for Num Rides with the right field NYC-taxi-trips (Count), so you need to change it in the two places it’s used.
    4. In the worksheets Number of Rides by Month and Total Number of Rides, drag NYC-taxi-trips (Count) over the red box referencing SUM(Num Rides) to replace it. You can then right click on Num Rides in the left sidebar and click Delete.

Now the dashboard is running off live queries directly on data in your data lake, without having to create and manage any ETL jobs, data warehouses, or BI extracts. You can modify the pickup time filter to see the dashboard change based on these live queries. Each of these live queries analyzes a third of a billion rows in order to serve the dashboard.

Notice that the response times on the dashboard range from about three to ten seconds after a couple queries, depending on how many dates are included in your filter, which is pretty good for analyzing a third of a billion rows. This performance is sufficient for some dashboards, depending on the performance goals.

However, it’d be even better to provide users with a sub-second experience, so they can ask more business questions and have a truly interactive experience.


Optimize Dashboard Performance

Further optimization of dashboard performance is very simple with Dremio, both for data providers and data consumers, using a Dremio feature called data reflections.

At a high level, you can think of a data reflection as a hybrid between a materialized view and an index. It does precomputation similar to a materialized view, however, it substitutes the optimized structure transparently to users, much like an index, so users and applications can operate at the logical and semantic level that they want, while still experiencing interactivity.

  1. View the NYC-taxi-trips dataset in the Dremio UI, then select the Reflections tab of the dataset.
  2. You have the option to create a raw or aggregation reflection. Because this is a dashboard that does aggregation, as most dashboards do, you want to create an aggregation reflection for optimization.
    1. You need to specify which fields are dimensions (what fields you use individual values for) and measures (what fields you use aggregated values for). Based on these straightforward guidelines, pickup_datetime is a dimension and fare, tip and total amounts are measures.
    2. Dremio makes recommendations as to which fields it thinks are dimensions and which are measures based on data type and distribution. In this case, you just want to remove a couple fields that aren’t used in the dashboard to lower resource usage: Remove passenger_count from dimensions and trip_distance_mi from measures by clicking on the minus icon next to the field names.
  3. Enable the aggregation reflection by toggling it on.
  4. Click Save.
  5. (Optional) Click History to view the details and status of the reflection creation job. It will take approximately 10 to 20 seconds to complete.
  6. Go back to the dashboard and play with the filter.

    Notice the dashboard response is almost instantaneous, allowing the dashboard user to have a truly interactive experience and ask business questions at the speed of thought.

    Also, note that the performance of an application was optimized without modifying any of the application logic or code. This is a powerful capability that greatly improves time to insight by decoupling application logic from performance optimization.

  7. (Optional) Open the jobs page in the Dremio UI to see the three queries that Tableau sent to Dremio for each change of the filter. Notice that Dremio responded to all of them in parallel in less than a second. In fact, each of the queries completed in 400 milliseconds or less.

Note that if you wanted to optimize this dashboard with other data lake SQL engines, you would have to create aggregation tables in the data lake to achieve interactive performance. However, this raises the same problems with ETL and data copy management that we were trying to avoid by moving away from the traditional architecture in the first place.

Modify the Dashboard to Show Day-Level Granularity Instead of Week-Level

As noted earlier, week-level granularity is okay for the bottom and right-hand side charts, however, what would be much more valuable to inform our decisions is to see the data at a day-level granularity.

You may have noticed when we went through replacing the extract data source that the field in the aggregation table in the data warehouse had the date field as pickup_week rather than pickup_datetime. Because of that, we weren’t able to access the date-level granularity.

Now with Dremio accessing the raw data, we’re able to easily change the dashboard to display date-level granularity.

  1. In the Average Fare and Average Tip worksheet, right click on WEEK(Pickup Datetime) and select the Day which has 8 in gray on the right side—this is the day of month number aggregated across all years and months.
  2. In the Average and Sum Total Amount Last Month worksheet, right click on WEEK(Pickup Datetime) and select the Day which has May 8, 2015 in gray on the right side—this is the individual date.

Now go back to the dashboard view, which now presents day-level granularities in the two charts. This gives you a much more actionable view of the data.

Benefits of This Approach

This architecture eliminates or mitigates all the problems of the traditional architecture:

  1. Fast turnaround time for changes to dashboards and business questions.
  2. No data size limitation due to access and optimization being based on a cost-efficient scale-out platform.
  3. Dashboard response time remains interactive as data size and complexity grows. This is due to access and optimization being based on a performant scale-out platform.
  4. Minimal data drift and regulatory compliance issues since derived data structures are managed by the platform, which is similar to how an index doesn’t cause data drift or compliance issues.
  5. Large reduction of operational cost due to the platform managing the performance optimizations.
  6. Large reduction of infrastructure cost due to access and optimization based on a cost-efficient platform analyzing data directly in the data lake.
  7. Centralized optimization regardless of the client tool used currently and in the future, whether it’s multiple BI tools, SQL clients, custom applications, Python notebooks, etc.

Conclusion

Migration of a dashboard backed by a traditional data warehouse-based architecture to an open data lake architecture powered by Dremio is straightforward and provides many benefits for individual users, business units, and the broader analytics landscape of an organization.

Looking for more info or help from a Dremio expert? Feel free to post on the community forum or reach out to us at hello@dremio.com.