
Joining Azure SQL Data Warehouse and Azure Data Lake Store for Power BI

Intro

Microsoft Azure is an open and flexible cloud computing platform that provides an extensive set of databases, cloud storage options, and services for developers. These services range from small offerings for personal use to powerful resources designed primarily for large companies. Microsoft Azure is one of the most popular platforms thanks to its rich functionality and strong performance.

In this post, we will show how easy it is to use Dremio to combine data from Azure Data Lake Store (ADLS) and Azure SQL Data Warehouse, and then visualize the results with Microsoft Power BI.

Assumptions

This tutorial is a continuation of the Unlocking Azure Data Lake Store for Power BI tutorial, so we assume that you have Dremio and the Dremio ODBC driver installed; if not, go to Dremio’s download page, pick the installation for your operating system, and visit the documentation to read about installation and deployment. You must also have a Microsoft Azure account to follow this tutorial; if you don’t have one, you can use the free one-month trial. Finally, you must have Power BI installed and configured.

Loading data into Azure SQL Data Warehouse

In this post, we will continue to explore the Backblaze Hard Drive Dataset, which provides basic information and statistics for different hard drives. As before, we will work with the data for the first quarters of 2016 and 2017. This time, however, we will store the 2017 data in Azure SQL Data Warehouse and unite it with the 2016 data that we stored in Azure Data Lake Store earlier.

To start off, we have to create a blank SQL Data Warehouse. Log in to the Azure Portal, click “Create a resource”, and select “SQL Data Warehouse” from the Databases category.

Azure new database

Then, fill out the information for the warehouse and click “Server” to create and configure a new server. Fill out the server form as shown in the image below, then click “Select”.

New server settings

Then, click “Performance tier”; for this tutorial, select “Gen2” and scale the system to the desired value. Click “Apply”.

Performance configuration

After that, click “Select” to create a SQL Data Warehouse and wait until the deployment is complete.

The next step is to edit the firewall settings so that our local machine can connect to the server. Go to the server’s overview page and select “Firewalls and virtual networks” under “Security”. Then, click “Add client IP” on the toolbar at the top of the page and click “Save”.

Firewall settings

At this point, we can upload data to the SQL Data Warehouse. There are several ways to do this; you can read more on the SQL Data Warehouse documentation page. Feel free to pick the method that works best for you. In this scenario, we decided to use Azure Data Factory; to do so, we suggest that you follow this tutorial.
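If you would rather load the files with plain T-SQL instead of Data Factory, PolyBase is one common alternative. The sketch below is a minimal example under a few assumptions of ours: the 2017 CSV files have been staged in a blob container named hdd, they have been trimmed to the six columns we use later (the raw Backblaze files contain many more columns, and PolyBase requires the external table to declare every column in the file), and the ext and hdd schemas already exist. All object names here are placeholders, not part of the tutorial.

-- Prerequisites for PolyBase: a database master key and a credential
-- holding the storage account key.
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

-- Point PolyBase at the container holding the staged CSV files.
CREATE EXTERNAL DATA SOURCE BackblazeBlob
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://hdd@<storage-account>.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

-- Comma-separated files with a header row to skip.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- External table over the staged files; the column list must match the files.
CREATE EXTERNAL TABLE ext.data_2017 (
    [date] VARCHAR(10),
    model VARCHAR(64),
    capacity_bytes BIGINT,
    failure INT,
    smart_1_raw BIGINT,
    smart_12_raw BIGINT
)
WITH (LOCATION = '/data_2017/', DATA_SOURCE = BackblazeBlob, FILE_FORMAT = CsvFormat);

-- Materialize the data into a distributed warehouse table that Dremio will query.
CREATE TABLE hdd.data_2017
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ext.data_2017;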

Data curation in Dremio

We have already stored the data for 2016 in Azure Data Lake Store and connected to it from Dremio; we covered the steps to accomplish this in our previous post. Now, we want to connect Dremio to Azure SQL Data Warehouse. Click “Add Source” and select “Microsoft SQL Server” as the type. You will need to provide some information about your warehouse.

You can find the host under “Server name” in the Overview section of your warehouse. The port for SQL Data Warehouse is 1433.

Overview

Then, provide your master credentials and specify the database to which you want to connect. Click “Save”.

Credentials

Next, the source will appear in the list and you can navigate to the dataset. As in the previous tutorial, we will not need all of the columns, so leave only date, model, capacity_bytes, failure, smart_1_raw, and smart_12_raw.

Dremio provides a handy “Drop” function, which you can find by clicking on the small arrow near the column name.

Data curation in Dremio

However, since we have so many columns, it is more convenient to use a SQL query.

Note: when connecting to the dataset from Azure SQL Data Warehouse, the data types are already defined, so in order to unite the datasets later, we will need to convert data from Azure Data Lake Store to the same types.

The SQL query for this dataset looks like the following:

SELECT CAST(data_2017."date" as DATE) AS "date", model, capacity_bytes, failure, smart_1_raw, smart_12_raw
FROM harddrives2017.BackBlazeHDDData.hdd.data_2017

Save the dataset to the “Drives” folder.

Saving dataset

For the Azure Data Lake Store dataset, keep the same columns and convert them to the matching types. Convert date to the Date format, as shown in the images below.

Changing formats

Dremio SQL Editor

In a similar way, change capacity_bytes, failure, smart_1_raw, and smart_12_raw to integer.

The SQL Editor should look like this:

SELECT TO_DATE("2016"."date", 'YYYY-MM-DD', 1) AS "date", model, CONVERT_TO_INTEGER(capacity_bytes, 1, 1, 0) AS capacity_bytes, CONVERT_TO_INTEGER(failure, 1, 1, 0) AS failure, CONVERT_TO_INTEGER(smart_1_raw, 1, 1, 0) AS smart_1_raw, CONVERT_TO_INTEGER(smart_12_raw, 1, 1, 0) AS smart_12_raw
FROM harddrives."2016"

Again, save this dataset to the “Drives” folder.

Now, as in the previous tutorial, we want to unite the two datasets into one called “fulldata” and create month and year variables. Let’s create a new query by clicking on the button next to the dataset as shown below.

Create new query

Then, write the following SQL query:

SELECT *
FROM Drives."2016"
UNION
SELECT *
FROM Drives."2017"
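Note that UNION removes duplicate rows while combining the two datasets. If you would rather keep every record, even exact duplicates, UNION ALL does that and is typically cheaper to execute because it skips the deduplication step:

SELECT *
FROM Drives."2016"
UNION ALL
SELECT *
FROM Drives."2017"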

Now, let’s create month and year variables by adding a new calculated field, as shown in the image.

Adding a calculated field

After all the preparations, the final SQL editor should look like this:

SELECT nested_1."date" AS "date", YEAR("date") AS "year", nested_1."month" AS "month", model, capacity_bytes, failure, smart_1_raw, smart_12_raw
FROM (
 SELECT nested_0."date" AS "date", MONTH("date") AS "month", model, capacity_bytes, failure, smart_1_raw, smart_12_raw
 FROM (
   SELECT nested_0."date" AS "date", model, capacity_bytes, failure, smart_1_raw, smart_12_raw
   FROM (
     SELECT *
     FROM Drives."2016"
     UNION
     SELECT *
     FROM Drives."2017"
   ) nested_0
 ) nested_0
) nested_1

Save this dataset as “fulldata” to the “Drives” folder.

Configuring the connection

Now you should configure the Power BI connection to Dremio. At the time of writing, Dremio is not yet preconfigured in Power BI, so you will need to use a custom ODBC driver. We expect this to change in the second half of 2018.

First, enable the Dremio connector in Power BI. To do this, click “File” in the upper-left corner and navigate to “Options” under the “Options and settings” menu. Go to “Preview features” and enable “Custom data connectors”.

Power BI Options

After that, restart Power BI and click Get data. Among the databases, you will find Dremio (Beta). Click Connect.

Available datasets

Enter the cluster node (if you are running Dremio on your local machine, enter localhost) and select DirectQuery as the connectivity mode. This means that each click in Power BI will issue a new, live query to Dremio.

Dremio cluster Node

Now, all that is left to do is enter your Dremio username and password and select the data that you want to visualize. Click Load.

Dremio credentials

Building visualizations

To start off, we want to see if there is any dependency between smart_1_raw, which represents the read error rate, and smart_12_raw, which is the power cycle count: the number of times the power was turned off and back on.

For this, let’s build a stacked column chart. Drag smart_1_raw to the Value field, leaving the sum aggregation, and put smart_12_raw in the Axis field.

Stacked column chart

You can also apply visual filters and, for example, choose to show only the items of smart_12_raw whose value is less than 50.

Applying filters
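Because we selected DirectQuery, this chart and its filter translate into a live aggregation against the virtual dataset rather than an import. Conceptually, the query Power BI pushes down to Dremio is similar to the following sketch (the exact generated SQL will differ, and we assume here that the combined dataset was saved as Drives.fulldata):

SELECT smart_12_raw, SUM(smart_1_raw) AS sum_smart_1_raw
FROM Drives.fulldata
WHERE smart_12_raw < 50
GROUP BY smart_12_raw
ORDER BY smart_12_raw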

Now, let’s see how the failures of devices with different capacity_bytes changed from year to year. Select the 100% Stacked bar chart, and put failure in Values, capacity_bytes in Legend, and year in Axis. Also, choose to show data labels to see the percentages. This is what you should get.

visualization labels

Now, if you right-click on the visualization, select Analyze, and then Explain the increase, you will see the following explanation.

Chart explanation

Note: this Power BI feature is in preview at the time of writing, so there might be some changes to it over time.

Finally, let’s build a treemap that shows the average failure by capacity_bytes. Click on the respective visual, put failure in the Values field, select the average aggregation, and put capacity_bytes in the Group property of the visual. As a result, we get the following visualization.

Power BI treemap
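As with the earlier chart, this treemap boils down to a simple aggregation that Dremio evaluates on the fly; a rough sketch of the equivalent query (again assuming the Drives.fulldata name) is:

SELECT capacity_bytes, AVG(failure) AS avg_failure
FROM Drives.fulldata
GROUP BY capacity_bytes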

Conclusion

Today, we demonstrated how to combine data from Azure Data Lake Store and Azure SQL Data Warehouse and visualize it in Power BI. With the help of Dremio, you can easily access all of the data in these services and then write queries to transform it. With the ODBC connection to Power BI, you can turn your data into visualizations and uncover important insights. Moreover, all of this is available through one uniform, high-performance Dremio interface.