Dremio Jekyll

Enabling Data-as-a-Service for Azure, PostgreSQL and Tableau

Intro

Azure Blob Storage is a cloud storage solution for massive amounts of unstructured data. It also can be used for streaming video, saving logs, storing data for backup and restore, storing data for analysis, and more. While it is a powerful tool to store data, sometimes there is a need to work with data from other data sources, such as PostgreSQL. Such combinations might be complicated to use for further analysis with BI tools. In this tutorial, we will demonstrate how to make this process easy with Dremio.

We will store our data in Azure Blob Storage and PostgreSQL, then connect these data sources to Dremio and prep the data to then analyze it using Tableau.

Assumptions

We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s download page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. You should have an Azure Account and storage account. We will also use Postgre, and Tableau Desktop.

Importing data into sources

In this tutorial, we will use a car accident dataset available in Kaggle. The accident is a unique value and can be recognized by Accident_Index. There is also information about vehicles, drivers, location, etc. Using this dataset, we can perform analysis for preventing such situations and reveal what trends are there for accidents.

To import data into Azure Blob Storage, go to Azure main page, open Storage account, choose Blobs and add a container as shown on the screenshots below.

image alt text

image alt text

image alt text

After that, open container and load your file by clicking “Upload”. Now, you should load data into SQL database. For that, first, create a database. Also, you should create a server if you haven’t done it before. Then, go to the database and choose the query editor.

First, create a password:

1
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'UNIQUE_STRING_HERE'

Next, create a Blob storage credential:

1
2
3
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 'KEY ENCRYPTION';

Create an external data source:

1
2
3
4
5
6
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (
    TYPE       = BLOB_STORAGE,
    LOCATION   = 'https://kshakhovska.blob.core.windows.net/test-files',
    CREDENTIAL = BlobCredential
);

After that, create a table which has the same columns as the CSV file:

image alt text

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
CREATE TABLE Vehicle (
   Accident_Index TEXT,
   Vehicle_Reference INT,
   Vehicle_Type INT,
   Towing_and_Articulation INT,
   Vehicle_Manoeuvre   INT,
   Vehicle_Location_Restricted_Lane INT,
   Junction_Location   INT,
   Skidding_and_Overturning    INT,
   Hit_Object_in_Carriageway   INT,
   Vehicle_Leaving_Carriageway INT,
   Hit_Object_off_Carriageway  INT,
   First_Point_of_Impact   INT,
   Was_Vehicle_Left_Hand_Drive INT,
   Journey_Purpose_of_Driver   INT,
   Sex_of_Driver   INT,
   Age_of_Driver   INT,
   Age_Band_of_Driver  INT,
   Engine_Capacity INT,
   Propulsion_Code INT,
   Age_of_Vehicle  INT,
   Driver_IMD_Decile   INT,
   Driver_Home_Area_Type INT
)

Finally, you are ready to insert your data:

1
2
3
4
5
6
7
BULK INSERT dbo.Vehicle
FROM 'Vehicles_reduce.csv'
WITH (
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n',
DATA_SOURCE = 'AzureBlob'
);

image alt text

Let’s check whether the data was uploaded correctly.image alt text

Now, let’s import data into Postgre. For that, create a table for your CSV. Then choose import/export data and fill in the information using an example below.

image alt text

As we can see, we loaded the data successfully.

image alt text

Connecting to Dremio

When data is uploaded, login to Dremio and choose an option to add a new source. Then, select Microsoft SQL Server and Postgre and fill in fields with necessary information as shown below:

image alt text

image alt text

Data curation in Dremio

When the sources are connected, we can begin data preprocessing. Let’s open Vehicles from Azure Blob Storage source and merge it with a descriptive dataset from Postgre. For that, click “Join”, select Postgre source and choose your data.

image alt text

Next, provide a join condition and choose the inner join type. Click preview to check whether everything is okay before applying.

image alt text

Let’s remove our joining condition from the dataset and leave only the necessary label value. For that, drop those two columns and rename label.

image alt text

Here is the result. Let’s repeat the same steps with other descriptive CSV. This is our final dataset.

image alt text

After all the preparations, Dremio will automatically generate the final SQL script.

1
2
3
4
5
6
7
8
9
10
11
SELECT nested_1.Accident_Index AS Accident_Index, nested_1.Vehicle_Reference AS Vehicle_Reference, nested_1.Vehicle_Type AS Vehicle_Type, nested_1.Towing_and_Articulation AS Towing_and_Articulation, nested_1.Vehicle_Location_Restricted_Lane AS Vehicle_Location_Restricted_Lane, nested_1.Junction_Location AS Junction_Location, nested_1.Skidding_and_Overturning AS Skidding_and_Overturning, nested_1.Hit_Object_in_Carriageway AS Hit_Object_in_Carriageway, nested_1.Vehicle_Leaving_Carriageway AS Vehicle_Leaving_Carriageway, nested_1.Hit_Object_off_Carriageway AS Hit_Object_off_Carriageway, nested_1.First_Point_of_Impact AS First_Point_of_Impact, nested_1.Was_Vehicle_Left_Hand_Drive AS Was_Vehicle_Left_Hand_Drive, nested_1.Journey_Purpose_of_Driver AS Journey_Purpose_of_Driver, nested_1.Age_of_Driver AS Age_of_Driver, nested_1.Age_Band_of_Driver AS Age_Band_of_Driver, nested_1.Engine_Capacity AS Engine_Capacity, nested_1.Propulsion_Code AS Propulsion_Code, nested_1.Age_of_Vehicle AS Age_of_Vehicle, nested_1.Driver_IMD_Decile AS Driver_IMD_Decile, nested_1.Driver_Home_Area_Type AS Driver_Home_Area_Type, nested_1.Vehicle_Manoeuvre AS Vehicle_Manoeuvre, join_Sex_of_Driver.label AS Sex_of_Driver
FROM (
 SELECT nested_0.Accident_Index AS Accident_Index, nested_0.Vehicle_Reference AS Vehicle_Reference, nested_0.Vehicle_Type
AS Vehicle_Type, nested_0.Towing_and_Articulation AS Towing_and_Articulation, nested_0.Vehicle_Location_Restricted_Lane AS Vehicle_Location_Restricted_Lane, nested_0.Junction_Location AS Junction_Location, nested_0.Skidding_and_Overturning AS Skidding_and_Overturning, nested_0.Hit_Object_in_Carriageway AS Hit_Object_in_Carriageway, nested_0.Vehicle_Leaving_Carriageway AS Vehicle_Leaving_Carriageway, nested_0.Hit_Object_off_Carriageway AS Hit_Object_off_Carriageway, nested_0.First_Point_of_Impact AS First_Point_of_Impact, nested_0.Was_Vehicle_Left_Hand_Drive AS Was_Vehicle_Left_Hand_Drive, nested_0.Journey_Purpose_of_Driver AS Journey_Purpose_of_Driver, nested_0.Sex_of_Driver AS Sex_of_Driver, nested_0.Age_of_Driver AS Age_of_Driver, nested_0.Age_Band_of_Driver AS Age_Band_of_Driver, nested_0.Engine_Capacity AS Engine_Capacity, nested_0.Propulsion_Code AS Propulsion_Code, nested_0.Age_of_Vehicle AS Age_of_Vehicle, nested_0.Driver_IMD_Decile AS Driver_IMD_Decile, nested_0.Driver_Home_Area_Type AS Driver_Home_Area_Type, join_Vehicle_Manoeuvre.label AS Vehicle_Manoeuvre
 FROM (
   SELECT Accident_Index, Vehicle_Reference, Vehicle_Type, Towing_and_Articulation, Vehicle_Manoeuvre, Vehicle_Location_Restricted_Lane, Junction_Location, Skidding_and_Overturning, Hit_Object_in_Carriageway, Vehicle_Leaving_Carriageway, Hit_Object_off_Carriageway, First_Point_of_Impact, Was_Vehicle_Left_Hand_Drive, Journey_Purpose_of_Driver, Sex_of_Driver, Age_of_Driver, Age_Band_of_Driver, Engine_Capacity, Propulsion_Code, Age_of_Vehicle, Driver_IMD_Decile, Driver_Home_Area_Type
   FROM "Azure SQL Database".vehicle.dbo.Vehicle
 ) nested_0
  INNER JOIN Postgre.public.Vehicle_Manoeuvre AS join_Vehicle_Manoeuvre ON nested_0.Vehicle_Manoeuvre = join_Vehicle_Manoeuvre.code
) nested_1
INNER JOIN Postgre.public.Sex_of_Driver AS join_Sex_of_Driver ON nested_1.Sex_of_Driver = join_Sex_of_Driver.code

We have created a new space for our data and will save it there.

image alt text

Building visualizations in Tableau

Once you have your ODBC driver installed, we are ready to connect to Tableau. Click on “Tableau” tab and download a file. Open it and provide your Dremio credentials.

image alt text

Let’s begin the visualization part. Tableau has a user-friendly interface which contains such main parts:

  • Data: it includes your dimensions and measures from the data source
  • Pages: allows building history animation based on this field
  • Columns/Rows contain a field that will be visualized in column/row in the chart
  • Filters: enables filtering data
  • Marks: allows adding marks to your chart.

We want to build a chart that can demonstrate car accident statistics. For that, we choose gender, vehicle maneuver, engine capacity, and driver’s age. We can distinguish vehicle maneuver by color and engine capacity by size. For better accuracy, let’s filter out data from nulls. Finally, we receive such visualization:

image alt text

Charts are very convenient for analysis. For example, we can see that “Going ahead right-hand bend” manoeuvre mistake do equally men and women, whereas “Moving off” mistake mostly do men. Or we can analyze engine capacity by size. “Overtaking - nearside” caused by cars with the smallest engines.

Sometimes you may not be sure which type of chart to choose. In Tableau, you can use “Show Me” icon and try different types. Moreover, if you don’t have enough dimensions, by hovering on the desired chart you can see what else you need.

image alt text

For example, using the same columns we can build a different chart.

image alt text

As you can see, using the same data we can point out different things. In the previous chart using color feature, we could see manoeuvre distribution; in this chart, we can observe gender distribution.

Conclusions

In this tutorial, we showed how to work with Azure Blob Storage and how to copy data to Azure SQL database. With the help of Dremio, you can easily combine data from several data sources and prepare it for further analysis. Therefore, we provided graphical analysis using Tableau to which you can connect from Dremio with only one click.

We hope you enjoyed this tutorial, checkout the rest of our tutorials and resources page to learn more about how you can gaining insights from your Azure data, faster, using Dremio.