What Is a Data Lakehouse?
The data lakehouse is a new architecture that combines the best parts of data lakes and data warehouses. Learn more about the data lakehouse and its key advantages.read more
Azure Blob Storage is a cloud storage solution for massive amounts of unstructured data. It also can be used for streaming video, saving logs, storing data for backup and restore, storing data for analysis, and more. While it is a powerful tool to store data, sometimes there is a need to work with data from other data sources, such as PostgreSQL. Such combinations might be complicated to use for further analysis with BI tools. In this tutorial, we will demonstrate how to make this process easy with Dremio.
We will store our data in Azure Blob Storage and PostgreSQL, then connect these data sources to Dremio and prep the data to then analyze it using Tableau.
We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deployments page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. You should have an Azure Account and storage account. We will also use Postgre, and Tableau Desktop.
In this tutorial, we will use a car accident dataset available in Kaggle. The accident is a unique value and can be recognized by Accident_Index. There is also information about vehicles, drivers, location, etc. Using this dataset, we can perform analysis for preventing such situations and reveal what trends are there for accidents.
To import data into Azure Blob Storage, go to Azure main page, open Storage account, choose Blobs and add a container as shown on the screenshots below.
After that, open container and load your file by clicking “Upload”. Now, you should load data into SQL database. For that, first, create a database. Also, you should create a server if you haven’t done it before. Then, go to the database and choose the query editor.
First, create a password:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'UNIQUE_STRING_HERE'
Next, create a Blob storage credential:
CREATE DATABASE SCOPED CREDENTIAL BlobCredential WITH IDENTITY = 'SHARED ACCESS SIGNATURE', SECRET = 'KEY ENCRYPTION';
Create an external data source:
CREATE EXTERNAL DATA SOURCE AzureBlob WITH ( TYPE = BLOB_STORAGE, LOCATION = 'https://kshakhovska.blob.core.windows.net/test-files', CREDENTIAL = BlobCredential );
After that, create a table which has the same columns as the CSV file:
CREATE TABLE Vehicle ( Accident_Index TEXT, Vehicle_Reference INT, Vehicle_Type INT, Towing_and_Articulation INT, Vehicle_Manoeuvre INT, Vehicle_Location_Restricted_Lane INT, Junction_Location INT, Skidding_and_Overturning INT, Hit_Object_in_Carriageway INT, Vehicle_Leaving_Carriageway INT, Hit_Object_off_Carriageway INT, First_Point_of_Impact INT, Was_Vehicle_Left_Hand_Drive INT, Journey_Purpose_of_Driver INT, Sex_of_Driver INT, Age_of_Driver INT, Age_Band_of_Driver INT, Engine_Capacity INT, Propulsion_Code INT, Age_of_Vehicle INT, Driver_IMD_Decile INT, Driver_Home_Area_Type INT )
Finally, you are ready to insert your data:
BULK INSERT dbo.Vehicle FROM 'Vehicles_reduce.csv' WITH ( FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', DATA_SOURCE = 'AzureBlob' );
Let’s check whether the data was uploaded correctly.
Now, let’s import data into Postgre. For that, create a table for your CSV. Then choose import/export data and fill in the information using an example below.
As we can see, we loaded the data successfully.
When data is uploaded, login to Dremio and choose an option to add a new source. Then, select Microsoft SQL Server and Postgre and fill in fields with necessary information as shown below:
When the sources are connected, we can begin data preprocessing. Let’s open Vehicles from Azure Blob Storage source and merge it with a descriptive dataset from Postgre. For that, click “Join”, select Postgre source and choose your data.
Next, provide a join condition and choose the inner join type. Click preview to check whether everything is okay before applying.
Let’s remove our joining condition from the dataset and leave only the necessary label value. For that, drop those two columns and rename label.
Here is the result. Let’s repeat the same steps with other descriptive CSV. This is our final dataset.
After all the preparations, Dremio will automatically generate the final SQL script.
SELECT nested_1.Accident_Index AS Accident_Index, nested_1.Vehicle_Reference AS Vehicle_Reference, nested_1.Vehicle_Type AS Vehicle_Type, nested_1.Towing_and_Articulation AS Towing_and_Articulation, nested_1.Vehicle_Location_Restricted_Lane AS Vehicle_Location_Restricted_Lane, nested_1.Junction_Location AS Junction_Location, nested_1.Skidding_and_Overturning AS Skidding_and_Overturning, nested_1.Hit_Object_in_Carriageway AS Hit_Object_in_Carriageway, nested_1.Vehicle_Leaving_Carriageway AS Vehicle_Leaving_Carriageway, nested_1.Hit_Object_off_Carriageway AS Hit_Object_off_Carriageway, nested_1.First_Point_of_Impact AS First_Point_of_Impact, nested_1.Was_Vehicle_Left_Hand_Drive AS Was_Vehicle_Left_Hand_Drive, nested_1.Journey_Purpose_of_Driver AS Journey_Purpose_of_Driver, nested_1.Age_of_Driver AS Age_of_Driver, nested_1.Age_Band_of_Driver AS Age_Band_of_Driver, nested_1.Engine_Capacity AS Engine_Capacity, nested_1.Propulsion_Code AS Propulsion_Code, nested_1.Age_of_Vehicle AS Age_of_Vehicle, nested_1.Driver_IMD_Decile AS Driver_IMD_Decile, nested_1.Driver_Home_Area_Type AS Driver_Home_Area_Type, nested_1.Vehicle_Manoeuvre AS Vehicle_Manoeuvre, join_Sex_of_Driver.label AS Sex_of_Driver FROM ( SELECT nested_0.Accident_Index AS Accident_Index, nested_0.Vehicle_Reference AS Vehicle_Reference, nested_0.Vehicle_Type AS Vehicle_Type, nested_0.Towing_and_Articulation AS Towing_and_Articulation, nested_0.Vehicle_Location_Restricted_Lane AS Vehicle_Location_Restricted_Lane, nested_0.Junction_Location AS Junction_Location, nested_0.Skidding_and_Overturning AS Skidding_and_Overturning, nested_0.Hit_Object_in_Carriageway AS Hit_Object_in_Carriageway, nested_0.Vehicle_Leaving_Carriageway AS Vehicle_Leaving_Carriageway, nested_0.Hit_Object_off_Carriageway AS Hit_Object_off_Carriageway, nested_0.First_Point_of_Impact AS First_Point_of_Impact, nested_0.Was_Vehicle_Left_Hand_Drive AS Was_Vehicle_Left_Hand_Drive, nested_0.Journey_Purpose_of_Driver AS Journey_Purpose_of_Driver, nested_0.Sex_of_Driver AS Sex_of_Driver, nested_0.Age_of_Driver AS Age_of_Driver, nested_0.Age_Band_of_Driver AS Age_Band_of_Driver, nested_0.Engine_Capacity AS Engine_Capacity, nested_0.Propulsion_Code AS Propulsion_Code, nested_0.Age_of_Vehicle AS Age_of_Vehicle, nested_0.Driver_IMD_Decile AS Driver_IMD_Decile, nested_0.Driver_Home_Area_Type AS Driver_Home_Area_Type, join_Vehicle_Manoeuvre.label AS Vehicle_Manoeuvre FROM ( SELECT Accident_Index, Vehicle_Reference, Vehicle_Type, Towing_and_Articulation, Vehicle_Manoeuvre, Vehicle_Location_Restricted_Lane, Junction_Location, Skidding_and_Overturning, Hit_Object_in_Carriageway, Vehicle_Leaving_Carriageway, Hit_Object_off_Carriageway, First_Point_of_Impact, Was_Vehicle_Left_Hand_Drive, Journey_Purpose_of_Driver, Sex_of_Driver, Age_of_Driver, Age_Band_of_Driver, Engine_Capacity, Propulsion_Code, Age_of_Vehicle, Driver_IMD_Decile, Driver_Home_Area_Type FROM "Azure SQL Database".vehicle.dbo.Vehicle ) nested_0 INNER JOIN Postgre.public.Vehicle_Manoeuvre AS join_Vehicle_Manoeuvre ON nested_0.Vehicle_Manoeuvre = join_Vehicle_Manoeuvre.code ) nested_1 INNER JOIN Postgre.public.Sex_of_Driver AS join_Sex_of_Driver ON nested_1.Sex_of_Driver = join_Sex_of_Driver.code
We have created a new space for our data and will save it there.
Once you have your ODBC driver installed, we are ready to connect to Tableau. Click on “Tableau” tab and download a file. Open it and provide your Dremio credentials.
Let’s begin the visualization part. Tableau has a user-friendly interface which contains such main parts:
We want to build a chart that can demonstrate car accident statistics. For that, we choose gender, vehicle maneuver, engine capacity, and driver’s age. We can distinguish vehicle maneuver by color and engine capacity by size. For better accuracy, let’s filter out data from nulls. Finally, we receive such visualization:
Charts are very convenient for analysis. For example, we can see that “Going ahead right-hand bend” manoeuvre mistake do equally men and women, whereas “Moving off” mistake mostly do men. Or we can analyze engine capacity by size. “Overtaking - nearside” caused by cars with the smallest engines.
Sometimes you may not be sure which type of chart to choose. In Tableau, you can use “Show Me” icon and try different types. Moreover, if you don’t have enough dimensions, by hovering on the desired chart you can see what else you need.
For example, using the same columns we can build a different chart.
As you can see, using the same data we can point out different things. In the previous chart using color feature, we could see manoeuvre distribution; in this chart, we can observe gender distribution.
In this tutorial, we showed how to work with Azure Blob Storage and how to copy data to Azure SQL database. With the help of Dremio, you can easily combine data from several data sources and prepare it for further analysis. Therefore, we provided graphical analysis using Tableau to which you can connect from Dremio with only one click.
We hope you enjoyed this tutorial, checkout the rest of our tutorials and resources page to learn more about how you can gaining insights from your Azure data, faster, using Dremio.
The adoption of data mesh as a decentralized data management approach has become popular in recent years, helping teams overcome challenges associated with centralized data architecture.read more
Learn how MSK accelerates cancer research by enabling faster access to data using Dremio's open data lakehouseread more