Analyzing Multiple Cloud Data Sources using Dremio
Azure Data Lake Storage Gen2 is a newer storage solution on the Azure platform. It combines many of the advantages of Azure Data Lake Storage Gen1 and Azure Storage, offering a high level of convenience, scalability, and cost-efficiency. The right storage solution always depends on the particular use case, but ADLS Gen2 is a solid option for many of them.
Amazon S3 stands for Amazon Simple Storage Service. It is also a storage solution, but from Amazon. Amazon S3 allows storing huge amounts of data with high scalability, security, durability, and availability, and many companies use it for a wide range of use cases.
Dremio supports both ADLS Gen2 and Amazon S3 data sources. So, we can connect them to Dremio, perform data curation, and then export data to any BI or data science tool for further processing. In this tutorial, we will show how to load data to ADLS Gen2 and Amazon S3, how to connect these data sources to Dremio, how to perform data curation in Dremio, and how to work with Tableau after Dremio.
The dataset we are going to use is The World's Largest Banks dataset. We have intentionally split it into two parts so that we can work with them separately in Amazon S3 and Azure Data Lake Storage Gen2.
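If you want to reproduce the split yourself, a minimal pandas sketch like the one below will do. The file names match the tutorial (banks_first.csv, banks_second.csv), but the input file name and the half-and-half split point are our assumptions, not something dictated by the dataset.

```python
# Sketch: split a CSV dataset into two parts so each part can be
# uploaded to a different cloud store. The input file name and the
# 50/50 split are illustrative assumptions.
import pandas as pd


def split_dataset(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return the first and second half of the rows."""
    mid = len(df) // 2
    return df.iloc[:mid], df.iloc[mid:]


if __name__ == "__main__":
    banks = pd.read_csv("banks.csv")  # hypothetical source file
    first, second = split_dataset(banks)
    first.to_csv("banks_first.csv", index=False)    # destined for ADLS Gen2
    second.to_csv("banks_second.csv", index=False)  # destined for Amazon S3
```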
In this tutorial, we assume that you have the following items already set up:
Loading data into ADLS Gen2
Let’s load our data into Azure Data Lake Storage Gen2 first. To do this, we need to create a resource group, as well as a storage account. We will do this using the Azure portal.
Go to the Resource groups section and then click the Add button, as shown in the image below. Choose the subscription plan, the location, and the name of the resource group (we have chosen largest_banks).
After that, click the Review + create button, wait until validation passes, and press Create. After a short while, the resource group should be ready.
The next thing we need to do is create a storage account. Go to All services, select the Storage tab, and click on Storage accounts. Then click Add again. Fill in the form shown in the image below: choose a storage account name and a location, and select the resource group to which the storage account should belong. Make sure that StorageV2 (general purpose v2) is selected in the Account kind field. Then, click the Next : Advanced > button.
On the Advanced tab, you should set Data Lake Storage Gen2 option as Enabled (see the image below). Then, click Create.
After the storage account is created, you need to create a filesystem for it. Go to the File systems tab and press the + File system button:
We called our filesystem largestbanksdremio, but any valid name will do. The next step is to load files into storage. We will use the Azure Storage Explorer application for this purpose. After you run it and connect to your Azure account, you should be able to see the structure of your storage solutions (notice the largestbanksdremio filesystem under the banksstorage storage account, which is marked as ADLS Gen2):
Uploading a file is simple: just click the corresponding upload button and choose the needed file in your local filesystem:
After a successful upload, you should see the file in Azure Storage Explorer under the largestbanksdremio filesystem:
This means that the first part of the dataset is loaded into Azure Data Lake Storage Gen2.
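The same upload can also be done programmatically. Below is a minimal sketch using the azure-storage-file-datalake SDK, assuming the account and filesystem names from this tutorial; the access key placeholder must be replaced with key1 from the storage account's Access keys tab. The SDK import is deferred into the function so the sketch can be read (and the endpoint helper used) without the package installed.

```python
# Sketch: upload a file to ADLS Gen2 with the azure-storage-file-datalake
# SDK instead of Azure Storage Explorer. Account name, filesystem name,
# and key are placeholders from this tutorial, not working credentials.


def dfs_endpoint(account_name: str) -> str:
    """ADLS Gen2 accounts are addressed via the dfs (not blob) endpoint."""
    return f"https://{account_name}.dfs.core.windows.net"


def upload_to_adls(account: str, key: str, filesystem: str, path: str) -> None:
    # Deferred import; requires: pip install azure-storage-file-datalake
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(account_url=dfs_endpoint(account),
                                    credential=key)
    file_client = (service.get_file_system_client(filesystem)
                          .get_file_client(path))
    with open(path, "rb") as data:
        file_client.upload_data(data, overwrite=True)


# upload_to_adls("banksstorage", "<key1 from Access keys>",
#                "largestbanksdremio", "banks_first.csv")
```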
Loading data into Amazon S3
First of all, we need to create a bucket. To do this, on the AWS portal, go to Services and click on S3 in the Storage section. Then, click the Create bucket button:
As the next step, we fill in the fields required to create the new bucket. We specify the name of the bucket (largestbankss3) and click the Next button several times, because we want to leave all other options and permissions at their default values.
On the Review step, you can check the correctness of the information you provided. Then, click Create bucket:
Now we can upload the second part of the dataset to the S3 bucket. To do this, click the Upload button and then choose the file you want to upload from your local filesystem:
If the upload is successful, the file should appear in the list of the bucket's contents:
So, the second part of the dataset (banks_second.csv) has been loaded into Amazon S3.
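As with ADLS Gen2, the console steps have a programmatic equivalent. Here is a sketch using boto3, assuming the bucket name from this tutorial and that your AWS credentials are already configured in the usual way (environment variables or ~/.aws/credentials). The import is deferred so the sketch can be read without boto3 installed.

```python
# Sketch: upload a file to S3 via boto3 instead of the AWS console.
# The bucket name matches the tutorial; credentials are assumed to be
# configured in the standard AWS locations.
import os


def object_key(local_path: str) -> str:
    """By default, keep the object key equal to the file name."""
    return os.path.basename(local_path)


def upload_to_s3(bucket: str, local_path: str) -> None:
    # Deferred import; requires: pip install boto3
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, object_key(local_path))


# upload_to_s3("largestbankss3", "banks_second.csv")
```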
Dremio and ADLS Gen2 connection
On the home page of the Dremio GUI, click on the corresponding button to add a data source. In the window listing the available data sources, select the Azure Storage source.
Next, you need to fill in the form for the new data source connection:
In the Name field, specify the name of the data source (whatever you want; for example, largest_banks_adls). Account Name is the name of your Azure storage account. For authentication, you need the access key: go to the Azure portal and open the Access keys tab inside your storage account. Use key1 as the Shared Access Key in Dremio.
After this, click Save and wait until the connection is established. Then, you need to review the available dataset and make a couple of changes. To do this, click on the name of the file (banks_first.csv) in Dremio.
First of all, check the line delimiter. It should match the one used when the file was created (\n in our case). Also, check the Extract Field Names checkbox to name the columns using the values from the first row of the file. After this, press the Save button and the dataset will be created.
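If you want a quick local sanity check of what Extract Field Names will do, the sketch below mimics it with Python's csv module: the first row is promoted to column names and the remaining rows are keyed by them. The sample rows are illustrative, not the real dataset contents.

```python
# Sketch: mimic Dremio's "Extract Field Names" option locally - the
# first row of the CSV becomes the column names. Sample data is
# illustrative, not the actual dataset.
import csv
import io


def extract_field_names(text: str, delimiter: str = ",") -> tuple[list[str], list[dict]]:
    """Return (column names, rows as dicts keyed by those names)."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    return header, [dict(zip(header, row)) for row in body]


sample = "rank,bank,country\n1,ICBC,China\n2,CCB,China\n"
names, records = extract_field_names(sample)
# names   -> ['rank', 'bank', 'country']
# records -> [{'rank': '1', 'bank': 'ICBC', 'country': 'China'}, ...]
```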
Now you have Dremio connected to ADLS Gen2, and you can process the data stored in ADLS Gen2.
Dremio and Amazon S3 connection
Connecting Amazon S3 to Dremio is similar to connecting most other data sources. First of all, choose the needed source from the list of supported data sources:
Next, we need to fill in a form where several important parameters should be specified. One of them is the name of the data source; we called ours largest_banks_s3. We also need to specify the AWS Access Key and AWS Access Secret. Read on to learn where to find them if you don't have them yet.
Go to the AWS portal again. Click on All services and find IAM under the Security, Identity & Compliance section:
On the next page, click the Manage Security Credentials button:
The page with security credentials should appear. Go to Access keys (access key ID and secret access key) tab and click the Create New Access Key button there:
After that, the following pop-up window should appear:
You can use the show/hide toggle to reveal the access key. You will find both the AWS Access Key and the AWS Access Secret there. Copy them into the corresponding fields in Dremio, then press the Save button; if everything is fine, the connection will be created. The next several steps are similar to the other connection cases (go to the file, check the line delimiter, check the Extract Field Names checkbox to use the first row as column names, etc.).
At the end of this work, you should see the following picture inside your Dremio GUI:
This means that you have successfully connected both Azure Data Lake Storage Gen2 (with the banks_first.csv file) and Amazon S3 (with the banks_second.csv file) to Dremio, and you are ready to proceed with the data curation step.
Data curation in Dremio
Let’s perform some manipulations with our data. For further work with the dataset, we need to join our tables (banks_first and banks_second). To do this, open the first table and click Join. Then, choose the second table and press Next. We are going to merge our tables on the A column. Click Apply.
After that, we can drop additional columns that we don’t need anymore.
Then, we will change the data types of some columns that store numbers and dates but currently have a text data type. To change a column's type, click the data type icon in the left corner of the column header and choose a suitable type, in our case Date.
Similarly, we will convert the total_assets_us_b column to float and the rank column to integer. Then we save our dataset and create a .tds file for further visualization in Tableau.
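The curation steps above can be sketched in pandas for readers who want to see the equivalent logic outside Dremio: join the two halves on the shared A column, then convert the text columns to their proper types. The key column A comes from the tutorial; the presence of a "date" column and the exact shape of the other columns are illustrative assumptions.

```python
# Sketch: the same curation steps expressed in pandas - join the two
# halves on column "A", then convert text columns to numeric and date
# types. Column names other than "A", "rank", and "total_assets_us_b"
# are illustrative assumptions.
import pandas as pd


def curate(first: pd.DataFrame, second: pd.DataFrame) -> pd.DataFrame:
    joined = first.merge(second, on="A")               # inner join on column A
    joined["rank"] = joined["rank"].astype(int)        # text -> integer
    joined["total_assets_us_b"] = joined["total_assets_us_b"].astype(float)
    joined["date"] = pd.to_datetime(joined["date"])    # text -> date
    return joined
```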
Now we can connect Dremio to Tableau and import the data to this BI tool.
Tableau and Dremio connection
Now we are going to open the previously created file in Tableau. To do that, we need to enter Dremio username and password.
Building visualizations in Tableau
Let’s start building visualizations. Set up the latitude and longitude dimensions and drag them into Rows and Columns as shown below. This plots all of the banks on a map.
The next step is to drag the Country field onto Color.
The plot now shows all of the countries with the World’s Largest Banks on the map. But let’s add some additional information to it.
Drag the Total assets measure onto Label to display the total amount of assets of the World’s Largest Banks for each country.
What if you also want to know the number of banks in each country? Banks is not a measure but a dimension, and if we just drag it onto Detail, nothing will happen. To resolve this, change the aggregation: click the field and choose Measure, then Count.
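The "number of banks per country" figure that Tableau computes here is a simple count aggregation. For intuition, here is the same calculation in pandas on a toy sample (the rows are illustrative, not the real dataset):

```python
# Sketch: the banks-per-country count that Tableau's Measure > Count
# produces, reproduced in pandas. A dimension becomes a metric by
# counting its values per group. Sample rows are illustrative.
import pandas as pd

banks = pd.DataFrame({
    "bank":    ["ICBC", "CCB", "ABC", "JPMorgan", "BofA"],
    "country": ["China", "China", "China", "USA", "USA"],
})
banks_per_country = banks.groupby("country")["bank"].count()
# China -> 3, USA -> 2
```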
Our final graph looks like the following. By hovering on a particular circle, you can see the number of banks.
The cloud has made it increasingly common to have related data stored in different systems and locations. Dremio makes it easy to join data from different sources. In addition, its abstraction layer, powered by virtual datasets, makes it possible to move data over time without having to change any queries, reports and applications.
Today we demonstrated how to visualize data from ADLS Gen2 and Amazon S3 in Tableau. If you or your company use multiple data storage solutions, consider the Dremio platform to simplify data visualization across different BI tools.
We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.