Dremio Jekyll

Analyzing Multiple Cloud Data Sources using Dremio

Dremio

Intro

Azure Data Lake Storage Gen2 is a new storage solution from the Azure platform. It combines many advantages of Azure Data Lake Storage Gen1 and Azure Storage. ADLS Gen2 provides a great level of convenience, scalability, and cost-efficiency. Each organization or person who needs a storage solution should choose one depending on their particular use case, but ADLS Gen2 is definitely a decent option.

Amazon S3 stands for Amazon Simple Storage Service. It is also a storage solution but from Amazon. Amazon S3 allows storing huge amount of data with high scalability, security, durability, and availability. Many companies use Amazon S3 for a wide range of use cases.

Dremio supports both ADLS Gen2 and Amazon S3 data sources. So, we can connect them to Dremio, perform data curation, and then export data to any BI or data science tool for further processing. In this tutorial, we will show how to load data to ADLS Gen2 and Amazon S3, how to connect these data sources to Dremio, how to perform data curation in Dremio, and how to work with Tableau after Dremio.

The dataset we are going to use is The World’s Largest Banks dataset. We have intentionally split the dataset into 2 parts to work with them separately in Amazon S3 and Azure Data Lake Storage Gen2.

Assumptions

In this tutorial, we assume that you have the following items already setup:

Loading data into ADLS Gen2

Let’s load our data into Azure Data Lake Storage Gen2 first. To do this, we need to create a resource group, as well as a storage account. We will do this using the Azure portal.

Go to the Resource groups section and then click on the Add button, as shown on the image below. Choose the subscription plan, location, and the name of the resource group (we have chosen largest_banks).

image alt text

After that, click on the Review + create button, wait until the validation is passed, and press Create. After a while, the resource group should be ready.

image alt text

The next thing we need to do is to create a storage account. Go to All services, select the Storage tab, and click on Storage accounts. Then click Add again. Fill the form which you can see on the image below. Namely, you need to devise a storage account name, choose location, select the resource group to which the storage account should belong to. Make sure that StorageV2 (general purpose v2) is selected in the field Account kind. Then, click Next : Advanced > button.

image alt text

On the Advanced tab, you should set Data Lake Storage Gen2 option as Enabled (see the image below). Then, click Create.

image alt text

After the storage account is created, you need to create a filesystem for this account. Go to File systems tab and press + File system button:

image alt text

We called our filesystem as largestbanksdremio. You can call it according to your imagination. In any case, the next step is to load files into storage. We will use Azure Storage Explorer software for this purpose. After you run it and connect to your Azure account, you should be able to see the structure of your storage solutions here (notice that we have largestbanksdremio filesystem under banksstorage storage account which is marked as ADLS Gen2):

image alt text

The file loading is simple - just click on the corresponding button for uploading and choose the needed file in your local filesystem:

image alt text

After the successful uploading, you should see the file in the Azure Storage Explorer under the largestbanksdremio file system:

image alt text

This means that the first part of the dataset is loaded into Azure Data Lake Storage Gen2.

Loading data into Amazon S3

First of all, we need to create a bucket. To do this, on AWS portal, go to Services and click on S3 in the Storage section. Then, click on the Create bucket button:

image alt text

As a next step, we need to fill in the fields which are required for the new bucket creation. We specify the name of the bucket (largestbankss3) and click the Next button several times (because we want to leave all other options and permissions in their default state).

image alt text

On the Review step, you can check the correctness of the information you provided to the system. Then, click Create bucket:

image alt text

Now we can upload the second part of the dataset to S3 bucket. To do this, click on the Upload button and then specify the file you want to upload in your local filesystem:

image alt text

If the uploading is successful, the file should appear in the list of the bucket’s content:

image alt text

So, the second part of the dataset (banks_second.csv) has been loaded into Amazon S3.

Dremio and ADLS Gen2 connection

On the home page of Dremio GUI, click on the corresponding button to add data sources. In the next window of available data sources select Azure Storage source.

image alt text

At the next step, you need to fill the form for the new data source connection:

image alt text

The Name field is the field for specifying the name of the data source (whatever you want, for example, largest_banks_adls). Account Name is the name of your Azure storage account. For authentication, you need to find the access key. Go to Azure portal and click on Access keys tab inside your storage account. Use key1 as Shared Access Key in Dremio.

image alt text

After this, you can click Save and wait until the connection is established. Then, you need to review the available dataset and make a couple of changes. To do this, click on the name of the file (banks_first.csv) in Dremio.

First of all, check the line delimiter. It should be the same as was used when creating the file (\n in our case). Also, click on the checkbox for Extract Field Names to name the columns using the values from the first row of the file. After this, you can press the Save button and the dataset will be created.

image alt text

Now you have Dremio connected to ADLS Gen2, and you can process the data stored in ADLS Gen2.

Dremio and Amazon S3 connection

The creation of the connection between Dremio and Amazon S3 is similar to the connection of many other data sources to Dremio. First of all, you should choose the needed data source from the list of supported data sources:

image alt text

Next, we need to fill in the form, where several important parameters should be specified. The name of the data source is one of them. We decided to call our data source as the largest_banks_s3. Also, we need to specify the AWS Access Key and AWS Access Secret. Read further to learn where to find them if you don’t have them yet.

image alt text

Go to the AWS portal again. Click on All services and find IAM under the Security, Identity & Compliance section:

image alt text

On the next page, click the Manage Security Credentials button:

image alt text

The page with security credentials should appear. Go to Access keys (access key ID and secret access key) tab and click the Create New Access Key button there:

image alt text

After that, the following pop up window should appear:

image alt text

You can toggle special button to show or hide access key. You will find both AWS Access Key and AWS Access Secret there. Copy them and insert in the corresponding fields in Dremio. Now you can press the Save button, and if everything is fine, the connection should be created. The next several steps is similar to other connection cases also (go to the file, check the line delimiter, click the checkbox to use the first row as the names for the columns, etc.).

At the end of the work done, you should see the following picture inside your Dremio GUI:

image alt text

This means that you have successfully connected both Azure Data Lake Studio Gen2 (with banks_first.csv file) and Amazon S3 (with banks_second.csv file) to Dremio and you are ready to proceed with data curation step.

Data curation in Dremio

Let’s perform some manipulations with our data. For further work with the dataset, we need to join our tables (banks_first and banks_second). In order to do it, open the first table, and click Join. Then, choose the second table and press Next. We are going to merge our tables on A column. Click Apply.

image alt text

After that, we can drop additional columns that we don’t need anymore.

image alt text

Then, we will change the data types of some columns that store numbers and dates but have a text data type. In order to replace data format, we need to click on the left corner of the cell and choose a suitable data format, in our case Date.image alt text

Also, we will replace total_assest_us_b *(float) and *rank (integer) columns. Then we need to save our dataset and create .tds file for further visualization in Tableau.

image alt text

Now we can connect Dremio to Tableau and import the data to this BI tool.

Tableau and Dremio connection

Now we are going to open the previously created file in Tableau. To do that, we need to enter Dremio username and password.

image alt text

Building visualizations in Tableau

Let’s start building visualizations. Set up latitude and longitude dimensions and drag them into rows and columns as you can see below. This will illustrate all of the banks on a map.

image alt text

The next step is to drag the Country field into the color field.

image alt text

The plot is showing all of our countries with the World’s Largest Banks on the map. But let’s add some additional information to it.

image alt text

Drag Total assets metric to the Label tab to display the total amount of assets for the World’s Largest Banks by a particular country.

image alt text

What if you also want to know the number of banks assigned to countries? Banks are not metrics, they are dimensions, and if we just drag them on the Detail tab, nothing will happen. To resolve this problem, we should change the data format (click measure and choose count).

image alt text

Our final graph looks like the following. By hovering on a particular circle, you can see the number of banks.

image alt text

Conclusion

The cloud has made it increasingly common to have related data stored in different systems and locations. Dremio makes it easy to join data from different sources. In addition, its abstraction layer, powered by virtual datasets, makes it possible to move data over time without having to change any queries, reports and applications.

Today we demonstrated how to visualize data from ADLS and Amazon S3 in Tableau. If you or your company use different data storage solutions, take into consideration Dremio platform to simplify data visualization across different BI tools.

We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.