Visualizing Azure Data Lake with Apache Superset and Dremio

   

Table of Contents

Intro

Azure Data Lake is a scalable data storage and analytics service. It reduces efforts for everyone to store data of any format and type and do processing and analytics across platforms and languages. Azure Data Lake provides a variety of functions and solutions for data management and governance. To make your application more powerful, you can integrate ADLS with operational stores and data warehouses.

Superset is a modern BI web application project that is in the incubating stages at The Apache Software Foundation. It is an open source project that provides users with an intuitive, visual and interactive data exploration platform. Some of the key features that Superset offer are:

  • Over 30 types of visualizations
  • Druid.io integration
  • Easy to use constructor for visualizations
  • Easy to share and collaborate on dashboards
  • Enterprise-ready authentication
  • A simple semantic layer that allows users to decide which fields they want to use in their visualizations
  • And much more.

In this tutorial we will explore how can we visualize your data lake using Superset and Dremio and Azure Data Lake Store. We will connect Dremio to ADLS and then after preparing and curating the data, we will connect Superset to Dremio and generate a dashboard. Without further ado, let’s dive right in!

Assumptions

We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deploy page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. You should have an Azure Account and storage account.

Also, for this tutorial we expect that you have Superset installed and configured, if you haven’t done so, take a look at the Superset installation documentation, there you will find detailed information about pre-requisites, and installation steps for different operating systems.

  • Be familiar with the fundamentals of Dremio (we will be using Dremio 3.2)
  • Have the latest version of python and pip installed

Installing Superset

For this tutorial, I will be installing Supserset on Mac OSX Mojave. If you need details about how to install it on a different OS, take a look at their documentation page, they have very well detailed information about how to do so.

First, we need to update pip

pip install --upgrade setuptools pip

Then follow these steps to install Superset

# Install superset
pip install superset
# Create an admin user (you can hit enter in each entry to leave values as default, however, you will need to set an admin password)
fabmanager create-admin --app superset
# Initialize the database
superset db upgrade
# Load sample data
superset load_examples
# Create default roles and permissions
superset init
# To start a development web server on port 8088, use -p to bind to another port
superset runserver -d

Once all the steps are completed, open a new tab in your browser and go to http://localhost:8088/ here you will be presented with the login screen

image alt text

If you left the default values when you were installing Superset, just type admin and the password that you chose in the respective fields.

Before we continue on Superset, let’s switch gears for a moment and get our data ready in Dremio.

Importing data into Azure Data Lake Store

In this tutorial we will analyze this App reviews dataset from Kaggle. This dataset contains house sale prices for King County, namely homes sold between May 2014 and May 2015.

To import data into Azure Azure Data Lake Storage, go to Azure All services page, open Storage and choose Data Lake Storage Gen1. Then, add a container and upload data as shown on the screenshots below.

image alt text

Fill in the necessary information

image alt text

Then, click “create” and after a few minutes your container will be created. You will see a notification when the process has been successfully created. If you don’t see it immediately, simply refresh the page and you will see the item listed on the main screen

image alt text

To upload the data, select your storage item and then click on “data explorer”

image alt text

Then from the toolbar on the top of the screen, select ‘Upload’

image alt text

Then we will select the file that I want to upload

image alt text

Connecting to Dremio

Once the data has been uploaded, log in to Dremio and select add a new source, and then select “Azure Data Lake Store” from the list of available data sources

image alt text

Before connecting we need to do some preparations:

  1. Go to the Azure Portal and select Azure Active Directory from the left navigation bar. Then select App Registrations.
  2. Click on Endpoints
image alt text
  1. Copy the OAUTH 2.0 TOKEN ENDPOINT to the Dremio source form.
image alt text
  1. Go back to the Azure portal and click New registration.
image alt text
  1. Select a Name
  2. Click on the registered app and copy Application ID to Dremio source form.
  3. While in the registered app, select Keys under API Access. Enter a Description and select an expiration. Click Save and copy Value to Dremio source form.
  4. Also, in Data Explorer -> Access provide your app access to read, write and execute data.

Finally, your ADLS form in Dremio should look like this

image alt text

If everything went well, you should see the file on your main screen

image alt text

To see the data, simply click on the listed file and click on “Extract field names” and then “save”

image alt text

Data curation in Dremio

Now, we can begin data preprocessing. First, I need to change the data type for a few of the columns that. Notice that Rating and Reviews are not integer. To change the type, all I have to do is select the ‘ABC’ icon next to the column name and then select the data type I want to use, in this case ‘Float’

image alt text

I also will apply the same procedure to the “reviews” field and the ‘Last Updated’ field.

There are many more operations that you can perform on your dataset using Dremio. We have a great tutorial dedicated to data curation where you can take a look at all the tips and tricks related to data curation in Dremio.

Now, let’s save the dataset and since we don’t have to make any more changes to it, let’s start gaining insights from it!

image alt text

Creating and Connecting a Database in Superset

At this point we are ready to start working with our ADLS data from Superset. Navigate to the Superset instance that you have open in a different tab. On the main screen, select “Sources” and then from the drop-down menu select “Databases”

image alt text

Then select the plus (+) sign on the main screen

image alt text

Now type in the following parameters in the indicated fields

Database: Dremio

SQLAlchemy URI: dremio+pyodbc://[Dremio username]:[Dremio password]@[Dremio host]:31010/dremio

Expose in SQL Lab: check

Allow Run Sync: check

NOTE: Replace the parameters in the SQLAlchemy URI to match the information from your environment. In my case the URI looks like the following:

dremio+pyodbc://lucio:pa$$w0rd]@localhost:31010/dremio

Now, click on “Test Connection”, if everything went well, you should see the following pop-up

image alt text

After all the parameters have been typed in, scroll down and then click ‘Save’

image alt text

If everything went as expected, you will see your newly created source in the “List Databases” menu

image alt text

Adding a new table

At this point we have successfully added a new data source to Supserset, which in this case is Dremio. Now we need to add the table that we want to work with. To do this, first select “Tables” from the “Sources menu”

image alt text

From the new screen, click on the plus sign and then from the database menu, select “Dremio”. Then I’ll go ahead and fill in the rest of the parameters based on the table that I want to use. In this case, is the employee VDS that we created at the beginning of the tutorial.

image alt text

Once you click “Save”, Superset will present you with the list of available datasets

image alt text

Previewing your data

Before we start visualizing data, I’ll run a query to double check the data that I connected to. To do this, I’ll go to the SQL Lab tab in Superset

image alt text

Then I’ll select the database and schema that I want to query and hit “Run query”

image alt text

Exploring your data

Now we can start exploring the data, to do so, head back to the ‘Sources’ drop-down menu and select ‘Tables’. Then, click on the name of the dataset that we added.

image alt text

Superset will present you with a table visualization type by default

image alt text

Now let’s walk through the steps of exploring the data that we are working with. First, I’ll change the default time range filter to ‘No Filter’

image alt text

Then, since I want to know the distribution of apps by category that I have in the dataset, I’ll group by “category” and then select “count(* )” as the metric to use

image alt text

After the changes are done, hit the “Run query” button and then you will be presented with the results based on the parameters selected.

image alt text

Before I continue creating other visualizations, I’ll go ahead and save this table as “Apps by Category”. To name a visualization, simply click on the “- untitled” field and add a name, then hit enter. The visualization will be saved automatically.

image alt text

Now, I’m going to create a different visualization. I want to see what is the average size of each app for each one of the categories that we have in our dataset. To do so, I’ll select the “Bar chart” chart from the visualization options.

image alt text

Then, I will change the metric configuration to represent the “size_MB” column with the AVG aggregate

image alt text

After running the query, the new chart looks like the following

image alt text

Now, I would like to create a filter that I can use later in a dashboard. To do so, change the visualization type to “Filter Box” and will leave the rest of the parameters with their default value.

image alt text

Then I’ll save this element as ‘Filter’

Creating a Dashboard in Superset

Creating dashboards in Superset is very simple and intuitive. The first thing that we need to do is select “Dashboards” in the main menu and then click on the plus sign to create a new one

image alt text

I’ll name it “App Reviews ADLS”, assign an owner and then save it.

image alt text

Once you save it, Superset will present you with the list of dashboards available. I’ll click on the name link which will take me to the dashboard editor.

Here, select “Edit Dashboard”

image alt text

Then select “Your charts & filters”

image alt text

In this screen you will be able to see all the elements available to build the dashboard, in this case, I have three different elements: 2 charts and 1 filter.

To add elements to the dashboard, open the visualization from the charts menu, then click “Save” and select “Add to existing dashboard”

image alt text

Then select “save and go to dashboard”

image alt text

The visualization will be automatically added to the dashboard, here we can continue adding more elements. I’ll click on “Edit dashboard” and finish customizing the dashboard with the rest of the elements that I have available.

image alt text

Once I’m done, I’ll save the changes and I can start slicing the data using the filer that I created. For example, I can change the values of the filter so the dashboard can show me the data about apps that were updated between February and March of 2018

image alt text

Then I can start answering questions such as which was the highest ranking app between February and March 2018.

Conclusion

In this tutorial, we focused our attention on working with a modern BI tool such as Superset and Dremio. We showed you the steps to connect, curate and prepare data from disparate data sources using Dremio. Moreover, we walked through the steps of connecting Supserset to Dremio and then creating several visualizations and a filter that we finally used to create a dashboard.

This is just the tip of the iceberg of the number of great things that you can do with Dremio. Whether you are a data architect or a data consumer, to learn more about Dremio, and experience the benefits of using Dremio Enterprise Edition, please visit Dremio University, a free online Dremio learning platform where you can become a Dremio expert in no time.

Ready to Get Started? Here Are Some Resources to Help

Whitepaper Thumb

Whitepaper

Operating Dremio Cloud Runbook

read more

Webinars

Unlock the Power of a Data Lakehouse with Dremio Cloud

read more
Whitepaper Thumb

Whitepaper

Migration Strategy From Dremio Software to Dremio Cloud

read more
get started

Get Started Free

No time limit - totally free - just the way you like it.

Sign Up Now
demo on demand

See Dremio in Action

Not ready to get started today? See the platform in action.

Watch Demo
talk expert

Talk to an Expert

Not sure where to start? Get your questions answered fast.

Contact Us

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.