Dremio Jekyll

Unlocking Data-as-a-Service for Apache Superset using Dremio

Introduction

Superset is a modern BI web application project that is in the incubating stages at The Apache Software Foundation. It is an open source project that provides users with an intuitive, visual and interactive data exploration platform. Some of the key features that Superset offer are:

  • Over 30 types of visualizations
  • Druid.io integration
  • Easy to use constructor for visualizations
  • Easy to share and collaborate on dashboards
  • Enterprise-ready authentication
  • A simple semantic layer that allows users to decide which fields they want to use in their visualizations
  • And much more.

In this tutorial we will explore how can we unlock Data-as-a-Service for Superset using Dremio. We will connect Dremio to two different data sources and then after preparing and curating the data, we will connect Superset to Dremio and generate a dashboard. Without further ado, let’s dive right in!

Assumptions

For this tutorial we expect that you have Superset installed and configured, if you haven’t done so, take a look at the Superset installation documentation, there you will find detailed information about pre-requisites, and installation steps for different operating systems.

  • Be familiar with the fundamentals of Dremio (we will be using Dremio 3.2)
  • Have the latest version of python and pip installed

Installing Superset

For this tutorial, I will be installing Supserset on Mac OSX Mojave. If you need details about how to install it on a different OS, take a look at their documentation page, they have very well detailed information about how to do so.

First, we need to update pip

1
pip install --upgrade setuptools pip

Then follow these steps to install Superset

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Install superset

pip install superset

# Create an admin user (you can hit enter in each entry to leave values as default, however, you will need to set an admin password)

fabmanager create-admin --app superset

# Initialize the database

superset db upgrade

# Load sample data

superset load_examples

# Create default roles and permissions

superset init

# To start a development web server on port 8088, use -p to bind to another port

superset runserver -d

Once all the steps are completed, open a new tab in your browser and go to http://localhost:8088/ here you will be presented with the login screen

image alt text

If you left the default values when you were installing Superset, just type admin and the password that you chose in the respective fields.

Before we continue on Superset, let’s switch gears for a moment and get our data ready in Dremio.

Connecting to your data sources

In this scenario I’ll go ahead and play the role of a data analyst, as such, I would like to analyze data about the employees that we currently have in our company (disclaimer: the datasets that I’m using are fictional).

First thing I need to do is find the employees dataset. Since I’m familiar with the data source, I can navigate directly to the datasource and find the table that I need to work with which I know is stored in the PG datasource.

image alt text

However, if you are not familiar with the data, Dremio provides the option to search the catalog for the dataset that you need. To do so, simply type the word that you want to search by in the “search catalog” entry and hit enter. I’ll go ahead and search for “employees” and Dremio will automatically provide a list of available datasets that match the string I typed.

image alt text

Select the dataset that you want to preview

image alt text

Once we verify that the values of the fields are the ones that we are expecting, we can move into doing the preparation of the data.

Prepping the data

In this scenario, to have a complete dataset, I need to join my ‘employees’ Virtual Dataset (VDS) with data about the departments where each one of these employees work. The catch is that this dataset is located in SQL Server. In an ordinary environment, I would have to put together an ETL process to extract the data from each data source and then join both of them and then copy the results into a different data set so I can then analyze it.

Let’s get that done in Dremio with just a few clicks.

First, click “join” on the data preview screen

image alt text

Here you have several options, you can create a custom join with any data set, or you can accept any of the suggestions that Dremio is proposing. Additionally, you also have the option to select the type of join to be done: Inner, outer, etc.

I’ll select one of Dremio’s suggestions, inner join with the ‘Departments’ dataset located in SQL Server

image alt text

After clicking ‘Apply’, Dremio will present us with a virtual dataset that contain all the information we need for this analysis.

image alt text

Now let’s save the dataset. I’ll click “Save as”, give it a name and select the space where I want to save it.

image alt text

If there are any changes that you want to do to the data, this is the moment to do so. In our Data Curation with Dremio tutorial, we explain in detail all the tasks that you can perform when trying to prepare your datasets. I encourage you to take a look. For know, let’s get our connection with Superset ready.

Creating and Connecting a Database in Superset

At this point you should have Superset open in a different tab. On the main screen, select “Sources” and then from the drop-down menu select “Databases”

image alt text

Then select the plus (+) sign on the main screen

image alt text

Now type in the following parameters in the indicated fields

Database: Dremio

SQLAlchemy URI: dremio+pyodbc://[Dremio username]:[Dremio password]@[Dremio host]:31010/dremio

Expose in SQL Lab: check

Allow Run Sync: check

NOTE: Replace the parameters in the SQLAlchemy URI to match the information from your environment. In my case the URI looks like the following:

dremio+pyodbc://lucio:pa$$w0rd]@localhost:31010/dremio

Now, click on “Test Connection”, if everything went well, you should see the following pop-up

image alt text

After all the parameters have been typed in, scroll down and then click ‘Save’

image alt text

If everything went as expected, you will see your newly created source in the “List Databases” menu

image alt text

Adding a new table

At this point we have successfully added a new data source to Supserset, which in this case is Dremio. Now we need to add the table that we want to work with. To do this, first select “Tables” from the “Sources menu”

image alt text

From the new screen, click on the plus sign and then from the database menu, select “Dremio”. Then I’ll go ahead and fill in the rest of the parameters based on the table that I want to use. In this case, is the employee VDS that we created at the beginning of the tutorial.

image alt text

Once you click “Save”, Superset will present you with the list of available datasets

image alt text

Previewing your data

Before we start visualizing data, I’ll run a query to double check the data that I connected to. To do this, I’ll go to the SQL Lab tab in Superset

image alt text

Then I’ll select the database and schema that I want to query and hit “Run query”

image alt text

Exploring your data

Now we can start exploring the data, to do so, head back to the ‘Sources’ drop-down menu and select ‘Tables’. Then, click on the name of the dataset that we added.

image alt text

Superset will present you with a table visualization type by default

image alt text

Now let’s walk through the steps of exploring the data that we are working with. First, I’ll change the default time range filter to ‘No Filter’

image alt text

Then, since I want to know how many employees I have per department, I’ll group by “department_name” and then select “count( * )” as the metric to use

image alt text

After the changes are done, hit the “Run query” button and then you will be presented with the results based on the parameters selected.

image alt text

Before I continue creating other visualizations, I’ll go ahead and save this table as “Employees by Department”. To name a visualization, simply click on the “- untitled” field and add a name, then hit enter. The visualization will be saved automatically.

image alt text

Now, I’m going to create a different visualization. I want to see what is the average salary for each one of the departments that we have in our dataset. To do so, I’ll select the “Bar chart” chart from the visualization options.

image alt text

Then, I will change the metric configuration to represent the “salary” column with the AVG aggregate

image alt text

After running the query, the new chart looks like the following

image alt text

Now, I would like to create a filter that I can use later in a dashboard. To do so, change the visualization type to “Filter Box” and will leave the rest of the parameters with their default value.

image alt text

Creating a Dashboard in Superset

Creating dashboards in Superset is very simple and intuitive. The first thing that we need to do is select “Dashboards” in the main menu and then click on the plus sign to create a new one

image alt text

I’ll name it “Acme Employees”, assign an owner and then save it.

image alt text

Once you save it, Superset will present you with the list of dashboards available, in this case the only one is “Acme Employees”, I’ll click on the name link which will take me to the dashboard editor.

Here, select “Edit Dashboard”

image alt text

Then select “Your charts & filters”

image alt text

In this screen you will be able to see all the elements available to build the dashboard, in this case, I have three different elements: 2 charts and 1 filter.

To add elements to the dashboard, open the visualization from the charts menu, then click “Save” and select “Add to existing dashboard”

image alt text

Then select “save and go to dashboard”

image alt text

The visualization will be automatically added to the dashboard, here we can continue adding more elements. I’ll click on “Edit dashboard” and finish customizing the dashboard with the rest of the elements that I have available.

image alt text

Once I’m done, I’ll save the changes and I can start slicing the data using the filer that I created. For example, I can change the values of the filter so the dashboard can show me the data about employees who were hired between 1987 and 2005.

image alt text

Then I can start answering questions such as the average monthly salary for the Sales department between 1995 and 2005 was $8K and the team had 29 employees.

Conclusion

We focused our attention in this tutorial on working with a modern BI tool such as Superset and Dremio. We showed you the steps to connect, curate and prepare data from disparate data sources using Dremio. Moreover, we walked through the steps of connecting Supserset to Dremio and then creating several visualizations and a filter that we finally used to create a dashboard.

We hope you found this tutorial useful. To learn more about Dremio, and experience the benefits of using Dremio Enterprise Edition, please visit Dremio University, a free online Dremio learning platform where you can become a Dremio expert in no time.