Dremio Jekyll

Intro

The “last mile” in many data analytics jobs is a BI or visualization tool. Pictures are powerful and can help to illustrate patterns or summarizations of data. In this tutorial we’ll use Tableau to visualize a virtual dataset created in Dremio. You can do similar visualizations with Power BI, Qlik Sense, and other tools. Even if you’ve never used Tableau, the steps are straightforward and you should be able to follow along.

Assumptions

To follow this tutorial you should have access to a Dremio installation, and you should have completed the first two tutorials - Getting Oriented to Dremio, and Working With Your First Dataset.

You also need access to Tableau. If you don’t have Tableau a free trial is available at www.tableau.com. The screenshots in this tutorial are from Tableau running on Windows, but the steps also work for Tableau running on Mac, although the interface is slightly different in some areas.

Installing the ODBC Driver

To connect Tableau to Dremio, we’ll first need to install Dremio’s ODBC driver. Instructions are available for each operating system here.

Starting With The Data

First, log into Dremio and navigate to your “SFIncidents” space, then open your “incidents” virtual dataset. You should now see a preview:

Preview your incidents data

Using Dremio and Tableau Together

Once you have your ODBC driver installed, click on the “Tableau” button on the top of your SFIncidents.incidents virtual dataset, and a small .tds file called SFIncidents.incidents.tds will download to your local machine. This is a configuration file that makes it easy to launch Tableau connected to this virtual dataset over ODBC (note that Tableau is establishing a live connection, and no data is being exported to your local machine).

Click the .tds file, which will launch Tableau connected to your virtual dataset. If you’re on Windows, you should be prompted to log in to Dremio:

Logging into Taleau

If you’re on Mac you’ll be prompted to log in after you start to access the data. Enter your credentials for Dremio and you’ll now see your virtual dataset in Tableau, just as if it were a table in a relational database:

Viewing your data in Tableau

Tableau does a great job of visualizing data, especially geospatial data. To take advantage of this feature, we need to make the Lat and Lon fields dimensions. Click on each, and select “Convert to Dimension”:

Converting measures to dimensions

Once these are dimensions, you can drag Lon onto your Columns shelf and Lat onto your Rows shelf. You’ll now see your data on a map of San Francisco (if you dragged Lat to Columns and Lon to Rows, you’ll see your data in a chart):

Seeing a map of your data

It can be a little hard to see, but those are thousands of blue dots mapped onto the San Francisco peninsula with the surrounding water of the San Francisco Bay. Feel free to zoom in to take a closer look.

Refining Your Dataset

There are lots of types of crime in this dataset, so let’s focus on a few that are more interesting.

Go back to SFIncidents.incidents virtual dataset and select “Keep Only” from the transformation menu for the Categories column:

Selecting Keep Only

You’ll now see a histogram of the different values in this column based on a sample of the data:

Keep Only histogram

Let’s select VEHICLE THEFT. Notice that as you make selections the data refreshes dynamically in the preview below the histogram to give you immediate feedback on your choices. Click “Apply.”

One of the goals of Dremio is that you don’t have to be a SQL jockey to take advantage of its power. As we’ve worked through renaming columns, converting data types, and selecting a subset of the data, we’ve been refining our virtual dataset. Dremio represents these transformations using SQL. If you click the “SQL Editor” button in the top left, a window opens above your data to show you the query that defines this virtual dataset:

SQL window

The full SQL is:

In this case we built the query that defines this virtual dataset without knowing SQL, but if we happen to know SQL we can work with the query directly.

Now click “Save” at the top. With this change, any query issued to Dremio on this virtual dataset will only return incidents from the categories we selected. Note that this applies not only to Tableau but to any tool that accesses this data in Dremio.

This is an important point - you can create these kinds of filters in different client tools, but sometimes it makes a lot of sense to create and maintain these in Dremio so that all tools benefit. The same idea can be applied to securing access to data - you can limit the columns that are displayed, mask sensitive data, or only show aggregated representations of data using virtual datasets.

Making Your Data Look Great

Now, let’s go back to Tableau to see the changes to our virtual dataset. From the Data menu in Tableau, navigate to the Refresh option:

Refresh the data in Tableau

And you’ll see the same map but with just a few thousand point instead of over 150,000:

Map of your data

Notice that we didn’t need to change our connection or our query in Tableau. The same query works, but returns fewer records.

Let’s make our map look a little nicer. From the Map menu select Map Layers:

Making the map look nicer

Now you’ll see a Map Layers configuration screen on the left. Let’s change the Background to Normal, and select the options to display Streets and Boundaries (Streets and Highways on MacOS), Zip Code Boundaries, and Zip Code Labels:

Making the map look nicer

Your map should now look like this:

Nice looking map

These points all correspond vehicle theft on different days of the week, so let’s associate those with colors. Close the Map Layers configuration tab, and drag DayOfWeek from the list of dimensions onto the color tool:

Associating colors with days of the week

You’ll now see a list of Days of the Week on the right, each associated with different colors:

Final map

Conclusion

In this tutorial we started with some JSON data representing over 150,000 police incidents in San Francisco. While this data was stored in Amazon S3, it could just as easily have been in MongoDB or Elasticsearch. We were able to easily rename columns, convert data types, and filter down the data to a focused set of incidents for us to visualize in Tableau.

Next Steps

In our next tutorial we’ll take a look at Combining Data From Multiple Sources in Dremio (Coming Soon). We’ll enrich the police incidents with demographic data as well as a few other sources to make our analysis more interesting.