The “last mile” in many data analytics jobs is a BI or visualization tool. Pictures are powerful and can help to illustrate patterns or summarizations of data. In this tutorial we’ll use Tableau to visualize a virtual dataset created in Dremio. You can do similar visualizations with Power BI, Qlik Sense, and other tools. Even if you’ve never used Tableau, the steps are straightforward and you should be able to follow along.
We encourage you to work through this tutorial. Here’s a video in case you’d rather sit back and watch.
You also need access to Tableau. If you don’t have Tableau a free trial is available at www.tableau.com. The screenshots in this tutorial are from Tableau running on Windows, but the steps also work for Tableau running on Mac, although the interface is slightly different in some areas.
To connect Tableau to Dremio, we’ll first need to install Dremio’s ODBC driver. Instructions are available for each operating system here. If you’re using Tableau on macOS, the instructions can be found here.
First, log into Dremio and navigate to your “SFIncidents” space, then open your “incidents” virtual dataset. You should now see a preview:
Once you have your ODBC driver installed, click on the “Tableau” button on the top of your SFIncidents.incidents virtual dataset, and a small .tds file called SFIncidents.incidents.tds will download to your local machine. This is a configuration file that makes it easy to launch Tableau connected to this virtual dataset over ODBC (note that Tableau is establishing a live connection to Dremio, there is no Tableau extract being created).
Click the .tds file, which will launch Tableau connected to your virtual dataset. If you’re on Windows, you should be prompted to log in to Dremio:
If you’re on Mac you’ll be prompted to log in after you start to access the data. Enter your credentials for Dremio and you’ll now see your virtual dataset in Tableau, just as if it were a table in a relational database:
Tableau does a great job of visualizing data, especially geospatial data. To take advantage of this feature, we need to make the Lat and Lon fields dimensions. Click on each, and select “Convert to Dimension”:
Once these are dimensions, you can drag Lon onto your Columns shelf and Lat onto your Rows shelf. You’ll now see your data on a map of San Francisco (if you dragged Lat to Columns and Lon to Rows, you’ll see your data in a chart instead of a map):
It can be a little hard to see, but those are thousands of blue dots mapped onto the San Francisco peninsula with the surrounding water of the San Francisco Bay. Feel free to zoom in to take a closer look.
There are lots of types of crime in this dataset, so let’s focus on a few that are more interesting.
Go back to your SFIncidents space and hover over the incidents virtual dataset. You’ll see a gear. Click on the gear to see a list of options, including Edit:
Click Edit to open the virtual dataset in edit mode. Now select “Keep Only” from the transformation menu for the Categories column:
You’ll now see a histogram of the different values in this column based on a sample of the data:
Let’s select VEHICLE THEFT. Notice that as you make selections the data refreshes dynamically in the preview below the histogram to give you immediate feedback on your choices. Click “Apply.” You should now see an updated view of the virtual dataset that only includes incidents from the VEHICLE THEFT category.
One of the goals of Dremio is that you don’t have to be a SQL jockey to take advantage of its power. As we’ve worked through renaming columns, converting data types, and selecting a subset of the data, we’ve been refining our virtual dataset. Dremio represents these transformations using SQL. If you click the “SQL Editor” button in the top left, a window opens above your data to show you the query that defines this virtual dataset:
The full SQL is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 SELECT IncidntNum, Category, Descript AS Description, DayOfWeek, TO_DATE("SF_incidents2016.json"."Date", 'YYYY-MM-DD', 1) AS "Date", "SF_incidents2016.json"."Time" AS "Time", PdDistrict, Resolution, Address, CONVERT_TO_FLOAT(X, 1, 0, 0) AS Lon, CONVERT_TO_FLOAT(Y, 1, 0, 0) AS Lat, Location, PdId FROM Samples."samples.dremio.com"."SF_incidents2016.json" WHERE (is_convertible_data(X, 1, 'FLOAT')) AND (is_convertible_data(Y, 1, 'FLOAT')) AND (Category = 'VEHICLE THEFT')
In this case we built the query that defines this virtual dataset without knowing SQL, but if we happen to know SQL we can work with the query directly.
Now click “Save” at the top. With this change, any query issued to Dremio on this virtual dataset will only return incidents from the categories we selected. Note that this applies not only to Tableau but to any tool that accesses this data in Dremio.
This is an important point - you can create these kinds of filters in different client tools, but sometimes it makes a lot of sense to create and maintain these in Dremio so that all tools benefit. The same idea can be applied to securing access to data - you can limit the columns that are displayed, mask sensitive data, or only show aggregated representations of data using virtual datasets.
Now, let’s go back to Tableau to see the changes to our virtual dataset. From the Data menu in Tableau, navigate to the Refresh option:
And you’ll see the same map but with just a few thousand points instead of over 150,000:
Notice that we didn’t need to change our connection or our query in Tableau. The same query works, but returns fewer records.
Let’s make our map look a little nicer. From the Map menu select Map Layers:
Now you’ll see a Map Layers configuration screen on the left. Let’s change the Background to Normal, and select the options to display Streets and Boundaries (Streets and Highways on MacOS), Zip Code Boundaries, and Zip Code Labels:
Your map should now look like this:
These points all correspond to vehicle theft on different days of the week, so let’s associate those with colors. Close the Map Layers configuration tab, and drag DayOfWeek from the list of dimensions onto the color tool:
You’ll now see a list of Days of the Week on the right, each associated with different colors:
In this tutorial we started with some JSON data representing over 150,000 police incidents in San Francisco. While this data was stored in Amazon S3, it could just as easily have been in MongoDB or Elasticsearch. We were able to easily rename columns, convert data types, and filter down the data to a focused set of incidents for us to visualize in Tableau.
In our next tutorial we’ll take a look at Combining Data From Multiple Sources in Dremio. We’ll enrich the police incidents with demographic data as well as a few other sources to make our analysis more interesting.