Table of Contents
Table of Contents
The “last mile” in many data analytics jobs is the application of a BI or visualization tool. Pictures are powerful and can help to illustrate patterns or summarize data. This tutorial uses Tableau to visualize a dataset created in Dremio. You can do similar visualizations with Power BI, Qlik Sense, and other tools. Even if you’ve never used Tableau, the steps are straightforward and you should be able to follow along.
This is the third tutorial in the series, and we recommend that you complete the first two, Getting Oriented to Dremio and Working with Your First Dataset, in order to familiarize yourself and gain access to the elements we are building upon in this tutorial. To begin, you should have access to a Dremio deployment as well as Tableau. If you don’t have Tableau, a free trial is available at www.tableau.com.
Installing the ODBC Driver
Starting with the Data
First, log in to Dremio and navigate to your “SFIncidents” space, then open your “Incidents” virtual dataset (VDS). You should see the following preview:
NOTE: The following steps use Tableau to plot the data points on a map of San Francisco, and for this to happen successfully you need to rename the X and Y fields in the VDS to Lon and Lat, respectively.
Using Dremio and Tableau Together
Once you have your ODBC driver installed, click on the “Tableau” button at the top of your SFIncidents.incidents virtual dataset.
Dremio will generate a “.tds” file called “SFIncidents.incidents.tds” that will automatically download to your local machine. This is a configuration file that makes it easy to launch Tableau connected to this virtual dataset over ODBC (Note: Tableau establishes a live connection to Dremio, there is no Tableau extract created).
Click the .tds file, which launches the Tableau instance connected to your virtual dataset. If you’re on Windows, you should be prompted to log in to Dremio:
If you’re on Mac, you’ll be prompted to log in after you start to access the data. Enter your credentials for Dremio and you’ll see your virtual dataset in Tableau:
Tableau does a great job of visualizing data, especially geospatial data. To take advantage of this feature, you need to make the Lat and Lon fields dimensions. Click on each, and select “Convert to Dimension”:
Once these become dimensions, you can drag Lon onto your Columns shelf and Lat onto your Rows shelf. You’ll now see your data on a map of San Francisco (alternatively, if you dragged Lat to Columns and Lon to Rows, you’ll see your data in a chart instead of a map):
It can be a little hard to see, but there are over 150,000 blue dots mapped onto the San Francisco peninsula with the surrounding water of the San Francisco Bay. Feel free to zoom in to take a closer look.
Refining Your Dataset
There are lots of different values in the “Category” field in this dataset, so let’s focus on a few that are more interesting.
Go back to your “SFIncidents” space and hover over the “Incidents” VDS. Click on the pencil icon to access the VDS in “editing” mode:
Now click on the dropdown menu for the “Categories” column and select “Keep Only”:
You’ll now see a histogram of the different values in this column based on a sample of the data. Let’s select VEHICLE THEFT. Notice that as you make a selection the data refreshes dynamically in the preview below the histogram to give you immediate feedback on your choices.
Click “Apply.” You should see an updated view of the virtual dataset that only includes incidents from the “VEHICLE THEFT” category.
One of the principles behind Dremio is that users don’t have to be SQL experts to take advantage of its power. As we’ve worked through renaming columns, converting data types, and selecting a subset of the data, we’ve been refining our virtual dataset. Dremio represents these transformations using SQL. If you click the “SQL Editor” button at the top left, a window opens above your data to show you the query that defines this virtual dataset:
The full SQL is:
SELECT CONVERT_TO_INTEGER(IncidntNum, 1, 1, 0) AS IncidntNum, Category, Descript AS Description, DayOfWeek, TO_DATE("SF_incidents2016.json"."Date", 'YYYY-MM-DD', 1) AS "Date", TO_TIME("SF_incidents2016.json"."Time", 'HH:MI', 1) AS "Time", PdDistrict, Resolution, Address, CONVERT_TO_FLOAT(X, 1, 1, 0) AS Lon, CONVERT_TO_FLOAT(Y, 1, 1, 0) AS Lat, Location, PdId FROM Samples."samples.dremio.com"."SF_incidents2016.json" WHERE Category = 'VEHICLE THEFT'
In this case, you built the query that defines this virtual dataset without knowing SQL, but if you happen to know SQL you can work with the query directly.
Now click “Save” at the top.
With this change, any query issued to Dremio on this virtual dataset will only return incidents from the categories you selected. Note that this applies not only to Tableau but to any tool that accesses this data in Dremio.
It is important to note that you can create filters in different client tools, but sometimes it makes a lot of sense to create and maintain these in Dremio so that all tools get the same results. The same idea can be applied to securing access to data — you can limit the columns that are displayed, mask sensitive data, or only show aggregated representations of data using virtual datasets.
Making Your Data Look Great
Now, let’s go back to Tableau to see the changes to our virtual dataset. From the Data menu in Tableau, navigate to the Refresh option:
You’ll see the same map but with just a few thousand data points versus over 150,000:
Notice that we didn’t need to change our connection or our query in Tableau. The same query works, but returns fewer records.
Let’s make our map look a little nicer. From the Map menu select Map Layers:
You’ll now see a Map Layers configuration screen on the left. Change the Background to Normal, and select the options to display Streets and Boundaries (Streets and Highways on MacOS), Zip Code Boundaries, and Zip Code Labels:
Your map should look like this:
These points correspond to vehicle theft on different days of the week, so let’s associate those with colors. Close the Map Layers configuration tab, and drag DayOfWeek from the list of dimensions onto the color tool:
You’ll now see a list of Days of the Week on the right, each associated with different colors:
This tutorial uses JSON data representing over 150,000 police incidents in San Francisco. While this data was stored in Amazon S3, it could just as easily have been in any other data source in your cloud data lake. We were able to easily rename columns, convert data types, and filter down the data to a focused set of incidents to visualize in Tableau.
Ready to Get Started? Here Are Some Resources to Help
What Is a Data Lakehouse?
The data lakehouse is a new architecture that combines the best parts of data lakes and data warehouses. Learn more about the data lakehouse and its key advantages.read more
Simplifying Data Mesh for Self-Service Analytics on an Open Data Lakehouse
The adoption of data mesh as a decentralized data management approach has become popular in recent years, helping teams overcome challenges associated with centralized data architecture.read more