Dremio Jekyll

Working With Your First Dataset

Intro

In this tutorial we’ll work with a sample dataset to give you a sense for the ways Dremio makes it easy to work with data. While Dremio is built to make working with very large datasets fast for analytical processing, this tutorial uses a small dataset representing a single year of incidents as reported by the San Francisco Police Department.

We also have created a video if you would like to sit back and watch.

Assumptions

We recommend you participate in the preceding Getting Oriented to Dremio tutorial prior to beginning this tutorial. To follow along with this tutorial you should have access to a Dremio deployment, be connected to the sample data on Amazon S3, and have a space set up for this exercise. If you have questions along the way, don’t hesitate to ask on the Dremio Community Site.

The Data

Dremio supports most data sources and many file formats. In this tutorial we will be working with data provided by the city and county of San Francisco. Many interesting datasets are maintained by the city, including police incidents for each year, starting in 2003:

Field Type Example
IncidentNum String 170512983
Category String VEHICLE THEFT
Descript String STOLEN AUTOMOBILE
DayofWeek String Saturday
Date String 06/24/2017
Time String 00:30
PdDistrict String SOUTHERN
Resolution String NONE
Address String 9TH ST / MISSION ST
X String -122.414714295579
Y String 37.7762310404758
Location String (37.7762310404758°, -122.414714295579°)
PdId Integer 17051298307021

Accessing the Data with Dremio

Let’s begin by creating a new space. Click on the plus sign next to Spaces and add a space called “SFIncidents”:

image alt text

We’ll use this space to work with our virtual datasets (VDS). Next, let’s start to work with the police incidents data stored on S3. Navigate to Samples in the Sources section on the bottom left. Then click on samples.dremio.com:

image alt text

The white file icon next to each file indicates that Dremio knows this is a file, but it doesn’t yet know its format. Throughout this series of Dremio tutorials we’ll be working with a single year of incidents (SF_Incidents2016.json) that includes about 150,000 records. The city provides years of historical data in case you’d like to work with a larger dataset at a later time.

image alt text

To identify the format, hover over the file and to the right you’ll see a folder icon with an arrow pointing to another folder with a grid:

image alt text

Once you click on this icon, you’ll see a sample of the data and dialog to confirm the format:

image alt text

Dremio supports many types of files, including Excel, JSON, Parquet, and others. With some file formats there are required configurations (e.g., field delimiter, line delimiter), but because JSON is self-describing we can simply click “Save” and see a sample of the data presented in the Dremio dataset viewer:

image alt text

The next section navigates and describes the content on this screen.

Understanding the Dataset Viewer

In the upper left corner is a purple dataset icon followed by the full name of the dataset — Samples.samples.dremio.com.SF_incidents2016.json — which follows Dremio’s hierarchical namespace model: [source name].[space name].[dataset name].

image alt text

The purple icon indicates that Dremio understands this is a physical dataset. It also means that we can now query this dataset using SQL from any tool. For example:

1
2
SELECT *
FROM Samples."samples.dremio.com"."SF_incidents2016.json"

The data is still in S3 and Dremio hasn’t made a copy or any changes to the original, but in just a few clicks we have opened up this JSON data to a wide range of analytical tools. We’ll explore this in more detail later.

An essential goal of Dremio is that users should be able to easily prepare data to suit their own needs, and do so without making copies or taxing IT. Let’s start by working with data types and column names.

Building Your First Virtual Dataset

Click on “Descript” at the top of the third column. You should be able to edit the column by simply typing “Description” and clicking return.

image alt text

Now the name of this column is “Description” for all queries. Note that this made no changes to the source data.

image alt text

Youmay also notice that the purple icon is now green and the name is “New Query.” Instead of making changes to the physical dataset (PDS), we are creating a virtual dataset , where you can apply any changes to the data dynamically, without altering the original data, and without making a copy.

Next, the column called “Date” is actually a string, so let’s change the Date column from a string to a date. On the left side of the column header click on the “Abc” icon to open the data type menu:

image alt text

Click on “Date…” and you’ll see a screen with lots of options to help you convert this string to a date. From the list of formats, the first option works for this column. Below, you can see a preview of the data with the original in yellow and the converted column in blue:

image alt text

This preview enables immediate feedback. If you try some of the other options you’ll see the preview refresh. You can give the new column a name and optionally drop the original column from your virtual dataset, or keep it. In this exercise we’ll replace the original with our new column called Date that is a data type. Click “Apply” and you’ll return to the dataset viewer screen. Notice the data type icon on the Date column replaced “Abc” with a calendar.

There are a few more changes to make with this dataset. Change the types of the X and Y columns to Float, and rename them to Lon and Lat, respectively. After applying these changes your virtual dataset should look like this:

image alt text

Notice along the upper right of the screen there is a series of gray dots with a yellow dot on the top. This is the version history of your virtual dataset. Hover over each dot to see the changes made along the way, and click on a dot to return to that step in the event you want to undo the changes you’ve made to your virtual dataset.

image alt text

Now let’s save our virtual dataset. Click “Save As…”

image alt text

You’ll see a list of spaces. Call this VDS “Incidents,” select the “SFIncidents” space, and then click “Save.”

image alt text

Now the upper left displays a green dataset icon with the full name of this virtual dataset, “SFIncidents.incidents”:

image alt text

We’ve successfully created our first virtual dataset, “SFIncidents.incidents.” Without making a copy of the source data or writing any SQL, we’ve changed the names of some of the columns and changed some of the data types to suit our needs.

This virtual dataset is a first-class relational object. You can query it with SQL, join it to other physical or virtual datasets, perform aggregations, and more. For example, you can click the New Query button at the top and query this virtual dataset with SQL:

1
2
3
4
5
6
SELECT Category, count(*) as "incident count"
FROM SFIncidents.incidents
WHERE DayOfWeek in ('Monday','Tuesday','Wednesday','Thursday')
GROUP BY Category
ORDER BY "incident count" DESC
LIMIT 5

This query asks for and displays the top five most frequent incidents that occur during weekdays (be sure to click Run instead of Preview):

image alt text

Conclusion

This tutorial worked with JSON files in Amazon S3, and in just a few minutes we were able to connect Dremio to a public file and issue a SQL query, all without making a copy of the data or writing any code. We explored physical and virtual datasets, which allow users to customize the data they need for their own analytical jobs, without asking IT to first make a copy.

Next Steps

In our next tutorial we will connect a BI tool to Dremio to explore the police incidents data. Even if you’ve never used Tableau, it will be easy to follow Visualizing Your Dataset with Tableau.