In this tutorial we’ll work with a sample dataset to give you a sense for the ways Dremio makes it easy to work with data. While Dremio is built to make working with very large datasets fast for analytical processing, in this tutorial we will work with a small dataset to help you get started. We’ll use a single year of incidents as reported by the San Francisco Police Department. This is a JSON file of around 150,000 records.
We encourage you to work through this tutorial. Here’s a video in case you’d rather sit back and watch.
To follow this tutorial you should have access to a Dremio installation. If you haven’t done so already, installing Dremio is easy - see the Quickstart for instructions. This tutorial assumes you’re on version 1.1.0_20170812… or later.
We also think it will be easier if you’ve read Getting Oriented to Dremio. You should at least be connected to the sample data on S3 and have a space set up for this exercise. If you have questions along the way, don’t hesitate to ask on the Dremio Community Site.
Dremio supports most data sources and many file formats. In this tutorial we will be working with data provided by the city and county of San Francisco. Many interesting datasets are maintained by the city, including police incidents for each year, starting in 2003:
|Address||String||9TH ST / MISSION ST|
Let’s start by creating a new space. Click on the plus sign next to Spaces and add a space called “SFIncidents”:
We’ll use this space to work with our virtual datasets. Next, let’s start to work with the Police incidents data stored on S3. Navigate to Samples in the Sources section on the bottom left, then samples.dremio.com:
The white file icon next to each file indicates that Dremio knows this is a file, but it doesn’t yet know its format. Throughout this series of tutorials we’ll be working with a single year of incidents (SF_Incidents2016.json) that includes about 150,000 records. The city provides many years of history in case you’d like to work with a larger dataset later.
To identify the format, hover over the file and to the right you’ll see a folder icon with an arrow pointing to another folder with a grid:
Once you click on this icon, you’ll see a sample of the data and a dialog to confirm the format:
Dremio supports many types of files, including Excel, JSON, Parquet, and others. With some file formats there are configurations to make (eg, field delimiter, line delimiter), but because JSON is self-describing we can simply click “Save” and now you’ll see a sample of the data presented in Dremio’s dataset viewer:
There’s a lot going on in this screen, so let’s go over a few things.
You’ll see a purple dataset icon in the upper left, followed by the full name of the dataset - Samples.”samples.dremio.com”.”SF_incidents2016.json” - which follows Dremio’s hierarchical namespace model: [source name].[space name].[dataset name].
The purple icon means that Dremio understands this is a physical dataset. It also means that we can now query this dataset using SQL from any tool. For example:
1 2 SELECT * FROM Samples."samples.dremio.com"."SF_incidents2016.json"
The data is still in S3 and Dremio hasn’t made a copy or any changes to the original, but in just a few clicks we have opened up this JSON to a wide range of analytical tools. We’ll explore this in more detail later.
One of the basic ideas of Dremio is that users should be able to easily curate the data to suit their own needs, and they should be able to do this without making copies. Let’s do a little of this now by working with data types and column names.
Click on the word “Descript” on the top of the third column. You should now be able to edit the column by simply typing “Description” and clicking return. Now the name of this column will be “Description” for all queries. Note that this made no changes to the source data.
You’ll notice that the purple icon is now green and the name is “New Query.” Instead of making changes to the physical dataset, we are creating what in Dremio is called a virtual dataset. Virtual datasets apply any changes you want to make to the data dynamically, without changing the original data, and without making a copy.
Next, the column called “Date” is actually a string, so let’s change the Date column from a string to a date. On the left side of the column header click on the “Abc” icon to open the data type menu:
Click on “Date…” and you’ll now see a screen with lots of options to help you convert this string to a date. From the list of formats, the first option works for this column. Below, you can see a preview of the data with the original in yellow and the converted column in blue:
This preview is helpful for immediate feedback. If you try some of the other options you’ll see the preview refresh. You can give the new column a name and optionally drop the original column from your virtual dataset, or keep it. Here we’ll replace the original with our new column called Date that is a date type. Click “Apply” and you’ll return to the dataset viewer screen. Notice the data type icon on the Date column is now a calendar instead of “Abc”.
There are a few more changes to make with this dataset. Change the types of the X and Y columns to Float, and rename them to Lon and Lat respectively. After applying these changes your virtual dataset should look like this:
Notice along the upper right a series of grey dots with a yellow dot on the top. This is the version history of your virtual dataset. Hover over each dot to see the changes made along the way, and click on a dot to return to that step in case you want to undo the changes you’ve made to your virtual dataset.
Now let’s save our virtual dataset. Click “Save As…” and you’ll see a list of spaces. Select “SFIncidents” and call this virtual dataset “incidents.”
Now the upper left shows a green dataset icon, and the full name of this virtual dataset is “SFIncidents.incidents”:
Now we’ve successfully created our first virtual dataset, “SFIncidents.incidents.” Without making a copy of the source data, and without writing any SQL, we’ve changed the names of some of the columns and changed some of the data types to suit our own needs.
This virtual dataset is a first class relational object. You can query it with SQL, join it to other physical or virtual datasets, perform aggregations, and so on. For example, you can now click the New Query button at the top and query this virtual dataset with SQL:
1 2 3 4 5 6 SELECT Category, count(*) as "incident count" FROM SFIncidents.incidents WHERE DayOfWeek in ('Monday','Tuesday','Wednesday','Thursday') GROUP BY Category ORDER BY "incident count" DESC LIMIT 5
This query asks for the top 5 most frequent incidents that occur during weekdays. You should see the following results (be sure to click Run instead of Preview):
In this tutorial we worked with JSON files in Amazon S3. In just a few minutes we were able to connect Dremio to a public file and issue a SQL query, all without making a copy of the data, and all without writing any code. We explored the ideas of physical and virtual datasets, which allow users to customize the data they need for their own analytical jobs, without asking IT to first make a copy.
In our next tutorial we will connect a BI tool to Dremio to explore the police incidents data. Even if you’ve never used Tableau, it will be easy to follow Visualizing Your Dataset With Tableau.