Working With Your First Dataset
In this tutorial we’ll work with a sample dataset to give you a sense for the ways Dremio makes it easy to work with data. While Dremio is built to make working with very large datasets fast for analytical processing, this tutorial uses a small dataset representing a single year of incidents as reported by the San Francisco Police Department.
We also have created a video if you would like to sit back and watch.
We recommend you participate in the preceding Getting Oriented to Dremio tutorial prior to beginning this tutorial. To follow along with this tutorial you should have access to a Dremio deployment, be connected to the sample data on Amazon S3, and have a space set up for this exercise. If you have questions along the way, don’t hesitate to ask on the Dremio Community Site.
Dremio supports most data sources and many file formats. In this tutorial we will be working with data provided by the city and county of San Francisco. Many interesting datasets are maintained by the city, including police incidents for each year, starting in 2003:
|Address||String||9TH ST / MISSION ST|
Accessing the Data with Dremio
Let’s begin by creating a new space. Click on the plus sign next to Spaces and add a space called “SFIncidents”:
We’ll use this space to work with our virtual datasets (VDS). Next, let’s start to work with the police incidents data stored on S3. Navigate to Samples in the Sources section on the bottom left. Then click on samples.dremio.com:
The white file icon next to each file indicates that Dremio knows this is a file, but it doesn’t yet know its format. Throughout this series of Dremio tutorials we’ll be working with a single year of incidents (SF_Incidents2016.json) that includes about 150,000 records. The city provides years of historical data in case you’d like to work with a larger dataset at a later time.
To identify the format, hover over the file and to the right you’ll see a folder icon with an arrow pointing to another folder with a grid:
Once you click on this icon, you’ll see a sample of the data and dialog to confirm the format:
Dremio supports many types of files, including Excel, JSON, Parquet, and others. With some file formats there are required configurations (e.g., field delimiter, line delimiter), but because JSON is self-describing we can simply click “Save” and see a sample of the data presented in the Dremio dataset viewer:
The next section navigates and describes the content on this screen.
Understanding the Dataset Viewer
In the upper left corner is a purple dataset icon followed by the full name of the dataset — Samples.samples.dremio.com.SF_incidents2016.json — which follows Dremio’s hierarchical namespace model: [source name].[space name].[dataset name].
The purple icon indicates that Dremio understands this is a physical dataset. It also means that we can now query this dataset using SQL from any tool. For example:
1 2 SELECT * FROM Samples."samples.dremio.com"."SF_incidents2016.json"
The data is still in S3 and Dremio hasn’t made a copy or any changes to the original, but in just a few clicks we have opened up this JSON data to a wide range of analytical tools. We’ll explore this in more detail later.
An essential goal of Dremio is that users should be able to easily prepare data to suit their own needs, and do so without making copies or taxing IT. Let’s start by working with data types and column names.
Building Your First Virtual Dataset
Click on “Descript” at the top of the third column. You should be able to edit the column by simply typing “Description” and clicking return.
Now the name of this column is “Description” for all queries. Note that this made no changes to the source data.
Youmay also notice that the purple icon is now green and the name is “New Query.” Instead of making changes to the physical dataset (PDS), we are creating a virtual dataset , where you can apply any changes to the data dynamically, without altering the original data, and without making a copy.
Next, the column called “Date” is actually a string, so let’s change the Date column from a string to a date. On the left side of the column header click on the “Abc” icon to open the data type menu:
Click on “Date…” and you’ll see a screen with lots of options to help you convert this string to a date. From the list of formats, the first option works for this column. Below, you can see a preview of the data with the original in yellow and the converted column in blue:
This preview enables immediate feedback. If you try some of the other options you’ll see the preview refresh. You can give the new column a name and optionally drop the original column from your virtual dataset, or keep it. In this exercise we’ll replace the original with our new column called Date that is a data type. Click “Apply” and you’ll return to the dataset viewer screen. Notice the data type icon on the Date column replaced “Abc” with a calendar.
There are a few more changes to make with this dataset. Change the types of the X and Y columns to Float, and rename them to Lon and Lat, respectively. After applying these changes your virtual dataset should look like this:
Notice along the upper right of the screen there is a series of gray dots with a yellow dot on the top. This is the version history of your virtual dataset. Hover over each dot to see the changes made along the way, and click on a dot to return to that step in the event you want to undo the changes you’ve made to your virtual dataset.
Now let’s save our virtual dataset. Click “Save As…”
You’ll see a list of spaces. Call this VDS “Incidents,” select the “SFIncidents” space, and then click “Save.”
Now the upper left displays a green dataset icon with the full name of this virtual dataset, “SFIncidents.incidents”:
We’ve successfully created our first virtual dataset, “SFIncidents.incidents.” Without making a copy of the source data or writing any SQL, we’ve changed the names of some of the columns and changed some of the data types to suit our needs.
This virtual dataset is a first-class relational object. You can query it with SQL, join it to other physical or virtual datasets, perform aggregations, and more. For example, you can click the New Query button at the top and query this virtual dataset with SQL:
1 2 3 4 5 6 SELECT Category, count(*) as "incident count" FROM SFIncidents.incidents WHERE DayOfWeek in ('Monday','Tuesday','Wednesday','Thursday') GROUP BY Category ORDER BY "incident count" DESC LIMIT 5
This query asks for and displays the top five most frequent incidents that occur during weekdays (be sure to click Run instead of Preview):
This tutorial worked with JSON files in Amazon S3, and in just a few minutes we were able to connect Dremio to a public file and issue a SQL query, all without making a copy of the data or writing any code. We explored physical and virtual datasets, which allow users to customize the data they need for their own analytical jobs, without asking IT to first make a copy.
In our next tutorial we will connect a BI tool to Dremio to explore the police incidents data. Even if you’ve never used Tableau, it will be easy to follow Visualizing Your Dataset with Tableau.