Dremio Jekyll

Getting Started With Data Reflections

Introduction

Dremio includes an innovative, patent-pending acceleration technology called Data Reflections.

In this tutorial we’ll explain how Data Reflections work, and how to use them with your favorite datasets.

Assumptions

We assume you have Dremio 1.3 or later installed, and that you have a basic understanding of physical datasets, virtual datasets, and SQL. While data reflections are designed to accelerate queries from any tool, for this tutorial we’ll be issuing queries through Dremio’s SQL console to keep things simple. Feel free to use your own SQL or BI tool if you’d like.

Why Data Reflections?

Analytical workloads typically involve large datasets and resource-intensive operations. Data analysts and data scientists require interactive query performance to do their jobs effectively - the nature of many analytical jobs is iterative as one answer leads to the next question. When these individuals are unable to move to the next question at the speed of thought, their work is impacted. Data Reflections are one of the ways Dremio provides high performance access to data of any size, any format, and any source.

How Data Reflections Work

Data Reflections are similar to indexes in relational databases. They provide Dremio’s cost-based query optimizer useful information that can result in dramatically more cost-effective query plans than performing query push-downs to the data source. Data Reflections organize and optimize the data close to Dremio’s query execution engine, using techniques such as columnarization, compression, sorting, partitioning, and aggregating data. All reflections persist in Dremio’s Reflection Store, which is configurable for each deployment to use cloud object stores such as Amazon S3, Hadoop’s HDFS, a NAS or SAN, or even local attached storage that is mounted across the Dremio nodes.

Just as indexes in a relational database, Data Reflections are invisible to end users - they don’t need to connect to a different resource or change their queries. You can add Data Reflections and Dremio’s query planner will start to use them automatically, without changing the behavior of your end users.

There are two types of Data Reflections:

  • Raw Reflections. These accelerate all types of queries and allow you to optimize performance on a subset of rows or columns for any given data source, or the full dataset.
  • Aggregation Reflections. These further accelerate BI-style queries that involve aggregations and group by operations. They can also be configured to work on a subset of the rows or columns of a data source.

In this tutorial we’ll take a closer look at both types of Data Reflections.

Creating A Raw Data Reflection

Let’s start by taking a look at how a query performs using a standard push-down. If you haven’t already, create the Sample Source data source in Dremio (you can skip ahead a few steps if you have). You can find it as one of the options on the Add Source dialog:

Add the Sample source in Dremio

Now you’ll see the S3 bucket called samples.dremio.com:

View the samples.dremio.com S3 bucket on AWS

Click on that folder to see a few files provided as samples. We’ll use the SF_Incidents2016.json data source for this tutorial, so click on that file:

Selecting the SF_Incidents2016.json file

You’ll see a sample of the JSON, click OK to confirm the format. Now you’ll see a preview of this physical dataset:

Preview the JSON

Click the data type menu option on the left column header for the Date column to convert this string to a date:

Convert the Date field to a date data type

Now save this as a virtual dataset called “SFIncidents” in one of your spaces.

Open a new query window and enter the following query and click Run. You’ll need to change the space name to work in your local environment:

1
2
3
SELECT PdDistrict, DayOfWeek, Category, Resolution
FROM "@kelly".SFIncidents
WHERE '2016-02-01' <= SFIncidents."Date" AND '2016-03-01' > SFIncidents."Date"

This query returns 12,092 records and will take a few seconds to run, depending on where you’re running Dremio and your internet connection speed. In this tutorial the Dremio cluster is running on Google Cloud Platform and the connection to S3 is pretty fast, but your results may be significantly slower.

Let’s look at the job profile for this query to see whether it was accelerated. Open a new tab in your browser with the Jobs open:

View Dremio's Job history

Let’s take a closer look at the query we just ran. You can see the SQL that was executed and some information about how long the query took to execute. Let’s click on the Acceleration tab of this job to see more info:

Open the acceleration tab for the query profile

Here Dremio tells us the query was not accelerated because no matching reflections were available. This is to be expected. Now that we have our bearings, let’s go create a reflection to accelerate queries on this dataset.

Navigate back to the virtual dataset SFIncidents and click the configuration gear icon to open us the settings menu. Click on the Reflections option on the left. Dremio will first sample the data to determine suggestions for configuring reflections on this dataset, which may take a few moments. Once this is complete you should see:

Suggested Data Reflections from Dremio

Here there are two types of reflections suggested: Raw Reflections and Aggregation Reflections. Let’s turn on Raw Reflections for now and then come back to Aggregation Reflections later. Toggle the button for Raw Reflections over to the right until it turns blue, then click Save:

Toggling Raw Reflctions to enable the setting

This will start a Data Reflection build process that may take several minutes to run depending on your internet connection. Dremio will read the dataset from its source and save the Data Reflection as a Parquet file in Dremio’s Reflection Store. You can see the job as it progresses if you look at the Jobs tab and select the Acceleration job type (not selected by default). You should see a job running as the $dremio$ user that looks something like this:

Creating the data reflection, viewing the job

And if you look at this job you should see more details about it:

More details about the job

There’s a Show Reflection button you can click to see more details about the reflection. You’ll now see a small flame next to the Raw Reflections section as well as the footprint size:

Enabled reflection shows flame next to footprint size

The footprint tells you how much space this data reflection consumes, in this case 4.33MB. In some cases you will see a second value represented in parentheses which represents the total size of all versions of this reflection - periodically data reflections are refreshed and expired versions are only cleared from the reflection store once a day. The value in parentheses is the total space consumed by all versions of this particular reflection.

The original size of this JSON file is 50.3MB. By creating a raw reflection on this physical dataset, we included all rows and all columns in the reflection, and the resulting size in Parquet is about 90% smaller than the JSON. Another thing that has happened is that we have located the reflection much closer to Dremio so it is no longer necessary to read the data over the network. These are just two ways Dremio can accelerate queries on this data - by organizing it in a format that is much more efficient for scanning, and by moving it closer to Dremio’s distributed processing environment. We’ll explore other ways Dremio can accelerate queries using reflections a little later in this tutorial.

Now let’s try our query again. Open a new query window and paste the query and click Run:

1
2
3
SELECT PdDistrict, DayOfWeek, Category, Resolution
FROM "@kelly".SFIncidents
WHERE '2016-02-01' <= SFIncidents."Date" AND '2016-03-01' > SFIncidents."Date"

Your query should be significantly faster this time. If it wasn’t be sure you clicked Run instead of Preview (Preview queries never use Reflections).

You can now go back to the Jobs tab to see whether this query was accelerated. You should see something like this:

Viewing whether the query was accelerated

Note the flame icon next to the status at the top of the job. This indicates the query was accelerated by Dremio’s query planner. You can also see on this overview page that it was accelerated by a reflection called “Unnamed Reflection” and the age of the reflection, which is 16m:55s in this example. The horizontal bars tell you this was a Raw Reflection (Aggregation Reflections use a different icon).

Great! Now you’ve created your first data reflection. From now on, queries against this virtual dataset should be accelerated rather than scanning the original JSON data on S3.

Dremio’s query planner can also use this reflection for other virtual datasets that are based on the same physical dataset. Let’s try this by navigating to our virtual dataset SFIncidents, making some changes, then saving it as a new virtual dataset.

Open SFIncidents and drop all the columns to the right of Address. You can click on the SQL editor window and should see the following SQL:

1
2
SELECT IncidntNum, Category, Descript, DayOfWeek, SFIncidents."Date" AS "Date", SFIncidents."Time" AS "Time", PdDistrict, Resolution, Address
FROM "@kelly".SFIncidents

Save this as a virtual dataset called SFIncidents_less. Now open a new query window and use the same query we used for SFIncidents, but modified to query SFIncidents_less instead:

1
2
3
SELECT PdDistrict, DayOfWeek, Category, Resolution
FROM "@kelly".SFIncidents_less
WHERE '2016-02-01' <= SFIncidents_less."Date" AND '2016-03-01' > SFIncidents_less."Date"

Now let’s open up the job to see if this query was accelerated. Indeed it was:

Confirming the query was accelerated

Dremio’s query planner understands the relationship of virtual datasets to one another and their underlying physical datasets. This means that Data Reflections can be used to accelerate a wide range of queries across many different virtual datasets.

Now let’s take a closer look at Aggregation Reflections

Exploring Aggregation Reflections

Let’s go back to our SFIncidents dataset and click on the gear icon to access the settings menu. If we look at the suggestions for Aggregation Reflections we can see that Dremio has analyzed the data types and cardinality of the fields in this JSON document to suggest several fields as dimensions and measures:

Viewing aggregation reflection settings

In this dataset there is only one numeric field, and so there is only one recommended measure. Let’s drag the Category field into the list of Dimensions to have that included in our Aggregation Reflection. Then toggle the switch for Aggregation Reflections and click Save. Once that job has completed you should see something like the following:

Seeing the data reflection was created successfully

The footprint of the Aggregation Reflection is only 1MB. This is because the reflection only includes five of the original fields, and because the data is aggregated according to the distinct values in these fields. The size ratio of a raw reflection and aggregation reflection will vary based on these factors, but typically aggregation reflections are much smaller than raw reflections.

You might have noticed that the process of building this aggregation reflection was much faster than building the raw reflection. This is because Dremio’s query planner was able to read from the raw reflection in order to create the aggregation reflection. In fact, for periodic refreshes of reflections, Dremio can minimize the number of times it reads from the physical source in order to update many data reflections efficiently. If you were creating the aggregation reflection first it would have taken just as long to create as the raw reflection, even though the final size is much smaller. The reason is that Dremio would need to read all the data in the dataset in order to compute the aggregated values.

Now let’s try an aggregation query to see how Dremio’s query planner handles this type of query. Open a new query window and enter:

1
2
3
4
SELECT Category, COUNT(\*) AS total
FROM "@kelly".SFIncidents
GROUP BY Category
ORDER BY total DESC

Now navigate over to the Jobs tab and take a look at the query profile:

Viewing the query profile for the aggregation query

We can see that this query was accelerated by another Unnamed Reflection, but this time the icon is different - it looks like a bar chart. This means an aggregation reflection was used for the query. Let’s take a closer look by clicking on the Acceleration tab of the query profile:

Checking the acceleration tab on the query profile

Here we can see that both reflections were considered by Dremio’s query planner. Ultimately the aggregation reflection was used because it was determined by Dremio’s cost-based optimizer to have a lower cost than the raw reflection. We can see that the raw reflection was not chosen because it was “too expensive.”

Advanced Data Reflection Menus

If we go back to the Data Reflections menu we can see there is a Switch to Advanced button on the upper right. Let’s click that to take a look at the Advanced menu (note, once you do this you can’t go back to the basic menu).

Using Dremio's advanced data reflection settings

You can see there are many more options here. For example, you can name each reflection, which may be helpful in understanding the pattern of query for which it was created. In fact, go ahead and name this reflection “All columns”.

You can also select a subset of the columns for a raw reflection. This can be especially helpful for tables with many, many columns when only a subset are queried at any given time. You can also sort and partition data in the reflection to speed up certain query patterns. We’ll look at these in more detail in a more advanced tutorial, but for now let’s try adding one more reflection sorted on a specific column.

Click the New Reflection button on the right and enable all the columns except those after the Address column. Also, click the Sorted cell for the Date field to make this new reflection sorted by Date, which will helpful in certain query patterns, such as those that sort on this field or range-based predicates on this field. Name this reflection “Sort by Date,” then click save to create this new reflection. When this job is finished and you re-open the reflections screen for this dataset you should see:

Viewing the dataset reflections once the job has completed

Now let’s go back to try another query. Open a new query window and enter the following SQL, which is on SFIncidents_less, but now includes an ORDER BY clause on the date field.

1
2
3
4
SELECT PdDistrict, DayOfWeek, Category, Resolution
FROM "@kelly".SFIncidents_less
WHERE '2016-02-01' <= SFIncidents_less."Date" AND '2016-03-01' > SFIncidents_less."Date"
ORDER BY SFIncidents_less."Date" DESC

Dremio should use the new reflection we created as the data is already sorted in the reflection, and so the performance should be significantly faster. Let’s look at the query profile to find out:

Viewing the query profile now that the new reflection has been created

Here we can see that our new reflection called “Sort by Date” was used to accelerate the query. Dremio considered our other two reflections as well. The other raw reflection called “All columns” was not used because it was too expensive compared to the new reflection that sorts the data. The aggregation reflection was not used because it did not cover the query. This means that the data available in this data reflection could not be used to resolve the query, which makes sense in this case because the data is aggregated. Another example of when you might see this message is if your reflection only included a subset of the columns, and your query asked for a column that wasn’t in the reflection.

We can see this by issuing a query that requests columns we know are not in the new reflection:

1
2
3
4
SELECT PdDistrict, DayOfWeek, Category, Resolution, PdId
FROM "@kelly".SFIncidents
WHERE '2016-02-01' <= SFIncidents."Date" AND '2016-03-01' > SFIncidents."Date"
ORDER BY SFIncidents."Date" DESC

And the query profile shows us that the All Columns reflection was used:

Viewing the query profile

The other two reflections did not cover this particular query.

Summary

In this tutorial we’ve worked through the basics of Data Reflections in Dremio, a powerful feature that can dramatically accelerate queries on diverse datasets from many different databases and file systems. We showed how to create data reflections and showed a number of different queries and how they are accelerated by investigating the query profiles in Dremio’s job history. We explored the difference between Raw and Aggregation reflections and showed how multiple data reflections can be used on the same physical dataset. Finally, we showed how derivative virtual datasets can benefit from “upstream” data reflections, all invisibly to end users and their favorite tools.