
Accelerating Analytics For Postgres With Dremio

Intro

Dremio is a new approach to data analytics that works with many types of data sources and most BI tools, R, Python, and any SQL-based client. Dremio’s solution eliminates the need for traditional ETL, data warehouses, cubes, and aggregation tables, as well as the infrastructure, copies of data, and effort these systems entail. One of the most important characteristics of Dremio is a feature called Dremio Reflections which can dramatically accelerate analytical processing.

In this article, we will first introduce a slow query executed in PostgreSQL. Then we will show how to use Dremio to accelerate the same query. We will walk through the steps of connecting to PostgreSQL, creating a Reflection, and finally showing how the query performance has improved.

Assumptions

To follow this tutorial you should have access to a Dremio installation. If you haven’t done so already, installing Dremio is easy - see the Quickstart for instructions.

We also think it will be easier if you’ve read Getting Oriented to Dremio and Working With Your First Dataset. If you have questions along the way, don’t hesitate to ask on the Dremio Community Site.

Note that Dremio supports ODBC and JDBC for issuing queries from your favorite tool. In this tutorial we will be issuing queries from Dremio's web UI.

Starting Point: A Slow Query in Postgres

Before focusing on Dremio, let's start with the database table we have in PostgreSQL and the query we need to execute. We have a table in PostgreSQL called study_member with 4,000,000 records; each record describes an individual person, with fields such as person_id, city, street, weight, and height, among others. The following query is for a health study where we need to report some metrics for people living on the same street.
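The exact query appears in the screenshots below; as a sketch, with the filter and aggregate functions assumed from the description above, it has roughly this shape:

-- Sketch only: the predicate and the exact aggregates are assumptions.
SELECT city,
       street,
       COUNT(*)    AS num_people,
       AVG(weight) AS avg_weight,
       AVG(height) AS avg_height
FROM   study_member
WHERE  height > 150   -- assumed predicate
GROUP  BY city, street;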

In the following image we can see the execution of the query in PostgreSQL with the number of rows returned and the query execution time.

Query results and total query time as reported by pgAdmin for PostgreSQL

This query executes in PostgreSQL in about 46 seconds. The poor performance is mainly due to the sort PostgreSQL performs to satisfy the GROUP BY clause and calculate the aggregate functions. In the following image we can see the output of the “explain analyze” command in PostgreSQL, which shows that the query execution is divided into three main phases:

  • A sequential scan to apply the filters in the WHERE clause, which takes 2.3 seconds (2313 milliseconds, in the red circle) and returns 25,278 rows.

  • A sort of 998,919 rows based on the sort key city, street, which takes 44 seconds (green box).

  • A GroupAggregate phase to calculate the aggregate functions, bringing the accumulated execution time to 45 seconds (blue box).

Viewing Postgres query profile data
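You can reproduce this profile yourself by prefixing the query with EXPLAIN ANALYZE; the body below reuses the assumed sketch from above:

-- Produces the same kind of per-phase timing shown in the image.
EXPLAIN ANALYZE
SELECT city, street, COUNT(*), AVG(weight), AVG(height)
FROM study_member
WHERE height > 150   -- assumed predicate
GROUP BY city, street;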

One approach frequently applied to optimize this kind of query is to execute it against a different database repository specifically designed for analytical queries, such as a data warehouse. For this approach we would need to provision a server, export the data, apply the necessary transformations, and import the resulting data. Together, these tasks amount to a complex, time-consuming, and expensive project.

This leads us to Dremio, a different approach to making analytical queries on PostgreSQL fast, but without all the complexity.

Creating A PostgreSQL Data Source

In this section, we will first explain how to create a PostgreSQL connection and then we will see how to create a dataset in Dremio based on a query. Let’s start!

First, we create a connection to PostgreSQL. In Dremio this is a really easy task. We start by pressing the “+” sign marked with a blue circle:

Clicking + in Dremio to create new data source

Then the following page shows many data source options. Let’s pick the PostgreSQL option.

Supported data sources in Dremio, including PostgreSQL

After selecting the PostgreSQL option, we need to complete a few data source fields, such as the IP address and port of the server we are connecting to, the database name, and authentication information. Let’s name this new source “population”:

Creating a new Postgres data source in Dremio

Finally, after pressing Save, our new PostgreSQL data source is created. If we double-click on the new data source name, Dremio will show us a list of the tables and views stored in our PostgreSQL server. In our case, the next image shows a list of tables that includes the table “study_member”. Note that Dremio is showing regular PostgreSQL tables; we didn’t create copies of the tables or move any data.

Viewing all PostgreSQL tables in Dremio

Clicking on any of these tables would open the table and allow us to preview the data, perform any necessary curations, join to other tables, and more. These steps are covered in other tutorials such as Working With Your First Dataset and Data Curation With Dremio.
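If you prefer SQL to browsing, Dremio also exposes catalog metadata through INFORMATION_SCHEMA, so a quick sketch like the following (note that “TABLES” must be quoted) lists what the new source contains:

-- List tables visible under the new "population" source.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_SCHEMA LIKE 'population%';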

Executing A Query To PostgreSQL From Dremio

In this section we will execute a query in Dremio similar to the one we executed via pgAdmin in the first part of the article. To execute a query in Dremio, click “New Query” and then type the query. Note that we need to change the table name in the FROM clause to follow Dremio’s naming model, which uses a hierarchical namespace. We named this data source “population,” so the new name of this table is “population.public.study_member”. Let’s see the next image with the query and the first resulting rows:

Running a SQL query in Dremio and viewing the results
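For reference, in text form the only change from the pgAdmin version is the FROM clause; the body below reuses the assumed sketch from earlier:

-- Dremio path follows <source>.<schema>.<table>.
SELECT city, street, COUNT(*) AS num_people,
       AVG(weight) AS avg_weight, AVG(height) AS avg_height
FROM population.public.study_member
WHERE height > 150   -- assumed predicate
GROUP BY city, street;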

If we go to the “Jobs” section we can explore a detailed report for the query. The next image shows the query and the execution time, along with other details. As we can see, the execution time is still around 47 seconds, so Dremio isn’t adding much latency when it pushes the work down to PostgreSQL.

Confirming query execution in Dremio using the same Postgres SQL

Now that we’ve seen the query run and return results to Dremio, let’s take a look into PostgreSQL logs to verify Dremio did a push down of the query. In the following image we can see a terminal where the PostgreSQL log file is open. In this file we can see the execution (at PostgreSQL level) of the same query, the query duration (marked in blue) is almost 47 seconds. We also can see marked in green the query text on Dremio and PostgreSQL log and marked in red the execution date on both sides.

Confirming SQL push down into Postgres and understanding processing time
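If statement logging isn’t already enabled on your PostgreSQL server, a minimal way to turn it on (standard PostgreSQL settings; requires superuser privileges) is:

-- Log every statement together with its duration, then reload the config.
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();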

Creating A Dataset In Dremio

At this point we will create a “virtual dataset” called “health_study_data” based on the query we executed. A virtual dataset in Dremio is similar to a view in a relational database - no data is being copied. And like views, users can query a virtual dataset with predicates, aggregations, and joins to other datasets. Dremio’s query planner will rewrite queries against the virtual dataset and push them down as appropriate.
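If it helps to think in PostgreSQL terms, saving this virtual dataset is roughly analogous to creating a view over the assumed query from earlier (a sketch only - this is not what Dremio executes internally):

-- Rough relational analogue of the virtual dataset.
CREATE VIEW health_study_data AS
SELECT city, street, COUNT(*) AS num_people,
       AVG(weight) AS avg_weight, AVG(height) AS avg_height
FROM study_member
WHERE height > 150   -- assumed predicate
GROUP BY city, street;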

Dataset creation in Dremio is really easy. In our query window we can simply click the “Save As” button.

Saving a virtual dataset in Dremio based on Postgres table

Then we are prompted to name the virtual dataset. Let’s call it “health_study_data”:

Naming new virtual dataset

After creating the “health_study_data” virtual dataset (saved here in a space called “health_data”), we could run an equivalent query by executing the following SQL command:

SELECT * FROM health_data.health_study_data

In the next image we can see the query with the dataset referenced in the FROM clause, along with the results after execution.

Viewing results of SQL push down into Postgres

After issuing this simple SELECT *, Dremio pushes the dataset’s original query down to PostgreSQL. Again, we can see this query being pushed down into PostgreSQL, as shown in the log:

Confirming Dremio's SQL push down into Postgres

Using A Reflection To Accelerate A Query In Dremio

In this section, we will execute the same query as the first section, and we will show how Dremio can optimize the query execution by using a Reflection.

Reflections in Dremio are powered by a data acceleration technology based on Apache Arrow, a columnar in-memory analytics layer designed to accelerate big data workloads. Dremio’s Reflections are optimized physical data structures that accelerate queries automatically.

First, let’s enable Reflections for this virtual dataset. For this task we first select the dataset “health_study_data” and press the configuration symbol marked in red (you can also click this gear from within the virtual dataset):

Clicking the gear to access virtual dataset configurations

Then, select “Settings” and the following page is displayed:

Accessing Dremio Reflections configuration

Then click on “Reflections”, and the following page is displayed.

Enabling Dremio Reflections on Postgres data

Dremio provides two types of Reflections: Raw Reflections and Aggregation Reflections. Describing Reflections in-depth is beyond the scope of this article, but here’s a quick summary.

  • Raw Reflections accelerate the underlying data by keeping a copy of the source data with row-level fidelity in a physical format that is highly optimized for analytics and specific workloads.

  • Aggregation Reflections provide more advanced optimizations for group by and aggregation workloads.
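As a loose mental model only (not how Dremio actually implements Reflections), an Aggregation Reflection behaves like a maintained summary table over the chosen dimensions and measures:

-- Illustrative analogy; Dremio builds and maintains Reflections itself.
CREATE TABLE health_study_summary AS
SELECT city, street,
       COUNT(*)    AS cnt,
       SUM(weight) AS sum_weight,
       SUM(height) AS sum_height
FROM study_member
GROUP BY city, street;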

In this example we will enable both types of Reflections by activating both blue slide buttons shown in the previous image. Remember to press Save!

At this point Dremio will build Reflections for this virtual dataset by reading the data from PostgreSQL and storing a compressed, columnarized version of the results as Parquet in its Reflection store. Depending on the size of your data, this may take anywhere from a few seconds to several minutes.

After these Reflections are ready, this screen now shows us flames indicating the reflections are active. We can also see the size of the footprint, which represents the amount of space consumed by the Reflections on disk.

Understanding the cost of Dremio Reflections

Note that both types of Reflections consume less than 1MB combined, while the size of the original table in PostgreSQL is over 1GB. These Reflections are less than 1/1000th of the original size due to multiple factors:

  • This is a subset of the columns in the table.

  • This is a subset of the rows based on the query predicates and group by statements.

  • Dremio columnarizes and compresses the data, which can reduce storage by 70% or more, depending on many factors.

The point is that Dremio’s reflections typically consume a relatively small percentage of the original data size. Exactly how much will depend on these factors and others.

Once we have Reflections enabled, we can go back to the query window by clicking “New Query,” type or paste the original query, and press Run, as we can see in the next image.

Testing for acceleration of Postgres table using Dremio

We can see the same results returned as before, but much more quickly. Note that the results would be the same if we were querying the virtual dataset with

SELECT * FROM health_data.health_study_data

After we execute the query, if we click on “Jobs” we will see the following screenshot, where Dremio shows the query text along with the execution time, which for this query is less than 1 second (580 ms, to be exact)! The flame at the right of the execution time indicates that the query was accelerated using a Reflection in Dremio.

Viewing Dremio job to confirm status of acceleration

After enabling Reflections for the “health_study_data” dataset, any time Dremio needs to execute a query against it, the query planner evaluates whether it is cheaper to answer the query using the Reflections or to push the query down to the PostgreSQL data source. In the previous query, Dremio chose to answer the query using one of the Reflections instead of pushing it down to PostgreSQL.

At the right part of the screen we can see a detailed report of the query. If we press “Profile” and then “Acceleration”, we see the following screenshot, where we can confirm the query was accelerated.

Confirming acceleration of Postgres data

In case you’re wondering Dremio used the Raw Reflection for this query, even though the query was an aggregation query. The reason is that there was a group by on state, which is a column that was not included in the aggregation reflection definition. So, in this case, Dremio’s query planner had three potential options: Raw Reflection, Aggregation Reflection, and Push Down. It determined that Aggregation Reflection was not an option, and that Raw Aggregation was a lower cost than Push Down into PostgreSQL.

A query such as:

SELECT street, COUNT(*)
FROM health_data.health_study_data
GROUP BY street

would have been optimized by Dremio’s query planner to use the Aggregation Reflection we created earlier.

Conclusion

Dremio is a data platform that lets us access different data sources and execute queries, integrating BI tools with several data engines and sources. In some cases Dremio can dramatically optimize query execution time, as we showed in this article. What’s nice is that the original SQL didn’t need to be changed, and users can optimize their data with Dremio without having to change their tools or the data model they use to query.

In this tutorial we looked at PostgreSQL, but it’s worth noting that Dremio works with many other databases, including relational, NoSQL, and filesystems like Amazon S3 and Hadoop HDFS.