Accelerating Analytics For Postgres With Dremio
Dremio is a new approach to data analytics that works with many types of data sources and most BI tools, R, Python, and any SQL-based client. Dremio’s solution eliminates the need for traditional ETL, data warehouses, cubes, and aggregation tables, as well as the infrastructure, copies of data, and effort these systems entail. One of the most important characteristics of Dremio is a feature called Dremio Reflections which can dramatically accelerate analytical processing.
In this article, we will first introduce a slow query executed in PostgreSQL. Then we will show how to use Dremio to accelerate the same query. We will walk through the steps of connecting to PostgreSQL, creating a Reflection, and finally showing how the query performance has improved.
To follow this tutorial you should have access to a Dremio installation. If you haven’t done so already, installing Dremio is easy - see the Quickstart for instructions.
Note that Dremio supports ODBC and JDBC for issuing queries from your favorite tool. In this tutorial we will be issuing queries from Dremio’s UX.
Starting Point: A Slow Query in Postgres
Before focusing on Dremio, let’s start with the database table we have in PostgreSQL and the query we need to execute. We have a table in PostgreSQL called dremio_test with 4,000,000 records, each record describes an individual person and has fields like person_id, city, street, weight, height among others. The following query is for a health study where we need to report some metrics of people living on the same street.
1 2 3 4 5 6 7 8 9 10 11 SELECT city, street, COUNT(\*) FROM public.study_member WHERE is_veteran AND kids = false AND street_num <> 'street 10' GROUP BY city, street
In the following image we can see the execution of the query in PostgreSQL with the number of rows returned and the query execution time.
This query executes in PostgreSQL in about 46 seconds. The reason for the low performance is mainly due to the sort done by PostgreSQL in order to solve the group by clause and calculate the aggregation functions. In the following image we can see the output of the “explain analyze” command in PostgreSQL, which shows that the query execution is divided basically into the following 3 phases:
- A sequential scan to solve the filters in the where clause which takes 2.3 seconds (2313 miliseconds in the red circle) and return 25,278 rows.
- A sort of 998,919 rows based on the key: city, street which takes 44 seconds (green box).
- A GroupAggregate phase to calculate aggregation functions, making an accumulated execution time of 45 seconds (blue box).
One approach frequently applied to optimize this kind of query, is to execute it against a different database repository specifically designed to optimize analytical queries, such as a data warehouse. For this approach we would need to provision a server, export the data, apply the necessary transformations and import the resulting data. All these tasks require a very complex, time consuming and expensive project.
This leads us to Dremio, a different approach to making analytical queries on PostgreSQL fast, but without all the complexity.
Creating A PostgreSQL Data Source
In this section, we will first explain how to create a PostgreSQL connection and then we will see how to create a dataset in Dremio based on a query. Let’s start!
Create a connection to PostgreSQL. In Dremio this is a really easy task. We start by pressing the “+” sign marked with a blue circle:
Then the following page shows many data source options. Let’s pick the PostgreSQL option.
After selecting the PostgreSQL option, we need to complete a few data source related fields as the IP address and port of the server we are connecting to, as well as the database name and authentication information. Let’s name this new source “population”:
Finally, after pressing Save we will have a new data source to PostgreSQL created. If we double click on the new data source name, Dremio will show us a list of the tables and views stored in our PostgreSQL server. In our case, the next image shows a list of tables where we can see the table “study_member”. Note that Dremio is showing regular PostgreSQL tables, we didn’t create copies for tables or move any data.
Clicking on any of these tables would open the table and allow us to preview the data, perform any necessary curations, join to other tables, and more. These steps are covered in other tutorials such as Working With Your First Dataset and Data Curation With Dremio.
Executing A Query To PostgreSQL From Dremio
In this section we will first execute a query in Dremio similar to the query we executed via pgAdmin in the first part of the article. To execute a query in Dremio, click the “New Query” and then type the query. Note, that we need to change the table name in the FROM clause in order to follow Dremio’s naming model. Dremio follows a hierarchical namespace model.We named this data source “population,” so the new name of this table is “population.public.study_member”. Let’s see the next image with the query and first resulting rows:
If we go to the “Jobs” section we can explore a detailed report for the query. The next image shows the query, the execution time, along with other details. As we can see, the execution time is still around 47 seconds, so Dremio isn’t adding much latency to the query when it pushes the work down to PostgreSQL.
Now that we’ve seen the query run and return results to Dremio, let’s take a look into PostgreSQL logs to verify Dremio did a push down of the query. In the following image we can see a terminal where the PostgreSQL log file is open. In this file we can see the execution (at PostgreSQL level) of the same query, the query duration (marked in blue) is almost 47 seconds. We also can see marked in green the query text on Dremio and PostgreSQL log and marked in red the execution date on both sides.
Creating A Dataset In Dremio
At this point we will create a “virtual dataset” called “health_study_data” based on the the query we executed. A virtual dataset in Dremio is similar to a view in a relational databases - no data is being copied. And like views, users can query this virtual dataset, including predicates, aggregations, and joins to other datasets. Dremio’s query planner will rewrite queries against the virtual dataset and push them down as appropriate.
Dataset creation in Dremio is really easy. In our query window we can simply click the “Save As” button.
Then we are prompted to name the virtual dataset. Let’s call it “health_study_data”:
After creating the “health_study_data” virtual dataset, we could run an equivalent query by executing the following SQL command:
1 SELECT * FROM health_data.health_study_data
In the next image we can see the query with a dataset referenced in the FROM and after execution the results of the query.
After issuing this simple SELECT *, Dremio does a push down of the original query of the dataset to PostgreSQL. Again, we can see this query is being pushed down into PostgreSQL as is shown in the log:
Using A Reflection To Accelerate A Query In Dremio
In this section, we will execute the same query as the first section, and we will show how Dremio can optimize the query execution by using a Reflection.
Reflections in Dremio are powered by a data acceleration technology based on Apache Arrow, which is a columnar in-memory analytics layer designed to accelerate big data. Dremio’s Reflections are optimized physical data structures that accelerate data and queries automatically.
First, let’s enable Reflections for this virtual dataset. For this task we first select the dataset “health_study_data” and press the configuration symbol marked in red (you can also click this gear from within the virtual dataset):
Then, select “Settings” and the following page is displayed:
Then click on “Reflections”, and the following page is displayed.
Dremio provides two types of Reflections: Raw Reflections and Aggregation Reflections. Describing Reflections in-depth is beyond the scope of this article, but here’s a quick summary.
- Raw Reflections accelerate the underlying data by keeping a copy of the source data with row-level fidelity in a physical format that is highly optimized for analytics and specific workloads.
- Aggregation Reflections provide more advanced optimizations for group by and aggregation workloads.
In this example we will enable both types of Reflections by activating both blue slide buttons showed in the previous image. Remember to press Save!
At this time Dremio will build a Reflection for this virtual dataset by reading the data from PostgreSQL and storing a compressed and columnarized version of the results as Parquet in its reflection store. Depending on the size of your data, this may take anywhere from a few seconds to several minutes.
After these Reflections are ready, this screen now shows us flames indicating the reflections are active. We can also see the size of the footprint, which represents the amount of space consumed by the Reflections on disk.
Note that both types of reflections consume less than 1MB combined. The size of the original table in PostgreSQL is over 1GB. The reason these reflections are less than one 1/1000th of the original size is due to multiple factors:
- This is a subset of the columns in the table.
- This is a subset of the rows based on the query predicates and group by statements.
- Dremio columnarizes and compresses the data, which can reduce storage by 70% or more (depending on many factors)
The point is that Dremio’s reflections typically consume a relatively small percentage of the original data size. Exactly how much will depend on these factors and others.
Once we have Reflections enabled, we can go back to the text query window, by clicking on “New Query,” type or paste the original query and press run, as we can see in the next image.
We can see the same results returned as before, but much more quickly. Note that the results would be the same if we were querying the virtual dataset with
1 SELECT * FROM health_data.health_study_data
After we execute the query if we click on “jobs”, we will see the next print screen where Dremio shows the query text along with the execution time, which for this query is less than 1 second (580ms to be exact)! The flame at the right of the execution time indicates that the query was accelerated by using a Reflection in Dremio.
After enabling Reflections for the “health_study_data” dataset, anytime that Dremio needs to execute a query, an evaluation will be done to determine the optimal alternative to execute the query between using the “health_study_data” dataset or pushing down the query to PostgreSQL data source. In the previous query, Dremio choose to solve the query by using one of the Reflections instead of pushing down the query to PostgreSQL.
At the right part of the screen, we can see a detail report of the query, if we press “Profile”, and then press “Acceleration”, we can see the following print screen where we can confirm the query was accelerated.
In case you’re wondering Dremio used the Raw Reflection for this query, even though the query was an aggregation query. The reason is that there was a group by on state, which is a column that was not included in the aggregation reflection definition. So, in this case, Dremio’s query planner had three potential options: Raw Reflection, Aggregation Reflection, and Push Down. It determined that Aggregation Reflection was not an option, and that Raw Aggregation was a lower cost than Push Down into PostgreSQL.
A query such as:
1 2 3 SELECT street, COUNT(*) FROM health_data.health_study_data GROUP BY street
Would have been optimized by Dremio’s query planner to use the aggregation reflection we created earlier.
Dremio is a data platform from where we can access different data sources, and also allow us to execute queries to integrate BI tools with several data engines or sources. In some cases Dremio can optimize dramatically the execution time of the queries, as we showed in the present article. What’s nice is that the original SQL didn’t need to be changed, and users can optimize their data using Dremio without having to change their tools or the data model they use to query.
In this tutorial we looked at PostgreSQL, but it’s worth noting that Dremio works with many other databases, including relational, NoSQL, and filesystems like Amazon S3 and Hadoop HDFS.