The Dremio Self-Service Data Platform is a new approach to data analytics that works with virtually any data source and any business intelligence or data science tool. Dremio’s solution eliminates the need for traditional ETL, data warehouses, cubes, and aggregation tables, as well as the infrastructure, copies of data, and effort these systems entail.
One of Dremio’s most important features is Data Reflections, a data acceleration technology that dramatically speeds up analytical processing. Data Reflections are based on Apache Arrow, a columnar in-memory data format optimized for analytics. They are optimized physical data structures that accelerate queries automatically.
In this article, we will show how to connect Dremio with a new data source: Greenplum. Then we will run a simple query against a table residing on Greenplum. Note: Greenplum is not a supported data source at the time of writing. However, you can still experiment with it: because Greenplum uses the Postgres wire protocol and Dremio supports Postgres, Dremio can be used to accelerate queries on Greenplum and to blend its data with other sources.
Pivotal’s Greenplum database product uses massively parallel processing (MPP) techniques. Each computer cluster consists of a master node, standby master node, and segment nodes. All of the data resides on the segment nodes and the catalog information is stored in the master nodes.
Segment nodes run one or more segments, which are modified PostgreSQL database instances and are assigned a content identifier. For each table the data is divided among the segment nodes based on the distribution column keys specified by the user in the data definition language. For each segment content identifier there is both a primary segment and mirror segment which are not running on the same physical host. When a query enters the master node, it is parsed, planned and dispatched to all of the segments to execute the query plan and either return the requested data or insert the result of the query into a database table.
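The distribution and dispatch steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for whatever hash function Greenplum actually uses internally, and the names `segment_for`, `distribute`, and `scatter_gather_count`, as well as the `customer_id` column, are hypothetical.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment content identifier."""
    # Assumption: crc32 is a stand-in for Greenplum's internal hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Group rows by the segment that would own them."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments


def scatter_gather_count(segments):
    """Master-style dispatch: each segment counts locally, the master sums."""
    partial_counts = [len(rows) for rows in segments.values()]
    return sum(partial_counts)


# Hypothetical table with "customer_id" as the distribution column.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
segments = distribute(rows, "customer_id", num_segments=4)
total = scatter_gather_count(segments)
```

The same key always maps to the same segment, which is why the choice of distribution column determines how evenly the data spreads across segment nodes.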
When issuing queries from Dremio to Greenplum, each query is submitted to Greenplum’s master node. With other distributed systems, such as MongoDB and Elasticsearch, Dremio generates its own query plan and pushes query fragments down to each partition of the data (the equivalent of Greenplum segments). If Dremio supports Greenplum in the future, this is one approach that may be implemented to provide additional performance advantages.
Download and install Dremio on your local machine or a cluster of machines using Dremio’s Quick Start Guide for your operating system. Once installed, on Windows and macOS you’ll start and stop Dremio from this interface:
You can press the “Start” button to start Dremio, then press “Open Dremio” to launch Dremio in your default browser.
Log in with your credentials, and the following page is displayed in the browser:
Navigate to Sources and click the “plus” sign. The following page is displayed, where we choose the data source type to use. Because Greenplum is an MPP database engine based on PostgreSQL, click PostgreSQL.
You should see the following form, where you can enter the connection parameters for your Greenplum cluster.
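As a rough guide to the values the form expects, the sketch below assembles the same parameters into a PostgreSQL-style connection URI, the kind any Postgres wire-protocol client would accept. The host name, database, user, and password are placeholders, the helper `greenplum_uri` is hypothetical, and port 5432 is assumed because it is the usual PostgreSQL default; your cluster may differ. Dremio itself takes these values as separate form fields rather than as a URI.

```python
from urllib.parse import quote


def greenplum_uri(host, database, user, password, port=5432):
    """Build a PostgreSQL-style connection URI pointing at the Greenplum master node."""
    # Percent-encode credentials and database name so special characters are safe.
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
        quote(database, safe=""),
    )


# All values below are placeholders, not real connection details.
uri = greenplum_uri("gp-master.example.com", "analytics", "gpadmin", "s3cret/pw")
```

The host should be the master node, since that is where Greenplum accepts client connections and plans queries.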
In the following screen we can see the tables in the Greenplum database. The purple icons indicate these are physical tables we are connected to.
In the following image we execute a query to confirm that we are working with a Greenplum database.
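A query against a table in the new source might be built as sketched below. In Dremio SQL, a table is addressed by a dotted path of source, schema, and table, with double quotes around identifiers. The source name `greenplum`, the schema `public`, and the table `orders` are assumptions for illustration, as are the helper names; substitute whatever appears in your own Sources list.

```python
def quote_identifier(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def dremio_table_path(source, schema, table):
    """Join source, schema, and table into a dotted, quoted path."""
    return ".".join(quote_identifier(part) for part in (source, schema, table))


def preview_query(source, schema, table, limit=10):
    """Build a small SELECT that previews rows from the given table."""
    return "SELECT * FROM {} LIMIT {}".format(
        dremio_table_path(source, schema, table), int(limit)
    )


# Hypothetical source, schema, and table names.
sql = preview_query("greenplum", "public", "orders")
```

The resulting statement can be pasted into Dremio's SQL editor once the names match your environment.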
Dremio is a data platform that allows you to easily access different data sources using standard SQL, and to accelerate query processing by up to 1000x. In this article we connected Dremio to Greenplum, an MPP analytical database. As a next step, you can explore using Data Reflections to accelerate queries on your Greenplum database by following the tutorial Getting Oriented to Data Reflections.