Apache Drill Explained by Dremio
Apache Drill is an open-source SQL execution engine that makes it possible to use SQL to query non-relational databases and file systems. This includes joins between these systems – for example, Drill could allow a user to join ecommerce data stored in Elasticsearch with web analytics data stored in a Hadoop cluster.
Apache Drill was initially inspired by Dremel, a query system built by Google to speed up large-scale interactive data analysis. Dremel was built to run queries using a columnar data representation, which helps keep latency low even when queries run against large data sets.
Drill is mainly focused on Hadoop and cloud storage:
- Hadoop: Apache Hadoop, MapR, and Amazon EMR
- Cloud storage: Amazon S3, Google Cloud Storage, Azure Storage, Swift
Drill is able to support these datastores, and provide good performance, through several key innovations.
Conventional query engines – such as Hive, Impala, and relational databases – need to understand the data structure before they can run any query processes. Drill, on the other hand, supports on-the-fly schema discovery. This is accomplished by compiling and then re-compiling the query during the initial execution. Because the query is defined by the actual data flow, Drill can manage data with evolving schema or even no schema at all.
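The idea behind on-the-fly schema discovery can be illustrated with a small Python sketch. This is not Drill's implementation: the names `infer_schema` and `process_batches` are hypothetical, and the "recompile" step is only counted, not performed. The sketch infers a schema from each batch of records and treats a schema change as the trigger for re-compilation.

```python
from typing import Any, Dict, Iterable, List, Tuple

Schema = Tuple[Tuple[str, str], ...]  # sorted (field name, type name) pairs


def infer_schema(batch: List[Dict[str, Any]]) -> Schema:
    """Derive a schema from the records themselves, not from a catalog."""
    fields: Dict[str, str] = {}
    for record in batch:
        for name, value in record.items():
            # Keep the first type seen for each field (a simplification).
            fields.setdefault(name, type(value).__name__)
    return tuple(sorted(fields.items()))


def process_batches(batches: Iterable[List[Dict[str, Any]]]) -> int:
    """Return how many times the (hypothetical) plan had to be compiled."""
    current: Schema = ()
    compilations = 0
    for batch in batches:
        schema = infer_schema(batch)
        if schema != current:
            # The data no longer matches the compiled plan: recompile.
            current = schema
            compilations += 1
        # ... the compiled operators would run over the batch here ...
    return compilations


batches = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}],                  # same schema, no recompile
    [{"id": 4, "name": "d", "tags": ["x"]}],   # a new field appears
]
print(process_batches(batches))
```

The first batch forces an initial compilation and the third forces a second one, because the `tags` field was not present when the plan was first built.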
Traditional query engines are built on a relational data model, which is limited to flat records controlled by a fixed structure. Apache Drill was designed from the ground up to support complex/semi-structured data found in non-relational datastores such as Hadoop and cloud storage. Drill's internal in-memory data representation is columnar, allowing it to perform low cost SQL processing on complex data without shifting the data to be represented in rows.
Drill allows users to customize the way in which they join tables, either by distributed or broadcast joins:
- Broadcast Join: All relevant records of a file are broadcast to all other nodes before the join is initiated. The inner side of the join is redistributed while the outer side is kept in its original location. A broadcast join is useful when a large table is being joined to a smaller table.
- Distributed Join: Each side of the join is hash distributed using a hash-based distribution operator on the join key. A distributed join is useful when both sides of the join are fairly large.
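The difference between the two strategies can be sketched in Python by modelling each node as a list of rows. This is a simplified illustration, not Drill's operators; the function names and the sample data are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]                 # (join key, payload)
Partitions = List[List[Row]]          # one list of rows per node
Joined = List[Tuple[int, str, str]]


def local_hash_join(left: List[Row], right: List[Row]) -> Joined:
    """Join two row lists that already sit on the same node."""
    index: Dict[int, List[str]] = defaultdict(list)
    for key, payload in right:
        index[key].append(payload)
    return [(k, lp, rp) for k, lp in left for rp in index.get(k, [])]


def broadcast_join(large: Partitions, small: Partitions) -> Joined:
    """Copy the whole small side to every node; the large side stays put."""
    small_all = [row for part in small for row in part]
    out: Joined = []
    for part in large:
        out.extend(local_hash_join(part, small_all))
    return out


def hash_partition(parts: Partitions, n: int) -> Partitions:
    """Redistribute rows so that equal keys land on the same node."""
    result: Partitions = [[] for _ in range(n)]
    for part in parts:
        for key, payload in part:
            result[hash(key) % n].append((key, payload))
    return result


def distributed_join(left: Partitions, right: Partitions) -> Joined:
    """Hash-distribute both sides on the join key, then join node by node."""
    n = len(left)
    out: Joined = []
    for lp, rp in zip(hash_partition(left, n), hash_partition(right, n)):
        out.extend(local_hash_join(lp, rp))
    return out


orders = [[(1, "o1"), (2, "o2")], [(3, "o3"), (1, "o4")]]
customers = [[(1, "alice")], [(2, "bob")]]
print(sorted(broadcast_join(orders, customers)))
print(sorted(distributed_join(orders, customers)))
```

Both strategies produce the same rows. They differ in what moves over the network: the broadcast join ships the small side everywhere, while the distributed join reshuffles both sides.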
You can connect to Apache Drill through the following interfaces:
- Drill Shell: A pure-Java console utility for executing SQL commands
- Drill Web UI: A simple interface for managing Drill. This UI can be launched from any browser pointed at a node in the Drill cluster
- ODBC/JDBC: Drivers that give BI tools such as Tableau, and SQL clients such as SQuirreL, access to Drill
- SQLLine: A JDBC application packaged with Drill
- REST API: Uses HTTP requests to access Drill through the web console
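As an illustration of the REST interface, the following Python sketch builds a query request without sending it. It assumes a Drillbit whose web server listens on `http://localhost:8047` and exposes the `/query.json` endpoint; the helper name `build_drill_query_request` and the sample query against `cp.`employee.json`` are chosen for the example and may need adjusting for a given installation.

```python
import json
import urllib.request


def build_drill_query_request(
    sql: str, base_url: str = "http://localhost:8047"
) -> urllib.request.Request:
    """Build (but do not send) a POST request that submits a SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",   # assumed host and port
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_drill_query_request("SELECT * FROM cp.`employee.json` LIMIT 3")
print(request.get_method(), request.full_url)

# To run the query against a live Drillbit, send the request:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before pointing it at a real cluster.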
Drill relies on a number of different techniques for improved performance.
Drill represents data internally as JSON documents – similar to MongoDB and Elasticsearch. These JSON documents are "shredded" into columns, which allows Drill to deliver the performance enhancements of columnar analytics but retain the ability to query complex data. Note, this internal representation is not based on Apache Arrow.
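A rough sense of what "shredding" means can be given with a short Python sketch. It is a deliberately simplified illustration, not Drill's internal format: nested fields become dotted column paths, missing values are stored as `None`, and repeated (array) fields are left as opaque values. The function names `flatten` and `shred` are hypothetical.

```python
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested objects into dotted paths: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def shred(docs: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Build one value list (column) per path across all documents."""
    flat_docs = [flatten(d) for d in docs]
    paths = sorted({p for d in flat_docs for p in d})
    return {p: [d.get(p) for d in flat_docs] for p in paths}


docs = [
    {"id": 1, "user": {"name": "ann", "city": "Oslo"}},
    {"id": 2, "user": {"name": "bo"}},
]
columns = shred(docs)
for path, values in columns.items():
    print(path, values)
```

A query that touches only `user.name` can then read a single column instead of parsing every document in full.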
Drill compiles queries at runtime, which allows for faster execution than interpreted queries.
Drill takes advantage of modern CPU instruction pipelines to operate on batches of records (vectors) instead of individual records. Vectorized processing is far more efficient on recent CPUs, but many database engines do not take advantage of it.
If a Drillbit is running on each node in a cluster, Drill can structure its execution plan such that data movement over the network is minimized.
Additional Required Middleware
As a SQL execution engine, Drill offers valuable functionality for executing queries against non-relational and filesystem sources. In this capacity, Drill can act as one tier in a data analytics stack, where other tools provide data curation, governance, and other services that are important for IT analysts.
Drill nodes run a service called a "Drillbit", which is responsible for query acceptance and processing. All Drill nodes are the same, which makes installation and monitoring much easier.
A Drill client issues a query to any Drillbit service, and that Drillbit then parses and optimizes the query, and is responsible for its execution. The owning Drillbit parcels the query out to other Drillbits according to its execution plan, and then receives the results and returns them to the client when complete.
Drill offers standard ODBC, JDBC, and REST interfaces to clients, which means that it works with BI tools such as Tableau, Power BI, and Qlik with no additional middleware.
Looking for Faster Queries, Self-Service, or an Integrated Solution?
Dremio is an open-source Data-as-a-Service Platform based on Apache Arrow. Dremio goes beyond Apache Drill to provide an integrated self-service platform that incorporates capabilities for data acceleration, data curation, data catalog, and data lineage on any source.
Run SQL on any data source. Including optimized push downs and parallel connectivity to non-relational systems like MongoDB, Elasticsearch, S3 and HDFS.
Accelerate data. Using Data Reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.
Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.
Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.
Dremio helps you get more value from your data, faster. Try it today.