
Apache Drill Explained by Dremio

Apache Drill is an open-source SQL execution engine that makes it possible to use SQL to query non-relational databases and file systems. This includes joins between these systems – for example, Drill could allow a user to join ecommerce data stored in Elasticsearch with web analytics data stored in a Hadoop cluster.

Apache Drill Explained

Apache Drill was initially inspired by Dremel. Dremel is a query system built by Google, designed to speed up large scale interactive data analysis. Dremel was built to run queries using a columnar data representation, which helps in maintaining low latency, even when run against large data sets.

Drill is mainly focused on Hadoop and cloud storage:

  • Hadoop: Apache Hadoop, MapR, and Amazon EMR
  • Cloud storage: Amazon S3, Google Cloud Storage, Azure Storage, Swift


Drill is able to support these datastores, and provide good performance, through several key innovations.

Schema Discovery

Conventional query engines – such as Hive, Impala, and relational databases – need to know the structure of the data before they can process a query. Drill, on the other hand, supports on-the-fly schema discovery: it compiles and then re-compiles the query during initial execution. Because execution is driven by the actual data, Drill can manage data with an evolving schema, or even no schema at all.
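As a rough sketch of the idea (plain Python with invented names, not Drill's actual internals), a scanner can infer a schema from each record as it streams by and treat any change as a trigger to recompile the plan:

```python
# Hypothetical sketch of on-the-fly schema discovery: infer a schema per
# record and "recompile" whenever the observed schema changes mid-stream.

def infer_schema(record):
    """Map each field name to the name of its Python type."""
    return tuple(sorted((k, type(v).__name__) for k, v in record.items()))

def scan(records):
    """Process records, counting plan 'recompiles' as the schema evolves."""
    current, recompiles, out = None, 0, []
    for record in records:
        schema = infer_schema(record)
        if schema != current:                # schema changed mid-stream:
            current, recompiles = schema, recompiles + 1  # adopt + recompile
        out.append(record)
    return out, recompiles

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c", "tags": ["x"]},   # schema evolves here
]
results, recompiles = scan(rows)
print(recompiles)  # 2: one initial compile plus one recompile
```

A real engine would recompile generated operator code rather than a counter, but the control flow is the same: the plan follows the data, not a declared schema.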

Columnar Representation

Traditional query engines are built on a relational data model, which is limited to flat records controlled by a fixed structure. Apache Drill was designed from the ground up to support complex/semi-structured data found in non-relational datastores such as Hadoop and cloud storage. Drill's internal in-memory data representation is columnar, allowing it to perform low cost SQL processing on complex data without shifting the data to be represented in rows.


Join Strategies

Drill allows users to choose how tables are joined, using either distributed or broadcast joins:

  • Broadcast Join: All records of the inner side of the join are broadcast to every other node before the join is initiated, while the outer side is kept in its original location. A broadcast join is useful when a large table is being joined to a much smaller table.
  • Distributed Join: Each side of the join is hash-distributed across the nodes using a hash-based distribution operator on the join key. A distributed join is useful when both tables are fairly large.
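The two strategies can be illustrated with a toy simulation (illustrative Python, not Drill code) that routes rows to nodes by hashing the join key or by copying the small side everywhere:

```python
# Toy simulation of join distribution: hash-partition the large side,
# broadcast the small side, then join locally on each node.

NODES = 3

def hash_partition(rows, key):
    """Distributed-style: route each row to a node by hashing the join key."""
    parts = [[] for _ in range(NODES)]
    for row in rows:
        parts[hash(row[key]) % NODES].append(row)
    return parts

def broadcast(rows):
    """Broadcast-style: every node receives a full copy of the small side."""
    return [list(rows) for _ in range(NODES)]

orders = [{"cust": c, "amt": a} for c, a in [(1, 10), (2, 20), (1, 5)]]
customers = [{"cust": 1, "name": "ann"}, {"cust": 2, "name": "bob"}]

# Broadcasting customers guarantees matches are co-located with each
# node's local slice of orders, so no shuffle of the large side is needed.
local_orders = hash_partition(orders, "cust")
copies = broadcast(customers)
joined = [
    (o["amt"], c["name"])
    for node in range(NODES)
    for o in local_orders[node]
    for c in copies[node]
    if o["cust"] == c["cust"]
]
print(sorted(joined))  # [(5, 'ann'), (10, 'ann'), (20, 'bob')]
```

The trade-off mirrors the bullets above: broadcasting copies the small table N times but leaves the big table in place, while a distributed join shuffles both sides once.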


Drill Interfaces

You can connect to Apache Drill through the following interfaces:

  • Drill Shell: A pure-Java console utility for executing SQL commands
  • Drill Web UI: A simple interface for managing Drill. This UI can be launched from any browser pointed at a node in the Drill cluster
  • ODBC/JDBC: Drivers that let BI tools, such as Tableau or SQuirreL, connect to Drill
  • SQLLine: A JDBC application packaged with Drill
  • REST API: Provides access to Drill over HTTP, as used by the web console
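For example, the REST interface accepts a POST to a Drillbit's /query.json endpoint with a JSON body naming the query type and query text. The sketch below only constructs the request; the host and port are assumptions for a default local install:

```python
# Sketch of submitting a query over Drill's REST API. The request is built
# but not sent; point it at a live Drillbit to actually run the query.
import json
from urllib import request

payload = {"queryType": "SQL",
           "query": "SELECT * FROM cp.`employee.json` LIMIT 5"}

req = request.Request(
    "http://localhost:8047/query.json",          # 8047 = default web port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.get_method(), req.full_url)  # POST http://localhost:8047/query.json
# response = request.urlopen(req)      # uncomment against a running cluster
```

The `cp.` prefix refers to Drill's classpath storage plugin, which ships with the sample `employee.json` file, so this query works on a fresh install.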


Drill relies on a number of different techniques for improved performance.

Columnar Execution

Drill represents data internally as JSON documents – similar to MongoDB and Elasticsearch. These JSON documents are "shredded" into columns, which allows Drill to deliver the performance enhancements of columnar analytics but retain the ability to query complex data. Note that this internal representation is not based on Apache Arrow.
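The shredding step can be pictured as pivoting documents into per-field arrays (a simplified Python illustration, not Drill's value-vector implementation):

```python
# Illustrative "shredding" of row-oriented JSON documents into a
# column-oriented layout, so analytics scan one field at a time.

docs = [
    {"name": "ann", "age": 34},
    {"name": "bob", "age": 28},
    {"name": "cid", "age": 41},
]

# Pivot documents into per-field columns.
columns = {field: [d.get(field) for d in docs] for field in docs[0]}

# An aggregate now touches only the column it needs, not whole documents.
print(columns["age"])       # [34, 28, 41]
print(sum(columns["age"]))  # 103
```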

Runtime Compilation

Drill compiles queries at runtime, executing generated code rather than interpreting the query plan, which allows for significantly faster execution.
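As a loose analogy in Python (Drill itself generates and compiles Java code per query), compiling an expression once and then reusing the compiled object avoids re-parsing it for every record:

```python
# Analogy for runtime compilation: compile a filter expression once,
# then evaluate the compiled object per row instead of re-interpreting
# the expression text on every record.
predicate_src = "row['amount'] > 100"
compiled = compile(predicate_src, "<query>", "eval")  # compile once

rows = [{"amount": 50}, {"amount": 150}, {"amount": 300}]
matches = [row for row in rows if eval(compiled, {}, {"row": row})]
print(len(matches))  # 2 rows pass the filter
```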


Vectorization

Drill takes advantage of modern CPU instruction pipelines by operating on batches of records (vectors) instead of individual records. Vectorized processing is far more efficient on recent CPUs, yet many database engines do not take advantage of it.
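A minimal sketch of the difference, in plain Python (a real engine would operate on fixed-width, contiguous buffers rather than Python lists):

```python
# Contrast record-at-a-time processing with batch (vector) processing.
# The batch variant keeps the hot loop tight over contiguous values,
# which is what lets modern CPUs pipeline the work.

def sum_per_record(rows):
    total = 0
    for row in rows:        # one dispatch's worth of work per record
        total += row
    return total

def sum_vectorized(batches):
    # Operate on whole batches: a tight inner loop per vector of values.
    return sum(sum(batch) for batch in batches)

values = list(range(10))
batches = [values[i:i + 4] for i in range(0, len(values), 4)]
print(sum_per_record(values), sum_vectorized(batches))  # 45 45
```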

Locality-Aware Execution

If a Drillbit is running on each node in a cluster, Drill can structure its execution plan such that data movement over the network is minimized.
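A toy model of that placement decision (illustrative names, not Drill's planner API): assign each scan fragment to a node that already holds a replica of its data block, so the scan phase reads locally.

```python
# Locality-aware scheduling sketch: pick a co-located Drillbit for each
# data block so no block crosses the network during the scan.

block_locations = {          # block -> nodes holding a replica
    "b1": {"node1", "node2"},
    "b2": {"node2", "node3"},
    "b3": {"node1", "node3"},
}
drillbits = {"node1", "node2", "node3"}  # one Drillbit per node

assignments = {
    block: min(nodes & drillbits)        # any co-located Drillbit will do
    for block, nodes in block_locations.items()
}
remote_reads = sum(1 for b, n in assignments.items()
                   if n not in block_locations[b])
print(remote_reads)  # 0: every scan runs where its data lives
```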

Additional Required Middleware

As a SQL execution engine, Drill offers valuable functionality as far as executing queries on non-relational and filesystem sources. In this capacity, Drill can act as one tier in a data analytics stack, where other tools provide data curation, governance, and other services that are important for IT analysts.

Drill Architecture

Drill nodes run a service called a "Drillbit", which is responsible for query acceptance and processing. All Drill nodes are the same, which makes installation and monitoring much easier.

A Drill client issues a query to any Drillbit service, and that Drillbit then parses and optimizes the query, and is responsible for its execution. The owning Drillbit parcels the query out to other Drillbits according to its execution plan, and then receives the results and returns them to the client when complete.

Drill offers standard ODBC, JDBC, and REST interfaces to clients, which means that it works with BI tools such as Tableau, Power BI, and Qlik with no additional middleware.

Looking for Faster Queries, Self-Service, or an Integrated Solution?

Dremio is an open-source Data-as-a-Service Platform based on Apache Arrow. Dremio goes beyond Apache Drill to provide an integrated self-service platform that incorporates capabilities for data acceleration, data curation, data catalog, and data lineage, all on any source, and delivered as a self-service platform.

Run SQL on any data source. Includes optimized push-downs and parallel connectivity to non-relational systems like MongoDB, Elasticsearch, S3, and HDFS.

Accelerate data. Data Reflections are a highly optimized representation of source data, managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and as Apache Parquet for persistence.

Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.

Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.

Data Lineage. Full visibility into data lineage, from data sources, through transformations, joining with other data sources, and sharing with other users.

Dremio helps you get more value from your data, faster. Try it today.