Dremio Loves Open Source

We made Dremio open source so that everyone can build a Data-as-a-Service Platform. We also build Dremio Enterprise Edition with features that are essential for large organizations, such as enterprise security and connectivity to enterprise data sources. Learn more about Dremio Editions.

At Dremio we are committed to the open source software model. Many of us have been actively committing to projects for nearly a decade. We use a number of projects to build Dremio, including projects we embed in our products, as well as tools we use to build software.

Apache Arrow

We created the Apache Arrow project in close collaboration with many other projects. The project has two main goals. The first is to define a standard columnar representation of data for in-memory analytics. Arrow is complementary to popular on-disk formats such as Apache Parquet and Apache ORC.

The second goal is to provide a common set of APIs and low-level processing capabilities that can easily be accessed across different applications in many different languages. It turns out that the overhead of serializing and deserializing data between processes is frequently the limiting factor in high-performance analytics. Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. Today it includes first-class bindings in many libraries and languages, including Java, C++, Python, C, Go, Ruby, Rust, and JavaScript.
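To make the "in-memory representation is the wire representation" idea concrete, here is a minimal sketch using only the Python standard library. It does not use Arrow or its actual format. The column name and values are made up for illustration, and the example only shows the general principle that a contiguous, typed buffer can be handed to another consumer without encoding or parsing each value.

```python
from array import array

# A hypothetical column of 64-bit integers held in one contiguous, typed buffer.
prices = array("q", [100, 250, 75, 300])

# "Sending" the column: the in-memory bytes are used as the wire bytes,
# so there is no per-value encoding step.
wire_bytes = prices.tobytes()

# "Receiving" the column: reinterpret the bytes as 64-bit integers
# without parsing each value individually.
received = memoryview(wire_bytes).cast("q")

print(list(received))
print(sum(received))
```

The receiver works directly on the bytes it was given. A row-oriented or text-based exchange format would instead need to decode every field before any computation could start.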

Dremio’s distributed SQL execution engine is based on Apache Arrow. As data is accessed from different sources (RDBMS, NoSQL, file systems, etc.), Dremio reads it into native Arrow buffers for in-memory processing across 1-1000+ servers.

  • PMC Chair - Jacques Nadeau
  • PMC Members - Abdelhakim Deneche, Steven Phillips

Apache Parquet

A columnar database organizes the values for a given column contiguously on disk. This has the advantage of significantly reducing the number of seeks for multi-row reads. Furthermore, compression algorithms tend to be much more effective on a single data type rather than the mix of types present in a typical row. The tradeoff is that writes are slower, but this is a good optimization for analytics where reads typically far outnumber writes.
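The compression point can be illustrated with a small sketch. It uses simple run-length encoding as a stand-in for a real compression algorithm, and the table contents are hypothetical. Parquet and other columnar systems use more sophisticated encodings, so treat this only as an intuition for why grouping values by column helps.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical table: each row is (country, status, quantity).
rows = [
    ("US", "shipped", 1),
    ("US", "shipped", 1),
    ("US", "pending", 2),
    ("DE", "pending", 2),
    ("DE", "pending", 2),
    ("DE", "shipped", 3),
]

# Row-oriented layout: values from different columns are interleaved.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: each column's values are stored contiguously.
columns = [list(column) for column in zip(*rows)]
column_layout = [value for column in columns for value in column]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)

print(len(row_runs), "runs in the row layout")
print(len(column_runs), "runs in the column layout")
```

In the row layout, neighboring values almost never repeat because they come from different columns, so run-length encoding cannot shorten the data. In the column layout, repeated values sit next to each other and collapse into a handful of runs.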

Apache Parquet was created in 2012 and 2013 by teams at Twitter and Cloudera as a columnar storage format for HDFS. Since then it has become the standard way people store data for analytics in their Hadoop clusters, as well as in other processing systems.

Apache Calcite

Apache Calcite provides a SQL parser, a validator, and a framework for implementing a cost-based optimizer. We have developed our own adapters to data sources, as well as our own model for calculating costs in our query planner.

  • PMC Members - Jacques Nadeau
  • Committers - Laurent Goujon

React

React is a powerful JavaScript library from Facebook for building interactive user experiences. We think Dremio’s user experience sets it apart from other products on the market, and we have an exceptional team building this part of the product.

Apache ZooKeeper

ZooKeeper is a distributed, hierarchical key-value store. We use ZooKeeper to manage state for ephemeral processes.

RocksDB

RocksDB is a high-performance embedded key-value store from Facebook. We use RocksDB to persist metadata.

Ansible

Ansible is an automation framework. We use Ansible to provision all our environments as part of our continuous integration and continuous delivery.

Git

Like most startups, we use Git and GitHub for source control.

Gerrit

Gerrit is a code collaboration tool created by Google. We use Gerrit for code reviews.

Want to Learn More?

Dremio is an open source Data-as-a-Service Platform. Learn about Dremio.