
Dremio Loves Open Source

We made Dremio open source so that everyone can benefit from this new approach to data analytics. We also build Dremio Enterprise Edition with features that are essential for large organizations, such as enterprise security and connectivity to enterprise data sources. Learn more about Dremio Editions.

At Dremio we are committed to the open source software model. Many of us have been actively committing to projects for nearly a decade. We rely on a number of open source projects to build Dremio, including projects we embed in our products as well as tools we use to build our software.

Apache Arrow

We created the Apache Arrow project in close collaboration with many other projects. The project has two main goals. The first is to define a standard columnar representation for in-memory analytics. Building on the benefits of columnar data on disk, Apache Arrow makes a different set of tradeoffs, optimized for in-memory processing.

The second is to provide a common set of APIs that can be accessed across processes and programming languages. The overhead of serializing and deserializing data as it moves between processes is frequently the limiting factor in high-performance analytics; Apache Arrow avoids it by standardizing an efficient in-memory columnar representation that is identical to the wire representation. Today Arrow has first-class bindings in 13+ projects, including Spark, Hadoop, R, Python/Pandas, and Dremio.

Dremio’s distributed SQL execution engine is based on Apache Arrow. Dremio’s cache is persisted as Parquet files, then read into memory as Arrow buffers for processing.
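As a minimal illustration of Arrow's columnar in-memory model, the Java sketch below builds a single fixed-width column with the Arrow Java bindings. The column name and values are illustrative only, not anything from Dremio's engine.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowColumnExample {
    public static void main(String[] args) {
        // An allocator tracks the off-heap buffers that back every vector.
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector ages = new IntVector("age", allocator)) {
            ages.allocateNew(3);   // reserve space for three values
            ages.set(0, 34);
            ages.set(1, 27);
            ages.setNull(2);       // validity is tracked in a separate bitmap
            ages.setValueCount(3);

            // The values sit contiguously in memory, in the same layout Arrow
            // uses on the wire, so sharing them with another process is a
            // matter of handing over bytes, not converting them.
            System.out.println(ages.get(0) + ", nulls: " + ages.getNullCount());
        }
    }
}
```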

  • PMC Chair - Jacques Nadeau
  • PMC Members - Julien Le Dem, Abdelhakim Deneche, Steven Phillips, Jason Altekruse

Apache Parquet

A columnar format organizes the values for a given column contiguously on disk. This significantly reduces I/O for queries that scan many rows but touch only a few columns. Furthermore, compression algorithms tend to be much more effective on a single data type than on the mix of types found in a typical row. The tradeoff is that writes are slower, which is a good fit for analytics, where reads typically far outnumber writes.
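A toy sketch of that layout difference, in plain Java rather than Parquet's actual API: the same records stored row-wise and column-wise, where the columnar form lets a scan of one field touch only one contiguous array.

```java
public class RowVsColumn {
    // Row-oriented: each record's fields live together.
    record Trade(long id, String symbol, double price) {}

    public static void main(String[] args) {
        Trade[] rows = {
            new Trade(1, "AAPL", 190.1),
            new Trade(2, "MSFT", 410.5),
            new Trade(3, "AAPL", 189.8)
        };

        // Column-oriented: all values of one field are contiguous, so a
        // query over price reads only this array, and a compressor sees a
        // single type with similar neighboring values.
        double[] prices = {190.1, 410.5, 189.8};

        double sum = 0;
        for (double p : prices) sum += p;  // sequential, cache-friendly scan
        System.out.println("avg price: " + sum / prices.length);
    }
}
```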

Apache Parquet was created in 2012 and 2013 by teams at Twitter and Cloudera as a columnar storage format for HDFS. Since then it has become the standard way people store data for analytics in Hadoop clusters, as well as in other processing systems.

  • PMC Chair - Julien Le Dem

Apache Calcite

Apache Calcite is a SQL parser, validator, and framework for building a cost-based query optimizer. We have developed our own adapters to data sources, and our own cost model for our query planner.
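As a small taste of what Calcite provides out of the box, the sketch below uses its parser to turn a SQL string into an AST. The query text is illustrative, and this shows the stock parser only, not Dremio's adapters or cost model.

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

public class CalciteParseExample {
    public static void main(String[] args) throws SqlParseException {
        SqlParser parser = SqlParser.create(
            "SELECT city, COUNT(*) AS cnt FROM trips GROUP BY city");

        // Parse into an AST; validation and cost-based optimization are
        // separate phases that Calcite layers on top of this tree.
        SqlNode ast = parser.parseQuery();
        System.out.println(ast);
    }
}
```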

  • PMC Member - Jacques Nadeau
  • Committers - Laurent Goujon, MinJi Kim

React

React is a powerful JavaScript library from Facebook for building interactive user interfaces. We think Dremio’s user experience sets it apart from other products in the market, and we have an exceptional team building this part of the product.

Apache ZooKeeper

ZooKeeper is a distributed, hierarchical key-value store. We use ZooKeeper to manage state for ephemeral processes.
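A minimal sketch of that ephemeral-state pattern using the stock ZooKeeper Java client. The connection string and paths are illustrative, and the /workers parent node is assumed to already exist.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralRegistration {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; this watcher ignores session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // EPHEMERAL_SEQUENTIAL znodes exist only while this session is
        // alive, so if the process dies its entry vanishes automatically.
        String path = zk.create("/workers/worker-", "ready".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        System.out.println("Registered as " + path);
        zk.close();  // closing the session removes the ephemeral node
    }
}
```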

RocksDB

RocksDB is a high-performance embedded key-value store from Facebook. We use RocksDB to persist metadata.
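A minimal sketch with the RocksJava bindings, persisting and reading back one entry. The database path, key, and value are illustrative, not Dremio's actual metadata layout.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class MetadataStoreExample {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();  // load the native library once per process

        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/metadata-db")) {
            // RocksDB is embedded: keys and values are raw bytes persisted
            // locally, with no separate server process to run.
            db.put("dataset:42".getBytes(), "format=parquet".getBytes());
            byte[] value = db.get("dataset:42".getBytes());
            System.out.println(new String(value));
        }
    }
}
```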

Ansible

Ansible is an automation framework. We use Ansible to provision all our environments as part of our continuous integration and continuous delivery pipelines.

Git

Like most startups, we use Git and GitHub for source control.

Gerrit

Gerrit is a code collaboration tool created by Google. We use Gerrit for code reviews.

Want to Learn More?

Dremio is a new approach to data analytics. Learn about Dremio.