We made Dremio open source so that everyone can benefit from this new approach to data analytics. We also build Dremio Enterprise Edition with features that are essential for large organizations, such as enterprise security and connectivity to enterprise data sources. Learn more about Dremio Editions.
At Dremio we are committed to the open source software model. Many of us have been actively committing to open source projects for nearly a decade. We build Dremio with a number of open source projects, including ones we embed in our products as well as tools we use to build software.
We created the Apache Arrow project in close collaboration with many other projects. There are two main goals of the project. First, to define a standard representation for columnar data for in-memory analytics. Building on the benefits of columnar data on disk, Apache Arrow makes a different set of tradeoffs for in-memory processing.
Second, we wanted to provide a common set of APIs that could easily be accessed across processes and from different programming languages. It turns out that the overhead of serializing and deserializing data between processes is frequently the limiting factor in high-performance analytics. Apache Arrow standardizes an efficient in-memory columnar representation that is the same as the wire representation. Today it includes first-class bindings in over 13 projects, including Spark, Hadoop, R, Python/Pandas, and Dremio.
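The key idea, that the in-memory layout equals the wire layout, can be sketched in a few lines of Python. This is a simplified illustration, not the actual Arrow format: a column of int32 values packed into one contiguous buffer that can be transmitted byte-for-byte.

```python
import struct

# Hypothetical sketch (not the real Arrow layout): a column of int32
# values stored contiguously in a single buffer.
values = [10, 20, 30, 40]
buffer = struct.pack(f"<{len(values)}i", *values)

# Because the in-memory representation matches the wire representation,
# "sending" the column is just transmitting the raw buffer. The
# receiver interprets the same bytes directly, with no per-value
# serialize/deserialize step.
received = list(struct.unpack(f"<{len(buffer) // 4}i", buffer))
```

In a row-oriented or object-based representation, each value would instead be converted to and from a wire format on both ends, which is the serialization overhead Arrow eliminates.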
Dremio’s distributed SQL execution engine is based on Apache Arrow. Dremio’s cache is persisted as Parquet files, then read into memory as Arrow buffers for processing.
A columnar database organizes the values for a given column contiguously on disk. This has the advantage of significantly reducing the number of seeks for multi-row reads. Furthermore, compression algorithms tend to be much more effective on a single data type rather than the mix of types present in a typical row. The tradeoff is that writes are slower, but this is a good optimization for analytics where reads typically far outnumber writes.
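The row-versus-column tradeoff described above can be shown with a toy example (not tied to any particular database): interleaved rows versus one contiguous sequence per column, where a multi-row aggregate touches only the column it needs.

```python
# Row-oriented layout: each record interleaves all of its columns.
rows = [(1, "alice", 3.5), (2, "bob", 2.0), (3, "carol", 4.25)]

# Columnar layout: each column's values are stored contiguously.
ids, names, scores = zip(*rows)

# A multi-row aggregate over one column reads only that column's
# contiguous values, instead of seeking through every full row.
total = sum(scores)

# Homogeneous columns (all floats, all ints) also compress better
# than mixed-type rows, since compressors exploit redundancy within
# a single data type.
```

The write-side cost is visible here too: appending one logical row means updating every per-column sequence, which is why columnar storage favors read-heavy analytics.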
Apache Parquet was created in 2012 and 2013 by teams at Twitter and Cloudera as a columnar storage format for HDFS. Since then it has become the standard way people store data for analytics in their Hadoop clusters, as well as in other processing systems.
Apache Calcite is a framework that provides a SQL parser, a validator, and a foundation for building a cost-based query optimizer. We have developed our own adapters to data sources, and our own model for calculating costs in our query planner.
ZooKeeper is a distributed, hierarchical key-value store. We use ZooKeeper to manage state for ephemeral processes.
RocksDB is a high-performance embedded key-value store from Facebook. We use RocksDB to persist metadata.
Ansible is an automation framework. We use Ansible to provision all our environments as part of our continuous integration and continuous delivery.
Like most startups, we use Git and GitHub for source control.
Gerrit is a code collaboration tool created by Google. We use Gerrit for code reviews.
Dremio is a new approach to data analytics. Learn about Dremio.