We made Dremio open source so that everyone can build a Data-as-a-Service Platform. We also build Dremio Enterprise Edition with features that are essential for large organizations, such as enterprise security and connectivity to enterprise data sources. Learn more about Dremio Editions.
At Dremio we are committed to the open source software model. Many of us have been actively committing to projects for nearly a decade. We use a number of projects to build Dremio, including projects we embed in our products, as well as tools we use to build software.
We created the Apache Arrow project in close collaboration with many other projects. There are two main goals of the project. First, to define a standard representation for columnarized data for in-memory analytics. Arrow is complementary to popular on-disk formats such as Apache Parquet and Apache ORC.
Dremio’s distributed SQL execution engine is based on Apache Arrow. As data is accessed from different sources (RDBMS, NoSQL, file systems, etc.), it is read into native Arrow buffers for in-memory processing across 1-1000+ servers.
A columnar database organizes the values for a given column contiguously on disk. This has the advantage of significantly reducing the number of seeks for multi-row reads. Furthermore, compression algorithms tend to be much more effective on a single data type rather than the mix of types present in a typical row. The tradeoff is that writes are slower, but this is a good optimization for analytics where reads typically far outnumber writes.
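The layout difference described above can be sketched in a few lines of Python. This is only an illustration, not how Dremio, Arrow, or Parquet store data: the sample table, the comma-separated encoding, and the names `row_layout` and `column_layout` are all assumptions made for the example, and in-memory byte strings stand in for on-disk storage.

```python
import zlib

# Hypothetical sample table: 1,000 rows of (id, country, amount).
rows = [(i, ["US", "DE", "JP"][i % 3], round(i * 1.5, 2)) for i in range(1000)]

def row_layout(rows):
    """Serialize row by row, so values of different types are interleaved."""
    return "\n".join(f"{i},{c},{a}" for i, c, a in rows).encode()

def column_layout(rows):
    """Serialize column by column, so each column's values are contiguous."""
    ids, countries, amounts = zip(*rows)
    return [
        ",".join(map(str, ids)).encode(),
        ",".join(countries).encode(),
        ",".join(map(str, amounts)).encode(),
    ]

row_bytes = row_layout(rows)
col_chunks = column_layout(rows)

# Reading only the "country" column: the row layout has to scan every row,
# while the columnar layout reads a single contiguous chunk.
bytes_scanned_row = len(row_bytes)
bytes_scanned_col = len(col_chunks[1])

# A chunk holding a single type with few distinct values compresses well.
country_raw = len(col_chunks[1])
country_compressed = len(zlib.compress(col_chunks[1]))

print("bytes scanned (row layout):     ", bytes_scanned_row)
print("bytes scanned (columnar layout):", bytes_scanned_col)
print("country column raw/compressed:  ", country_raw, "/", country_compressed)
```

The sketch also hints at the write tradeoff: appending one row to the columnar form means touching every column chunk rather than adding a single line.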
Apache Parquet was created in 2012 and 2013 by teams at Twitter and Cloudera as a columnar data model for HDFS. Since then it has become the standard way to store data for analytics in Hadoop clusters, as well as in other processing systems.
Apache Calcite is a SQL parser, validator, and engine for implementing a cost-based optimizer. We have developed our own adapters to data sources, and our own model for calculating costs in our query planner.
ZooKeeper is a distributed, hierarchical key-value store. We use ZooKeeper to manage state for ephemeral processes.
RocksDB is a high-performance embedded key-value store from Facebook. We use RocksDB to persist metadata.
Ansible is an automation framework. We use Ansible to provision all our environments as part of our continuous integration and continuous delivery.
Like most startups, we use Git and GitHub for source control.
Gerrit is a code collaboration tool created by Google. We use Gerrit for code reviews.