Dremio Jekyll

Introducing the Gandiva Initiative for Apache Arrow

Summary

We recently announced The Gandiva Initiative for Apache Arrow. This is a new execution kernel for Arrow that is based on LLVM. Gandiva provides very significant performance improvements for low-level operations on Arrow buffers. We first included this work in Dremio to improve the efficiency and performance of analytical workloads on our platform, which will become available to users later this year. In this post we will describe the motivation for the initiative, implementation details, some performance results, and some plans for the future.

A note on the name: Gandiva is a mythical bow from the Indian epic The Mahabharata used by the hero Arjuna. According to the story Gandiva is indestructible, and it makes the arrows it fires 1000x more powerful.

About Arrow

Apache Arrow is a cross-platform standard for columnar data for in-memory processing. You can think of Arrow as the in-memory counterpart to popular on-disk formats like Apache Parquet and Apache ORC, and increasingly as the standard used by many different systems. The Arrow project provides several components for users to build into their projects:

  • a columnar specification that is optimized for modern CPUs and GPUs, which is designed to represent both flat and hierarchical data structures like JSON and Tensors
  • zero-copy reads that eliminate the need for serializing and de-serializing data across platforms
  • computational libraries in popular languages, including Java, C++, C, Python, Ruby, Go, Rust, and JavaScript

Projects like pandas in Python use Arrow to improve the efficiency and performance of accessing data in-memory through the dataframe APIs. As another example, Dremio uses Arrow in our SQL execution engine: as data is accessed from different sources, we read the data into native Arrow buffers directly, and all processing is performed on these buffers.

You can learn more about the origins and history of Arrow here. Details on the specification and associated libraries are available here.

About LLVM

LLVM is an open source compiler that was originally developed by Swift language creator Chris Lattner in 2000, and is extensively used by Apple. While LLVM is an alternative to GCC for general purpose compiling needs, it also provides Just-in-Time compilation capabilities that can incorporate runtime information to produce highly optimized assembly code for the fastest possible evaluation.

For Gandiva we wanted to take advantage of the just-in-time compilation abilities of LLVM to improve the efficiency and latency of operations on Arrow buffers. Before Gandiva, in Dremio SQL queries were dynamically compiled to efficient byte code for execution by the JVM. While this provides very significant advantages over interpreted processing of SQL expressions, we believed that by using LLVM we could take another major step forward in terms of making more optimal use of the underlying CPU and GPU architecture available for processing.

By combining LLVM with Apache Arrow libraries, Gandiva can perform low-level operations on Arrow in-memory buffers such as sorts, filters, and projections that are highly optimized for specific runtime environments, improving resource utilization and providing faster, lower-cost operations of analytical workloads.

Gandiva Architecture

Gandiva is a new open source project licensed under the Apache license and developed in the open on GitHub. It is provided as a standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow buffers using runtime code-generation in LLVM. While we use Gandiva at Dremio, we worked very hard to make sure Gandiva is an independent kernel that can be incorporated into any analytics system. As such, it has no runtime or compile time dependencies on Dremio or any other execution engine. It provides Java APIs that use the JNI bridge underneath to talk to C++ code for code generation and expression evaluation. Dremio’s execution engine leverages Gandiva Java APIs.

Applications submit an expression tree to the Gandiva compiler, which compiles for the local runtime environment. The request is then handed to the Gandiva execution kernel, which consumes and produces batches of Arrow buffers.

How applications interact with Gandiva

Gandiva Expression Library

Gandiva today supports 100s of expressions, and we hope to grow this to 1000s over the next few months. Gandiva supports both filter and project relational operators. In addition Gandiva supports:

  • Arrow Math operators: +, -, /,*,%, ^
  • Arrow Boolean operators: ==. !=, ===, <, >, AND, OR, CASE … ELSE/IF … ELSE, etc
  • Arrow Dates and Times: String → Date, String → Timestamp, Date Extract, etc.

For example Gandiva can process the following expressions:

  • (𝑐𝑜𝑙𝐴 % 5==0) 𝐴𝑁𝐷 (𝑡𝑜𝑑𝑎𝑦 − 𝑒𝑥𝑡𝑟𝑎𝑐t(𝑑𝑎𝑦𝑠 𝑓𝑟𝑜𝑚 𝑐𝑜𝑙𝐵)) < 10

  • CASE WHEN a < 5 THEN 0 WHEN a > 10 THEN 2 ELSE 1

Null Decomposition and Pipelining

Another optimization implemented in Gandiva is something we call null decomposition. Expressions submitted to Gandiva might involve records with NULL values. For some operations, we know that the result of many operations on expressions that include NULL values is always NULL. By separating whether a value is null (validity) from the actual value (data), we can much more efficiently process record batches.

image alt text

With this optimization, we can then determine nullness using bitmap intersections which can significantly reduce branching overhead by the CPU. Data values can then be batched and submitted for efficient SIMD processing on modern CPUs. It turns out this type of semantic is very common in SQL-based processing, and so this optimization is actually very effective, especially for SIMD and GPU-based processing.

Vectorization and SIMD

Arrow memory buffers are already well organized for SIMD instructions. Depending on the hardware, Gandiva can processing larger batches of values as a single operation. For example, considering the diagram below, you might have two sets of 2 bit values that need to be combined in an operation.

Vectorization and SIMD in Apache Arrow

For CPUs with AVX-128 Gandiva can process 8 pairs of these 2 byte values in a single vectorized operation. Additionaly where available, AVX-512 processors can use Gandiva to process 4x as many values in a single operation. This optimization is performed automatically, and many others are possible. There’s a talk on vectorization in Arrow here as well as a blog post with some of the performance enhancements we observed from these types of changes here.

Asynchronous Thread Control

One of the other key things we’ve learned from working with Arrow in Dremio is the need to work to effectively share limited resources in a multi-tenant environment. This is in contrast to other environments, such as a single Python-based client for example, where the single process expects to consume all available resources. In a multi-tenant environment, you have multiple consumers of resources, and potentially different SLAs established for each. (In many cases, you’ll also be running on shared hardware, colocated with several other resource-hungry processes.)

In analytics, we’re typically bound by memory capacity and then entirely performance throttled by CPU throughput (once you have enough memory, the CPU is your bottleneck). Historically systems solve resource sharing of the CPU by creating more threads (and thus delegating the need to balance resources to the operating system thread scheduler). There are three challenges with this approach: (1) you have weak or no control over resource priorities; (2) as concurrency increases, context switching dominates CPU utilization; and (3) high priority cluster coordination operations like heartbeats may miss their scheduling requirement and cause cluster cohesion instability.

We’ve designed Gandiva to allow for more flexibility in how resources are allocated to each request. For example, in Dremio we assign one thread per core and constantly reassess the state of workloads to rebalance work for optimum efficiency.

For example in the figure below, imagine you have three different users trying to use the system concurrently. Assuming 11 intervals of time for a single core, you might want to allocate resources to each operation differently.

Threading control in Gandiva

User 1 is allocated a single interval of time, whereas the third operation is allocated significantly more resources than the first two (eg, a “premium” user). Because Dremio processes jobs asynchronously, it can periodically revisit each thread (based on a quanta we define, which is designed to ensure the majority of time goes to forward progress instead of context switches) to rebalance available resources given the priorities of jobs running in the system.

Gandiva works well with this model, allowing the system to operate asynchronously. It does this by allowing small amounts of work units to be concluded followed by suspension to the calling code. This pattern allows it to be used in both a traditional synchronous engine as well as more powerful asynchronous engines.

Easy to Use in Multiple Languages

We built Gandiva to be compatible with many different environments. C++ and Java bindings are already available for users today. We also hope that we can work with the community to produce bindings for many other languages, such as Python, Go, JavaScript, Ruby, and more.

Easy to use with multiple languages

To make new primitives available for use in Gandiva, the core C++ library exposes a number of capabilities that are language independent (including a consistent cross-language representation of expression trees). From there, all data is expected in the Arrow format, building on the ability of Arrow to be used across these different languages.

Performance Observations

Generally speaking, Gandiva provides performances advantages across the board with very few compromises. First, it reduces the time to compile most queries to less than 10ms. In Dremio, Gandiva also improves the performance of creating and maintaining Data Reflections, which Dremio’s query planner can use to accelerate queries by orders of magnitude by creating a more intelligent query plan that involves less work.

To assess the benefits of Gandiva, we compared the performance of SQL queries executed through Dremio using standard Java code generation vs. compiling the queries through Gandiva. Note that the performance of Dremio using existing Java-based query compilation is on-par with state of the art SQL execution engines.

Five simple expressions were selected and the expression evaluation time alone was compared to process a JSON dataset of 500 million records. The tests were run on a Mac desktop (2.7GHz quad-core Intel Core i7 with 16GB ram).

In general, the more complex the SQL expression, the greater the advantage of using Gandiva.

Sum

1
SELECT max(x+N2x+N3x) FROM json.d500

Five output columns

1
2
3
4
5
6
7
SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x)
FROM json.d500

Ten output columns

1
2
3
4
5
6
7
8
9
10
11
12
SELECT
sum(x + N2x + N3x),
sum(x * N2x - N3x),
sum(3 * x + 2 * N2x + N3x),
count(x >= N2x - N3x),
count(x + N2x = N3x),
sum(x - N2x + N3x),
sum(x * N2x + N3x),
sum(x + 2 * N2x + 3 * N3x),
count(x <= N2x - N3x)
count(x = N3x - N2x)
FROM json.d500

CASE-10

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
SELECT count
(case
when x < 1000000 then x/1000000 + 0
when x < 2000000 then x/2000000 + 1
when x < 3000000 then x/3000000 + 2
when x < 4000000 then x/4000000 + 3
when x < 5000000 then x/5000000 + 4
when x < 6000000 then x/6000000 + 5
when x < 7000000 then x/7000000 + 6
when x < 8000000 then x/8000000 + 7
when x < 9000000 then x/9000000 + 8
when x < 10000000 then x/10000000 + 9
else 10
end)
FROM json.d500

CASE-100

Similar to case-10 but with 100 cases, and three output columns.

Test Project time with Java JIT (seconds) Project time with LLVM (seconds) Java JIT time / LLVM time
SUM 3.805 0.558 6.81x
Five output columns 8.681 1.689 5.13x
Ten output columns 24.923 3.476 7.74x
CASE-10 4.308 0.925 4.66x
CASE-100 1361 15.187 89.61x

Designed to Be Shared Across Environments

Gandiva was designed to be used in many contexts. We hope that communities using Python, Spark, Node and other environments can all find ways to embed and leverage Gandiva.

image alt text

Because Arrow is cross-platform, and because Gandiva consumes and produces Arrow, each process can efficiently interact with data from a common in-memory standard, and without serializing and deserializing the data.

This is a new approach for performing optimized processing of Arrow data structures. Gandiva builds on the Arrow’s adoption momentum to make processing on these structures even more efficient. This is work each application that implements Arrow would have otherwise implemented on their own, reinventing the wheel so to speak.

Prior Art

Gandiva draws inspiration from many great projects. Systems like Apache Impala and the more recent Weld work from Stanford have done a great job of setting ground rules around the use of LLVM in the context of data analytics pipelines. The work on Gandiva was inspired by many of these projects and looks to move things further forward by providing an in-memory columnar kernel with runtime compilation that is standardized in both data format and interface.

Works Well With Apache Arrow Flight

Recently we proposed Apache Arrow Flight, a new way for applications to interact with Arrow. You can think of this as an alternative to ODBC/JDBC for in-memory analytics. Now that we have an established way for representing data in-memory, Flight defines a standardized way to exchange that data between systems.

If REST is the primary protocol for microservices, Flight is the primary protocol for data microservices. This allows organizations to stitch different data subsystems together in a non-monolithic way.

For example, in Dremio we have been consuming and producing data using Arrow from day one. As Dremio reads data from different sources, the data is read into Arrow buffers directly, regardless of the source. All processing is then performed on native Arrow buffers.

image alt text

For client applications interacting with Dremio, today we deserialize the data into a common structure. For example, for applications such as Tableau that query Dremio via ODBC, we process the query and stream the results as Arrow buffers all the way to the ODBC client before serializing to the cell-based protocol that ODBC expects. The same is true for all JDBC applications.

As soon as Arrow Flight is generally available, applications that implement Arrow can consume the Arrow buffers directly. This provides a substantial improvement in throughput and CPU consumption. For example, in our internal tests we observe from 10x-100x efficiency improvements with this approach compared to ODBC/JDBC interfaces.

What’s Next

Our goal at Dremio is to build and drive development of modular data analysis components likes Apache Arrow, Apache Arrow Flight, and Gandiva. We hope to drive new collaborative innovation in both industry and academic communities. If we can share foundations, it allows people to leverage prior art to build truly innovative things that can also be incorporated into real world use cases. We hope Gandiva can help make technologies like columnar in-memory runtime optimized code generation accessible to more audiences.