24 minute read · July 23, 2018

Using LLVM to Accelerate Processing of Data in Apache Arrow

Copied to clipboard

Siddhartha:Hello everyone. Welcome to my presentation on using LLVM to accelerate processing of data in Apache Arrow. Very quickly about me. My name is Siddharth, I’m currently a software engineer at Dremio as part of the career exhibition team. I’m a committer for Apache Arrow project. Previously, I was employed at Oracle for three and a half years as part of the database engine team. This will be the agenda for today’s presentation. I’ll start by giving a brief introduction to Apache Arrow and Dremio. I will then cover some preliminary basics around runtime code generation in databases, and their requirements. Followed by a deep dive into LLVM and LLVM introduction basics, and how we have used LLVM in Dremio’s code execution engine. And then I’ll keep a few minutes at the end for questions and answers. So Apache Arrow. Apache Arrow is a top-level open source project in Apache Software Foundation. It was announced two years ago, and has seen rapid growth. The main emphasis with Arrow is actually to do high speed in-memory analytics using Columnar format. So Arrow actually provides a standard specification for Columnar format, supporting both flat as well as nested data types, and there are different implementations in multiple languages like C, C++, Java, and also maintaining compatibility across different languages. Developers from several different open source organizations and projects are contributing to the growth of Arrow community. So, one of the main goals with Arrow is to actually support interoperability where there are different data processing engines like Spark, Python, Impala. Sorry, Spark, Pandas, Impala, and each of these processing systems have their own proprietary format, tailor-made for suiting their performance requirements. But when it comes to these data processing systems to exchange information, they have to copy, convert, target, to convert the data to target memory format, so there is a lot of CPU wasted on serialization, de-serialization. Now, if all of these processing systems are given a standard, columnar data format, then we can actually avoid all the cost associated with the serialization and de-serialization. So, a lot of overhead associated with cross-system communication can be saved with Apache Arrow. So in that way, Arrow acts as a standard interface for a high speed data exchange between different systems. This slide gives a brief overview introduction of Apache Arrow across the big data community. For example, PyArrow which is a Python wrapper over the C++ implementation of Arrow columnar format, has seen a lot of interest in the data science community on an average of 100,000 downloads every month. As I said, Arrow basically is a standard in-memory columnar format, and the main emphasis is on CPU efficiency. The columnar format has become the de facto format for building high speed, analytical query engines that accelerate data warehouse workloads, or BI workloads. So, the columnar format allows the engineers to write the query processing algorithms, rewrite the core process algorithms in a very efficient manner. Like, we can write a very simple type code and run through the column values doing the necessary processing, like predicate evaluation, aggregation, and it also allows us to work on the compressed columnar data directly, because columnar data typically uses very lightweight compression schemes, like run length encoding [inaudible 00:03:32] encoding. So, all of these mechanisms put together can actually help us build column processing algorithms for accelerating the data warehouse workloads very efficiently. Arrow’s data structures around the columnar format typically allow us to access a cell value in a column, in a constant time. Modular sums metadata overhead structure, overhead costs that we pay upfront. This is a small example of representing an integer column and a varchar column in Arrow data structures. Like, on the left-hand side we have a fixed width four byte Int column, and this column is represented in two distinct contiguous buffers, the first buffer is a bitmap, the validity buffer, which basically tracks whether the corresponding column value in the data buffer is null or not. And the second example is the varchar column, where just like fixed width column, we have a validity buffer and the data buffer, but because the length of each cell value is not fixed, so we have an offset buffer, which is again, a fixed width buffer starting and ending offsets for each of the cell value in the varchar data buffer. [Arrow infractus 00:04:39], and this is also a brief introduction to GML. GML is an open source software, a high speed distributor in analytical query engine, and the query engine is completely built on top of [arrows 00:04:50] java libraries and runs inside JVM. Moving on and let’s look over some basics around why do we need code generation and databases. Typically, the most efficient query execution plan for any query would be a hand-crafted query plan, which does the exact amount of work for that particular query for those particular operators, data types, and etc., so it basically can not handle any arbitrary type of SQL query. It is still isn’t made for that particular query. And we can write highly optimized code to handle each query specifically. But that’s not the reality, because in real world query engines basically have to support a wide variety of functionality, lot’s of different SQL operators, data types, etc. So, the general model of execution, is the interpreted model execution, where there’s a giant control rock. The control rock is basically responsible for absorbing all the arbitrary run time information about the query like the field types, the data types, Schema, and then handling each of that information as it becomes available, and then using a lot of branching, and then using abstract helper functions which are typically implemented in java while using run time polymorphism, or in C++ while using function pointers. So, this has some overheads associated, and we’ll go over that in the next slide. So, interpreted execution is not very CPU efficient, because we are using a general purpose code function, right, so it has excessive branching, a lot of over it, associated with function codes. And it turns out that in modern pipeline architectures using code with a lot of branching is actually not the best way to write performance critical operations. CPU’s have become really, really efficient but still, the sheer presence of branches in your code actually makes instruction scheduling very harder. And, every time the branch is mispredicted, the entire pipeline has to be flushed. So, we have to do something better to make this code very efficient, and this is the reason why most databases actually do code generation at run time, as opposed to following interpreted execution. So, when the SQL query execution begins in a database, in the query engine, all the run time information for that particular query is available at that time. That information can be interpreted and then we have the opportunity to use a code tailor made for that particular query. So, this means that’s we have to generate query specific code at run time when the query execution has already begun. And, there we have the opportunity to generate highly optimized custom code for that particular query. There are some commonly used run time code generation techniques in databases. In Java [war 00:07:28], what we can do is, this is also what GML does, that we can use predefined templates and use generated Java classes at run time for that particular query and use JDK or GNU to compile them to byte code and load the byte code inside the same JVM and continue with execution. Sorry. The second way in the C/C++, what we can do, is that we can transform, we can use a process called transcompilation, where we can actually transform the query into a corresponding C/C++ code, execv via compiler and then load the executive. As far as I know, Amazon Redshift uses this mechanism, but there are a couple of problems associated with these approaches like in the Java WAR, the sheer volume of objects and the instantiation of objects and [data refrencing 00:08:13], right, that actually hurts performance and also increases the heap overhead. Secondly, because in Java WAR, we are actually compiling to byte code. We are already losing the opportunity to understand the hardware capabilities in a better way, and doing modern optimization like [looper optimization 00:08:28], [SYMTDI 00:08:29] operations, and using wider registers which are typically available on all modern hardware for efficiently processing decimals. In C/C++ WAR, it turns out that the general compilation and optimization process is not fast enough. So, there are some overheads associated with these techniques of doing run time code generation. There are a couple of run time code generation requirements for handling efficient query execution engine. The first one being that the metal that you are using the generate code at run time should in itself be very efficient. Yes, it is supposed to generate efficient highly optimized code, but the process of doing that should be efficient in itself. The third one is that the metal should be able to handle arbitrary and complex SQL expressions very efficiently. And this bring us to LLVM. What is LLVM? LLVM is basically a tool chain to build just in time compilation infrastructure comprising of a modular compilated tools, and it is exciting for the database world because LLVM can be used in databases to generate query specific native optimized code for a specific query at run time, and that has significant potential for improving the performance of execution engines. There are two high level steps in LLVM. The first step is actually to generate IR, which is called intermitted representation, and we generate this from a high level language like java, C/C++, etc. IR is both language independent as well as target architecture independent . The IR is then optimized and then converted, and compiled into native machine code for a specific architecture. Because as I said, LLVM provides you with a tool, a tool set to build compilation infrastructure. It basically gives you all steps that are necessary, that are actually done by any compiler when it compiles the program. So, there are step by step mechanisms for compiling the program, and all the steps are given during the compilation process. Moving on to discuss further more details about LLVM, and a bit more details on intermitted representation. IR is basically the code of LLVM for code generation. It’s like it’s own assembly language, like specification that is used by LLVM infrastructure to represent code during compilation. There are two mechanisms to generate IR. The first one is that LLVM provides a C++ api called IR builder. It basically from a high level language, you can assemble a function in IR. From a high level language, you can assemble a function in IR, instruction by instruction. The second method is called cross-compilation, where you have some function written in C++, and then you can use C languages and LLVM frontend, a C++ compiler to transfer that C++ code into IR, the intermediate representation. Overall, doing LLVM base compilation has three high-level steps. You generate IR from any high-level language. You optimize that IR. Then you compile IR to the native machine code for specific architecture, and then you execute that code. What is our goal with LLVM in Dremio? Our high-level goal is to actually use LLVM for efficient execution of SQL expressions in native machine code, and to significantly improve the performance of our execution engine. This brings us to Gandiva. A brief history about Gandiva. Gandiva is basically associated with Hindu mythology in India, where it was actually a mythical bow used to fire arrows a thousand times faster, and that’s why we have decided to Gandiva to build efficient query execution engine on top of Apache Arrow. What is Gandiva? Gandiva is a standalone C++ library, a very powerful expression evaluation engine on top of Apache Arrow. It uses runtime code-generation in LLVM. Gandiva is completely standalone. It has no association or no compile time or runtime dependencies on Dremio’s query execution engine or any other query execution engine. All the heavy lifting and the main implementation, doing runtime code generation in LLVM execution, optimization; all of that is done in C++, but it provides Java, of course, which underneath, use the JNI bridge to talk to Gandiva C++ implementation. This is the exact path that Dremio’s query execution engine leverages because our execution is written in Java. We started with a subsort of expressions. Because this is a new initiative, we started with a subsort of expressions, like if/else, CASE expressions, and support for all fixed width data types, but gradually, we’ll be making this list more comprehensive to actually meet by a very powerful expression evaluation engine, like support for variable worth of data types, support for complex vectors in Arrow, and Boolean expressions. High-level design of IR generation in Gandiva, there are two ways we do it. One is a function registry of pre-compiled IR modules. For example, a SQL functions, isNumeric(), isDate(), extractYear(), so we have pre-compiled IR modules in our registry, which we do not generate at runtime. It is already available. Then, at runtime, we generate IR using the IRBuilder C++ API. This basically does the more heavy lifting, like going over all the column values in a loop on Arrow vectors, then implementing control blocks like if/else expressions, CASE expressions, Boolean expressions. Then underneath, this guy uses the pre-compiler modules. We use these two LLVM IR modules to actually come over the final most LLVM module, and then we use cross-optimizations. Optimizations across both the pre-compiled IR module as well as the runtime-generated IR module. Finally, the LLVM module with cross-compilation is then … We have a native code for that, and that is what is executed during query execution. The core of Gandiva is LLVM-based code generation is built on a tree based expression builder, where we actually define the operator, operands, and the output at each level in the expression tree. For each node in the expression tree, we can have an if/else node, a node to represent if/else expression, a field, a node to represent a field, a node to represent a function, a node to represent a literal. This is basically a high-level overview of how have built a tree based expression builder. There are three primary steps for doing runtime-code generation in Gandiva. The first step is to actually build a tree based expression. You give a schema, a set of fields, operation types, add/subtract, greater than/less than, whatever, and then build a tree based expression. You execute this step as many times for the number of expressions that you are working with. Each time it returns a root, pointer to the root of the expression tree. The second step is to actually generate code. You have a set of expressions represented in a tree format, now you generate code for them. So, there is this CC++ module. We call this projector in LLVM, which underneath uses LLVM generator to generate IR code for those expression trees, and then optimizes them and compiles them to native code to build an LLVM module. This step is done exactly once for the set of expressions that you’re working with. Then, the next step is to do the actual execution, the actual SQL execution. We have an Arrow RecordBatch. RecordBatch comprises of vectors, a vector representing each column or a field in your schema. Then, for a given RecordBatch, we actually use the projector to execute the LLVM code that was built in the second step. The third step is actually executed as many times for the incoming RecordBatches you have. You can have like 10 RecordBatches. Each RecordBatch comprising of 4,000 records so you just keep executing them in LLVM using the generated projector module. This is some sample code. Let’s just run through this. We are evaluating expression like, if (a>b) a+b else a-b. So, a and b are represented as fields for bit integers. That’s what is done in the first line of the code. Using that, we build Arrow schema to represent these fields. Then, we build an output field to the present the result. We use a number of steps to build a tree based expression. The first is to make a field node representing field a. The next step is to build a field node representing b, and then a function, which is greater than. Greater than will operate on both fields A and B, and the result is Boolean, and then make another function, which is sum, which will again operate on a and b and the result for bit int. Then another function, which is subtract, which will operate on fields a and b and the result is for bit int. Finally, use the condition, the greater than condition to build … The if node in the tree, which will … If the condition turns out to be true, we do sum function of the condition turns out to be false. We execute subtract function, and the result is for bit int. Finally, assemble these guys into a tree based expression. That’s what the final step does make expression with the if node and the result. Once we have build a tree expression, we build a projector to project these expressions and evaluate them. This guy underneath generates code to compile these expressions into native machine code using LLVM. I just create some sample data input record, batch, some number of columns, arbitrary number of columns, and then evaluate these expressions on that RecordBatch and populate output vectors. That is what the final step is. So, here, I’ve just shown the example for just one simple expression. What you can do is that you can do the the initial number of steps or as many expressions you want. Finally, just build the projector once and finally, do the … Oh, I’m sorry. Finally, do the execution as many times as you want. Gandiva uses a technique called as expression decomposition for highly-optimized executions of vast majority of function expressions. These function expressions are like, if the input value is null, the output is null. So, what we do is that because each column in Arrow is represented using two distinct buffers, at least the validity buffer as well as the data buffer. So, what we do is that we evaluate the validity bits independently of the data bits. This removes all the branches conditions from our evaluation code. We can use loop factorization, SIMD acceleration. That’s why this is the most-efficient way of executing expressions. Because the validity of the input data is used to compute the validity of the output data, it doesn’t really matter that you’re evaluating junk data, because at the end, it will still be null if the validity bit indicates it’s null. So, there is an expression, COL3 = COL1 + COL2. You take the validity of column one and column two to a SIMD twice and on all the bits in parallel in a single instruction. You take the data buffers of both the columns, do a SIMD add on all the column values parallel in a single instruction. Then, finally, you can you use these two results to actually project the column, which is column three, without using any branches or conditions. A little bit more on expression decomposition. As I mentioned, that once you have built a couple or one or more tree based expressions, you build a projector for them. Projector internally uses the LLVM generator to generate IR, and compile to build an LLVM module. Now, for each set of expressions that you build a projector for, it will decompose each expression, generate IR for each expression. So, this is done in a loop that’s shown over here. Then, finally, we generate. We optimize them and then generate an LLVM module for the IR of each decompose expression. There are three different categories of function expressions that we are currently handling; NULL_IF_NULL, which I’ve already covered, which are the best case for expression the composition. If the input is null, output is null. In this case, what we do is that the input validity bits are pushed to the top of the expression tree because they will finally dictate whether the output is valid or not. This is the most efficient way of executing expressions in Gandiva, and covers a vast majority of functions, arithmetic operations, and all that. The second one is that output is actually never null. So, functions like isNull, isNumeric, right? Whether the actual data that you’re operating that your parser to the function is null or not, the output is never null, is never give you a true or false. So, in this case, there is no reason to actually push the validity to the top of the tree, but yes, to get the end result whether it is null or not, true or false, we need to use conditional evaluation on the actual data. The third category is actually very rare and slightly different. Let’s consider expression like, castStringToInt(X) + y + z. In this case, if the sub tree representing the expression castStringToInt is null or not, that will actually dictate the overall output as well. What we do is that we generate a local bitmap for castStringToInt(x). That bitmap indicates what is the output of castStringToInt on x and then that local bitmap is fed into y + z, and then we continue with decomposed expression evaluation as I showed in the previous slide, COL3 = COL1 + COL2. Handling case expressions. In Dremio, specifically, the current way we handle CASE expressions is to actually the parser transforms them into a large gigantic block of equal if-else statements. … also transforms them into a large, gigantic block of if-else statements, and by doing that, we lose opportunities for optimization. Like for example, a case expression is represented like this, when, which is a couple of conditions and then do this. And now there can be multiple of these cases. We have seen examples where each case is actually evaluating same condition again and again. So it will evaluate same condition across multiple cases, same validity across multiple cases. And LLVM actually helps with removing redundant evaluation combining these instructions into a single instruction, removing all the repeated evaluation. So what we do is that we detect nested if-else statements and use a single bitmap for them. Only the bitmap will be updated only for the matching if-else block. So the case statements are actually transformed into a large, giant chain of if-else, if-else plots. This is a high level architecture of how we have actually used Gandiva in Dremio. We have actually start … We initially started to only for the project operator. The plan is to gradually move more and more operators for native exhibition in LLVM, so we have just started with the project operator where the actual execution and project happens in LLVM using Gandiva. So for each operator in Dremio, there are primarily two phases, the setup phase and the actual execution and output phase. During the setup phase, what happens is that the project operator uses this native evaluator builder to provide the incoming record batch, the [inaudible 00:23:35] record batch, and the schema. And then at the setup phase, we have all these logical expressions materialized. So we add all those expressions to the evaluator builder. And then finally we will generate code for them, the native code for executing those expressions. As part of building, what we do is there is a module called native LLVM generator, in [inaudible 00:23:59] Java. What it does is that for each logical expression, it builds a tree. Because the actual tree building logic is written in Gandiva, we have built our app over it in Dremio, which is the Gandiva expression builder, and this basically talks to Gandiva Java APIs to build a tree for each logical expression. Then once we have built one or more expressions in tree form, we build a projector. Building the projector, generating IR for those expressions compiled into a native code, all of that is done in Gandiva C++, so what it does is that uses the JNI bridge to talk to Gandiva C++ and build the compiled LLVM module. The next step is to actually execute and project the column values for the expressions that you are working with. So what it does is we use an evaluator in Dremio. The evaluator talks to the LLVM generator, which has already generated a LLVM module in the previous steps, and it then talks to the Gandiva Java, which underneath again uses the JNI bridge to talk to Gandiva C++ where all the heavy lifting is done and the execution of native machine code in LLVM is done. So this gives an, four to five steps in how we have actually moved our project operator from Java based run time code generation to run time code generation in LLVM, using Gandiva. The initial performance numbers look aggressive, so what we did is that we tried to compare our existing infrastructure where we use Genino to generate bytecode at runtime using Java just in time compilation, and then compared that with our newly built infrastructure in Gandiva for our time code generation in LLVM. We chose five simple expressions like project N number of columns, project more columns, case statement with small number of cases, case statement with large number of cases and evaluated these on a JSON dataset having 500 million rows. And the tests were preliminarily run on a Mac machine which is basic hardware, nothing fancy. These are examples of the two tests. I’ll give the details in the next slide. So the improvement is really massive. For example, in the test for CASE-100, the improvement is 90x when we are evaluating them after runtime code generation in Gandiva as opposed to using Java just in time compilation. And the other test that we did is that Project 5 columns, Project 10 columns for a set of expressions, project column after evaluating a sum expression. For example this one. Select some of column one, plus column two, plus column three, sum of column one and doing some fancy arithmetic and then projecting the values. So the performance improvement seemed reasonable and kind of are an indication that this whole idea of doing runtime code generation in LLVM can be really useful for improving the performance of execution engines. That’s all I have. Gandiva is a new initiative by Dremio. It’s a fully open source project with an Apache license. Please get involved. Check out our Blog Post and the [inaudible 00:27:12] on github. Also get involved with the arrow community. It’s a relatively new open source project with very few committers and not that much code, and we are accepting [inaudible 00:27:23] links in different languages like Rubigo, Rust. And also Dremio has some sort, has a community edition as well, so please check that out. Thank you.

Try Dremio’s Interactive Demo

Explore this interactive demo and see how Dremio's Intelligent Lakehouse enables Agentic AI

Using LLVM to Accelerate Processing of Data in Apache Arrow

Table of Contents

Try Dremio’s Interactive Demo

Get Started Free

See Dremio in Action

Talk to an Expert

Make data engineers and analysts 10x more productive

Table of Contents

Try Dremio’s Interactive Demo

Additional Resources

Apache Iceberg: The Definitive Guide

Intro: Data Lake vs Warehouse by Dremio

Apache Iceberg: An Architectural Look Under the Covers

Get Started Free

See Dremio in Action

Talk to an Expert

Make data engineers and analysts 10x more productive