12 minute read · September 19, 2018
Adding a User Defined Function to Gandiva
· Senior Architect, Dremio
You’re probably already aware of the recently announced Gandiva Initiative for Apache Arrow, but for those who need a refresher this is the new execution kernel for Arrow that is based on LLVM. Gandiva provides very significant performance improvements for low-level operations on Arrow buffers. We first included this work in Dremio to improve the efficiency and performance of analytical workloads on our platform, which will become available to users later this year. In this post we will provide a brief overview for how you would develop a simple function in Gandiva as well as how to submit it to the Arrow community.
Fundamentally, Gandiva uses LLVM to do just-in-time compilation of expressions. The dynamic part of the LLVM IR code is generated using an IRBuilder, and the static part is generated at compile time using clang. At run-time, both the parts are combined into a single module and optimised together. For most new UDFs, adding a hook in the static IR generation technique is sufficient. More details about Gandiva layering and optimisations can be found here.
Function categories
The functions supported in Gandiva are classified into one of three categories based on how they treat null values. Gandiva uses this information during code-generation to reduce branch instructions, and thereby, increasing CPU pipelining. The three categories are as follows:
1. NULL_IF_NULL category
In this category, the result of the function is null if and only if one or more of the input parameters are null. Most arithmetic and logical functions come under this category.
For these functions, the Gandiva layer does all of the null handling. The actual function definition can simply ignore nulls.
For example:
int32 my_average(int32 first, int32 second){ return (first + second) / 2; }
2. NULL_NEVER category
In this category, the result of the function cannot be null i.e the result is non-nullable. But, the result value depends on the validity of the input parameters. The function prototype includes both the value and the validity of each input parameter.
bool is_bigger(int32 first, bool is_first_valid, int32 second, bool is_second_valid) { If (!is_first_valid || !is_second_valid) { return false; } return (first > second); }
3. NULL_INTERNAL category
In this category, the result of the function may be null based on some internal logic that depends on the value of the internal values/validity. The function prototype includes both the value and the validity of each input parameter, and a pointer for the result validity (bool).
int32 convert_utf8_to_int32(const char *first, int first_len, bool is_first_valid, bool *is_result_valid) { If (!is_first_valid || first_len == 0) {* is_result_valid = false;* return 0; } // .. Code here to convert string to integer .. // is_result_valid = true; return int_value; }
Steps for adding a UDF in Gandiva
Ok, now we’ll get into the details explaining the steps required for adding a UDF in Gandiva. We’ll show a simple example using the NULL_IF_NULL category function (as previously described) that returns the average of two integers.
1. Download and build Gandiva
Clone the Gandiva git repository and build it on your test machine or desktop. Please follow the instructions here.
2. Add the code for the new function
We will add our simple function to the existing arithmetic_ops.cc. For more complex functions or types it’s better to add to a new file.
FORCE_INLINE int32 my_average_int32_int32(int32 left, int32 right) { return (left + right) / 2; }
3. Add function details in the function registry
The function registry includes the details of all supported functions in Gandiva. Add this line to the pc_registry_ array in function_registry.cc
NativeFunction("my_average", DataTypeVector{int32(), int32()}, int32(), true /*null_safe*/, RESULT_NULL_IF_NULL, "my_average_int32_int32"),
This registers our simple function with:
- External name as my_average
- Takes two input parameters: both of type int32
- Returns output parameter of type int32
- Function category NULL_IF_NULL
- Implemented in function my_average_int32_int32
4. Add unit tests
For this simple function, we will skip adding a unit test. For complex functions it’s recommended to add a unit test for the new function.
We’ll add an integ test by building a projector in projector_test.cc that computes the average of two columns.
TEST_F(TestProjector, TestIntMyAverage) { // schema for input fields auto field0 = field("f0", int32()); auto field1 = field("f1", int32()); auto schema = arrow::schema({field0, field1}); // output fields auto field_avg = field("avg", int32()); // Build expression auto avg_expr = TreeExprBuilder::MakeExpression("my_average", {field0, field1}, field_avg); std::shared_ptr<Projector> projector; Status status = Projector::Make(schema, {avg_expr}, &projector); EXPECT_TRUE(status.ok()); // Create a row-batch with some sample data int num_records = 4; auto array0 = MakeArrowArrayInt32({1, 2, 3, 4}, {true, true, true, false}); auto array1 = MakeArrowArrayInt32({11, 13, 15, 17}, {true, true, false, true}); // expected output auto exp_avg = MakeArrowArrayInt32({6, 7, 0, 0}, {true, true, false, false}); // prepare input record batch auto in_batch = arrow::RecordBatch::Make(schema, num_records, {array0, array1}); // Evaluate expression arrow::ArrayVector outputs; status = projector->Evaluate(*in_batch, pool_, &outputs); EXPECT_TRUE(status.ok()); // Validate results* EXPECT_ARROW_ARRAY_EQUALS(exp_avg, outputs.at(0)); }
5. Build Gandiva
Build Gandiva and run the the tests.
$ cd <build-directory> $ make projector_test_Gandiva_shared [ 71%] Built target Gandiva_obj_lib [ 78%] Built target Gandiva_shared [ 85%] Built target gtest [ 92%] Built target gtest_main [100%] Built target projector_test_Gandiva_shared $ ./integ/projector_test_Gandiva_shared Registry has 478 pre-compiled functions Running main() from /Users/ravindra/work/Gandiva/cpp/debug/googletest-src/googletest/src/gtest_main.cc [==========] Running 11 tests from 1 test case. [----------] Global test environment set-up. [----------] 11 tests from TestProjector <SNIPPED> [ RUN ] TestProjector.TestIntMyAverage [ OK ] TestProjector.TestIntMyAverage (21 ms) [----------] 11 tests from TestProjector (638 ms total) [----------] Global test environment tear-down [==========] 11 tests from 1 test case ran. (638 ms total) [ PASSED ] 11 tests.
6. Submit your function and raise a PR request
First you must push the changes to your fork and create a PR against an upstream project. This is best done by pushing to your local repo, and raising a PR request on the Gandiva page using the diff with your repo.
Once the PR is created, the community can review the code changes and merge.
The full code listing for this simple function is present in this PR - it includes both a projector test and a filter test.
Other Interesting experiments to try
The following experiments are not required but are interesting to explore and see what optimizations are being done by the LLVM function passes.
Check out the pre-compiled IR code
You can look at the pre-compiled code for your newly added function using the llvm-dis utility.
$ cd <build-directory> $ llvm-dis < ./irhelpers.bc > /tmp/ir.txt
Open the ir.txt file and search for myaverage. You should see the IR code (unoptimized)
; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable define i32 @my_average_int32_int32(i32, i32) local_unnamed_addr #0 { %3 = add nsw i32 %1, %0 %4 = sdiv i32 %3, 2 ret i32 %4 }
Check out the post-optimised IR code
First, modify the optimiser function to dump the IR code. The easiest way to do this is by modifying this line (move the call to DumpIR to outside the if condition.)
Build and run the test again.
$ cd <build-directory> $ make projector_test_Gandiva_shared [ 71%] Built target Gandiva_obj_lib [ 78%] Built target Gandiva_shared [ 85%] Built target gtest [ 92%] Built target gtest_main [100%] Built target projector_test_Gandiva_shared $ ./integ/projector_test_Gandiva_shared
You should see the optimized IR code on stdout. You’ll notice that the:
- function has been inlined (no function calls)
- code has been vectorized
As this runs the vectorized snippet shows the function processing four integers at a time: adds 4 integers to 4 integers, divides all the 4 integers by 2 and so on…
vector.body: .. %16 = add nsw <4 x i32> %wide.load18, %wide.load %17 = add nsw <4 x i32> %wide.load19, %wide.load17 %18 = sdiv <4 x i32> %16, <i32 2, i32 2, i32 2, i32 2> %19 = sdiv <4 x i32> %17, <i32 2, i32 2, i32 2, i32 2> .. store <4 x i32> %18, <4 x i32>* %21, align 4, !alias.scope !8, !noalias !10 store <4 x i32> %19, <4 x i32>* %23, align 4, !alias.scope !8, !noalias !10 %index.next = add i32 %index, 8
In this article, we gave an overview of adding a simple function to Gandiva. In subsequent articles, we’ll extend this to functions for other categories and functions using libraries from c++ std or boost.
We also have more features coming that deal with supporting pluggable function repositories and more optimizations (eg. special handling for batches that have no nulls.) More to follow!!!