12 minute read · September 19, 2018

Adding a User Defined Function to Gandiva

Ravindra Pindikura

Ravindra Pindikura · Senior Architect, Dremio

You’re probably already aware of the recently announced Gandiva Initiative for Apache Arrow, but for those who need a refresher this is the new execution kernel for Arrow that is based on LLVM. Gandiva provides very significant performance improvements for low-level operations on Arrow buffers. We first included this work in Dremio to improve the efficiency and performance of analytical workloads on our platform, which will become available to users later this year. In this post we will provide a brief overview for how you would develop a simple function in Gandiva as well as how to submit it to the Arrow community.

Fundamentally, Gandiva uses LLVM to do just-in-time compilation of expressions. The dynamic part of the LLVM IR code is generated using an IRBuilder, and the static part is generated at compile time using clang. At run-time, both the parts are combined into a single module and optimised together. For most new UDFs, adding a hook in the static IR generation technique is sufficient. More details about Gandiva layering and optimisations can be found here.

Function categories

The functions supported in Gandiva are classified into one of three categories based on how they treat null values. Gandiva uses this information during code-generation to reduce branch instructions, and thereby, increasing CPU pipelining. The three categories are as follows:

1. NULL_IF_NULL category

In this category, the result of the function is null if and only if one or more of the input parameters are null. Most arithmetic and logical functions come under this category.

For these functions, the Gandiva layer does all of the null handling. The actual function definition can simply ignore nulls.

For example:

int32
my_average(int32 first, int32 second){
   return (first + second) / 2;
  }

2. NULL_NEVER category

In this category, the result of the function cannot be null i.e the result is non-nullable. But, the result value depends on the validity of the input parameters. The function prototype includes both the value and the validity of each input parameter.

bool
is_bigger(int32 first,  bool is_first_valid, int32 second, bool is_second_valid) {
  If (!is_first_valid || !is_second_valid) {
      return false;
      }
      return (first > second);
      }

3. NULL_INTERNAL category

In this category, the result of the function may be null based on some internal logic that depends on the value of the internal values/validity. The function prototype includes both the value and the validity of each input parameter, and a pointer for the result validity (bool).

int32
convert_utf8_to_int32(const char *first,
                                    int first_len, bool is_first_valid, bool *is_result_valid) {
   If (!is_first_valid || first_len == 0) {*
     is_result_valid = false;*
     return 0;
   }
 	// .. Code here to convert string to integer .. //
 	 is_result_valid = true;
return int_value;
}

Steps for adding a UDF in Gandiva

Ok, now we’ll get into the details explaining the steps required for adding a UDF in Gandiva. We’ll show a simple example using the NULL_IF_NULL category function (as previously described) that returns the average of two integers.

1. Download and build Gandiva

Clone the Gandiva git repository and build it on your test machine or desktop. Please follow the instructions here.

2. Add the code for the new function

We will add our simple function to the existing arithmetic_ops.cc. For more complex functions or types it’s better to add to a new file.

FORCE_INLINE
int32 my_average_int32_int32(int32 left, int32 right) {
  return (left + right) / 2;
}

3. Add function details in the function registry

The function registry includes the details of all supported functions in Gandiva. Add this line to the pc_registry_ array in function_registry.cc

NativeFunction("my_average", DataTypeVector{int32(), int32()}, int32(), true /*null_safe*/,
            RESULT_NULL_IF_NULL, "my_average_int32_int32"),

This registers our simple function with:

  1. External name as my_average
  2. Takes two input parameters: both of type int32
  3. Returns output parameter of type int32
  4. Function category NULL_IF_NULL
  5. Implemented in function my_average_int32_int32

4. Add unit tests

For this simple function, we will skip adding a unit test. For complex functions it’s recommended to add a unit test for the new function.

We’ll add an integ test by building a projector in projector_test.cc that computes the average of two columns.

TEST_F(TestProjector, TestIntMyAverage) {

  // schema for input fields
  auto field0 = field("f0", int32());
  auto field1 = field("f1", int32());
  auto schema = arrow::schema({field0, field1});

  // output fields
  auto field_avg = field("avg", int32());

  // Build expression
  auto avg_expr = TreeExprBuilder::MakeExpression("my_average", {field0, field1}, field_avg);
  std::shared_ptr<Projector> projector;
  Status status = Projector::Make(schema, {avg_expr}, &projector);
  EXPECT_TRUE(status.ok());

  // Create a row-batch with some sample data
  int num_records = 4;
  auto array0 = MakeArrowArrayInt32({1, 2, 3, 4}, {true, true, true, false});
  auto array1 = MakeArrowArrayInt32({11, 13, 15, 17}, {true, true, false, true});

  // expected output
    auto exp_avg = MakeArrowArrayInt32({6, 7, 0, 0}, {true, true, false, false});

  // prepare input record batch
  auto in_batch = arrow::RecordBatch::Make(schema, num_records, {array0, array1});

  // Evaluate expression
  arrow::ArrayVector outputs;
  status = projector->Evaluate(*in_batch, pool_, &outputs);
  EXPECT_TRUE(status.ok());

  // Validate results*
  EXPECT_ARROW_ARRAY_EQUALS(exp_avg, outputs.at(0));

}

5. Build Gandiva

Build Gandiva and run the the tests.

$ cd <build-directory>
$ make projector_test_Gandiva_shared
[ 71%] Built target Gandiva_obj_lib
[ 78%] Built target Gandiva_shared
[ 85%] Built target gtest
[ 92%] Built target gtest_main
[100%] Built target projector_test_Gandiva_shared

$ ./integ/projector_test_Gandiva_shared
Registry has 478 pre-compiled functions
Running main() from /Users/ravindra/work/Gandiva/cpp/debug/googletest-src/googletest/src/gtest_main.cc
[==========] Running 11 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 11 tests from TestProjector
 <SNIPPED>
[ RUN      ] TestProjector.TestIntMyAverage
[       OK ] TestProjector.TestIntMyAverage (21 ms)
[----------] 11 tests from TestProjector (638 ms total)
[----------] Global test environment tear-down
[==========] 11 tests from 1 test case ran. (638 ms total)
[  PASSED  ] 11 tests.

6. Submit your function and raise a PR request

First you must push the changes to your fork and create a PR against an upstream project. This is best done by pushing to your local repo, and raising a PR request on the Gandiva page using the diff with your repo.

Once the PR is created, the community can review the code changes and merge.

The full code listing for this simple function is present in this PR - it includes both a projector test and a filter test.

Other Interesting experiments to try

The following experiments are not required but are interesting to explore and see what optimizations are being done by the LLVM function passes.

Check out the pre-compiled IR code

You can look at the pre-compiled code for your newly added function using the llvm-dis utility.

$ cd <build-directory>
$ llvm-dis < ./irhelpers.bc > /tmp/ir.txt

Open the ir.txt file and search for myaverage. You should see the IR code (unoptimized)

; Function Attrs: alwaysinline norecurse nounwind readnone ssp uwtable
define i32 @my_average_int32_int32(i32, i32) local_unnamed_addr #0 {
  %3 = add nsw i32 %1, %0
  %4 = sdiv i32 %3, 2
  ret i32 %4
}

Check out the post-optimised IR code

First, modify the optimiser function to dump the IR code. The easiest way to do this is by modifying this line (move the call to DumpIR to outside the if condition.)

Build and run the test again.

$ cd <build-directory>
$ make projector_test_Gandiva_shared
[ 71%] Built target Gandiva_obj_lib
[ 78%] Built target Gandiva_shared
[ 85%] Built target gtest
[ 92%] Built target gtest_main
[100%] Built target projector_test_Gandiva_shared
$ ./integ/projector_test_Gandiva_shared

You should see the optimized IR code on stdout. You’ll notice that the:

  1. function has been inlined (no function calls)
  2. code has been vectorized

As this runs the vectorized snippet shows the function processing four integers at a time: adds 4 integers to 4 integers, divides all the 4 integers by 2 and so on…

vector.body:
 ..
 %16 = add nsw <4 x i32> %wide.load18, %wide.load
 %17 = add nsw <4 x i32> %wide.load19, %wide.load17
 %18 = sdiv <4 x i32> %16, <i32 2, i32 2, i32 2, i32 2>
 %19 = sdiv <4 x i32> %17, <i32 2, i32 2, i32 2, i32 2>
..
store <4 x i32> %18, <4 x i32>* %21, align 4, !alias.scope !8, !noalias !10
store <4 x i32> %19, <4 x i32>* %23, align 4, !alias.scope !8, !noalias !10
%index.next = add i32 %index, 8

In this article, we gave an overview of adding a simple function to Gandiva. In subsequent articles, we’ll extend this to functions for other categories and functions using libraries from c++ std or boost.

We also have more features coming that deal with supporting pluggable function repositories and more optimizations (eg. special handling for batches that have no nulls.) More to follow!!!

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.