Vectorized Query Processing With Apache Arrow




Hello, everyone. My name is Siddharth, and I'll be talking about vectorized Query Processing using Apache Arrow, briefly touching upon some of the performance characteristics of using Arrow memory format. My name is Siddharth, I'm currently a software engineer at Dremio and also Committer for Apache Arrow project. The agenda is that we'll start with some brief introduction on vectorized query execution, then some ... I'll talk about SIMD processing along with columnar compression, like comparisons between early materialization and late materialization, and then some internal details on some of the vectorized implementations of query operators inside Dremio.

Vectorized execution, the conventional model of query processing is tuple at a time by planning where in the query execution tree each operator calls the next function under the child operator and the child operator produces the tuple. That's how tuples move up in the query execution tree. Also, the query processing logic is written keeping the tuple or the row format in mind. Now, on the other hand though, vectorized query processing model, there are two significant differences from the conventional model. The first is that the query processing logic is now shifted to column oriented. It is like you process individual columns at a time. For example, you can write tight for-loop and just run through a fixed set of column values and do the required processing, like evaluate filter or do the arithmetic. The memory representation of the columns is containing a data structure called a vector.

The second difference with the conventional processing model is that instead of pushing tuples through the execution tree, we push a block, containing a fixed set of ... fixed number of records as keeping the columnar format intact. Now, this query processing model has significant advantages for analytical workloads because these workloads generally have operations that touch a subset of columns in your Dataset, but a large fraction of rows along those columns and do operations like simple scans, aggregations. What you load from disk into memory, are columns. So you save on the disk higher bandwidth. Then when you want process those columns, you load only the required column data into the CPU. So that writing query processing algorithms is actually ... turns out to be very CPU efficient and improves the performance of analytical queries.

The main difference between the vectorized query processing model as well as the conventional model is that we amortize the number of function calls and the whole query algorithm turns out to be very CPU efficient. This is a diagrammatic representation of the two models. It's a simple query where we are selecting a column and also computing the average. On the left-hand side, the scan operator starts getting the tuples and then pushes the tuples to the filter operator. Then it starts pushing the qualifying tuples up into the AG operator, which does the computation. On the right-hand side, a similar sort of thing happens, but yes, instead of pushing tuples, we are pushing a fixed number of records as individual vectors. That's the fundamental difference between the two models.

Moving on to the SIMD acceleration. SIMD stands for Single Instruction Multiple Data. It's basically a form of instructions which allow the CPU to operate on multiple operands in a single instruction. It's basically used for optimizing computationally expensive code paths by consuming more data in a single CPU instruction. They have been widely used in graphics and other related areas because they do kind of repetitive operation ... numerical operations on large arrays of numbers.

Now, why do we tend to use them in databases? The main reason is that the columnar format lends itself naturally for efficient SIMD processing, like well-aligned, dense areas. They are amenable to SIMD processing modular, some alignment and pairing requirements that should be taken care of in the code. The other main reason is that all the modern data processing system, especially in the analytical space, they are ... they tend to be a memory centric and the main emphasis on improving the CPU efficiency. So leveraging any characteristics associated with hardware acceleration can turn out to be useful. The fundamental building block for using SIMD inside ... for columnar databases is that you ... in a single CPU instruction, you operate on multiple common values.

Most instruction set architectures have support for SIMD instructions. If you take example of x86 family, it dates back all the way to '99 when they came up in Pentium processor. Now, they have significantly advanced all the instruction sets related to SIMD. In today's world, I think the AVX512 with that, you can actually process like 16 four-biter integer values in a single instruction. The example that I've given here is for ... with SSE 128 with SIMD register with which you can actually operate on four 32-bit floating point numbers in a single instruction. I believe all of these instruction sets have support for both single precision and double precision floating point numbers.

Now, how do we actually use SIMD in our code? The first, most obvious way is to actually write a SIMD code that leverages these instructions, but that's not a very profitable approach. The second approach is actually to just delegate into Compiler and hope that Compiler will actually auto-vectorize your code, but in databases, the expressions and the code is generally not that simple for the Compilers to optimize, especially when, in my experience, with SIMD programming, if you're using function pointers and global variables, the Compilers can actually fail to vectorize the code, where you as a developer could have done that.

The most obvious way, and I ... most preferable way of using SIMD instruction is actually to use a intrinsics. They are basically C style functions which allow you to leverage the instructions of the underlying architecture without actually writing assembly code and they are not portable. Here is some pseudo code with using SIMD instructions, using SSE 128-bit registered. So underscore, underscore, that's actually the way they actually ... the naming of the data types as well as the vectors. This represents a 128-bit vector that can actually store four 32-bit point ... floating point numbers. Here, what we are doing is that we are computing the expression that you add the values of two columns, column one plus column two, and store the result in the third column.

The normal way of writing this code is that you will actually run through the column values and do like ACI is equal to AI plus BI. Then in each loop, in each iteration, you will do like as many loads, as many stores and then add instructions. What SIMD is doing is that in each iteration of the loop, you're actually operating on four common values in a single instruction. You are significantly reducing the number of loads stores in that, and that improves the performance of query processing. That was just a simple example of how we can potentially use SIMD in query processing logic.

Here's another example where let's say the query is you ... you're doing a clause, a predicate evaluation of one column, and then selecting one or more other columns. You load the SIMD ... the data buffer into the 128-bit SIMD Register and then do a SIMD compare. The general output of a SIMD compared is actually an element mask indicating where each item in the mask is actually all ones or all zeroes, 32-bit numbers, indicating which column value actually passed the filter and which didn't. Then because we have the buffers for the vectors, you load the validity and then do assembly and on all the values in parallel. And then finally you've ... you would have ... What you have done is that you have evaluated what ... which column values ... which of the values in the column actually passed my filter.

Now, once you have produced the final bit mask, you can use that in a simple loop just to do bit shift and populate the output vector without using any branches. One of the apparent benefits of SIMD is that yes you can ... it gives your data a little parallelism. You can operate on multiple column values in a single instruction. But the other benefit, which is not very apparent, is that you can potentially minimize or at least eliminate the branches from your code. Because on most modern pipeline architectures, the branch predictor is good, but the sheer presence of branches, that actually prevents the CPU from the instructions properly. The lesser of the number of branches, the core tends to be more efficient. That's also one of the advantages of SIMD.

Moving on to columnar compression. Some of the general purpose compression algorithms that are heavily used in OLTP database is like LZO/ZLIB, they get phenomenal compression ratio butt at the cost of query performance. In columnar, well, we have some lightweight compression schemes that actually trade off ratio for query performance. We can actually ... because the columnar format actually stores each column individually, so we can decide to encode or compress columns on a per-column basis. We can use characteristics like cardinality of the column, whether the column is sorted, what is the data type of the column, and decide upon the compression strategy for the column. You can actually use these compression strategies on varchar columns and then compress them into fixed width areas which are then ... further compliment the SIMD processing technology. One of the compression schemes, the dictionary encoding where you can ... which allows you to rewrite FILTER on varchars ... varchar column as to FILTER on fixed width dictionary values.

We'll talk about an example. Here, you have a column which is a country column which has values like United States, China, India, France, and the column ... and the query is that you select something from table T, where country is France. You do predicate evaluation on the country column. You first consult the dictionary, because the column is dictionary encoded, and give that ... and get the dictionary value for the string France. Then you load that into the SIMD Register, which is the value for ... all the values into the register. And then you load the encoded column values, which is like two, five, four, one, three, seven of the column into ... again, into SIMD Register and you do parallel compare. What we have done here is that a filter on a varchar column has now been turned into a more efficient algorithm by rewriting as a filter on a fixed width column.

Moving on to materialization in column stores. Whether the column store is processing data in a columnar format or restoring data in a column format, some point in the query plan, we need to decide when do we actually stitch the values from different columns together into tuples and project the end result back to the user? Now the question is, when do we do this? The first strategy is called as early materialization, where as soon as we read the desired columns that are touched by the query, we found tuples. We read the columns in ... from disk or if it's only a memory on the database, we just a touch those columns and get the values and then form the tuples. Then feed those tuples into a conventional tuple-based query processor and then do the processing.

The disadvantages of this technique is that you have potentially lost all the opportunity that could have of optimizations that could have been applicable on the underlined columnar format. Now, because you have formed tuples, you have lost the ability to directly operate uncompressed columnar data because farming tuples requires you to decompress the data up front. Secondly, because no processing has been done, like you haven't done any filter processing or any other kinds of processing, you have potentially formed more tuples.

Because the format on which you'll be writing the query processing logic is not tuples or rows, that is far less CPU efficient as compared to the columnar format. Those are the disadvantages of early materialization. The fundamental idea behind late materialization is that you work on the columnar format till very late in the query plan, until it's absolutely necessary to actually stitch the values together from different columns and formed tuples. The storage format is also columnar on disk or in memory and the query processing logic is also written to work on the columnar format. And that's the idea behind late materialization.

So let's say there is a query where you actually do a filter on one column and you project the other column. Once you have done efficient filter processing on a single column, on the individual column, what you construct is some sort of position list. We call it a selection vector. Basically tells you the offsets of the individual column values or cells that passed the filter. Then you use that to actually predict the values from a different column. You just operate one column at a time as opposed to operating on tuples. This compliments the ability or directly operate on compressed data because we do the entire processing on columns and finally found the tuples, when it comes to projecting the results to end user.

Move on to some design and implementation of how we have implement vectorized hash aggregation at Dremio. We did some performance testing and figured out that as far as the insertion and look up in hash table is concerned, the columnar format doesn't turn out to be very useful. We have some sort of a mixture implementation where once the incoming vectors or the batch of data arrives into the operator, as far as the key columns are concerned, we pivot them into row-wise representation. The hash table is kind of segmented. It's segmented into two primary blocks. The first block contains only the fixed width key column data, and the second block, this contains the variable width key column data. When the data arrives into the operator, we pivot them into the row-wise representation and then this row-wise representation is then used to insert data into the hash table.

The other columns, which are the aggregation columns, some which you want to compute, some average of other arithmetic. They are kept intact in columnar format. Then we build association between the hash table buckets, or the positions in the hash table, and the corresponding items which are accumulated in the sum vector or the main vector of max vector. That's how we have sort of a mixed implementation of hash aggregation, which is not fully vectorized but it's not fully row-wise either.

Most of the operators in our query execution engine require some sort of copy at some point in the pipeline. For example, you are actually filtering ... doing filter processing on one column and then you're projecting other columns. As I mentioned in my previous slide, reconstruct selection vectors as a result of the filter processing and then use that ... the offsets captured in the selection vector to copy the data from the column that you're projecting into the result buffer. This copy code has been written in a very efficient manner where we have actually bypassed all the APIs to interact with underlying vectors. We directly interact with underlying memory, like using APIs like platform dependent. It's like very high-performance CC++ style code.

Dremio's query execution engine actually leverages the on-disk Parquet columnar format for storage of materialized views, for accelerating the analytical queries. When we read these materialized views into ... back into our memory format, which is also called Columnar, our initial implementation of the Parquet reader was row-wise, which was actually glossing over all the advantages that could have been taken that both the source and the target formats are in fact columnar. Then we rewrote the implementation of our reader to be vectorized, as opposed to being row-wise, where we can actually take advantage of the underlying columnar format, the encodings that it is using, and it turns out to be far more efficient than the row-wise counterpart. That's all what I have. Join the Arrow community. Please check out our GitHub. Please, play with the source code. Thank you.

Speaker 2:

Hey. As far as like columnar datasets, do you think that there will ever be a DBMS that can do both run time optimization on ... kind of like a [dag 00:16:54] on aggregation-level queries and then optimizing that to join on a hash map for relational queries?


Yeah, so some databases, they are actually built for like both, both worlds, giving you the best of both worlds. Where you actually have a row format giving a high performance OL , like single role selects or DMLs, but then they also have a separate sub component which is just the columnar format for accelerating the analytical queries. This conversion from row format to the columnar format  sometimes happens upfront, like what Oracle does. You can ... you're allowed to maintain both row format as well as columnar format and sometimes it happens, where depending upon the queries you can actually ... they were the representation back and forth between row-wise and columnar. Yeah, I think it's a valid idea to have such kind of a database with having both representations. Does that answer your question?

Speaker 3:

So this work that you did that ... Is this also applicable to the GPU data plane?


Let me clarify, like some of the things that I have talked about here are ... some are still in the pipeline that are still going on. Some of the stuff, yes, that has already been completed. We'll work towards accepting open source projects as well. Does that answer your question?

Speaker 3:

Yeah. No, I'm just curious because what you mentioned or what simply was CPU?


Mm-hmm (affirmative).

Speaker 3:

Is that ... that's a different thing for GPUs, right?


Yeah, but like when we actually plan to do this stuff inside, let's say Arrow. Our goal would be to do this in, actually, a ... in a generic manner. Also, like not have multiple implementations for different languages, but also be able to use that on CPUs as well as GPUs. I think that would be our goal when we eventually do this inside Arrow.

Speaker 3: