The Columnar Roadmap: Apache Parquet and Apache Arrow



Julien Le Dem:

I'm Julien, and today I'm going to talk about the columnar roadmap. In particular, I'm going to talk about Apache Parquet and Apache Arrow. Full disclosure, I co-created Parquet while I was at Twitter. Currently, I'm an architect at Dremio, where we're building data analytics tools on top of a lot of open source work, and I've been involved in various Apache projects over the years. Today, we're going to talk about building community-driven standards around Parquet and Arrow. I'm going to start by explaining a little bit what the benefits of columnar representation are, and why everyone is doing it. Then I'll talk in detail about better vertical integration between Parquet and Arrow, and I'll tell you what that is, and then about Arrow-based communication.

So first, I think the most important part of this is how to build a community, because in open source, really, the most important thing is not the source. I mean, it needs to be good quality, but the community around it is much more important and that's what makes it successful and useful.

First, Parquet came from a common need for an on-disk columnar representation. It's inspired by a lot of work in academia and by the Google Dremel paper, and a lot of databases, like Vertica, use a columnar representation to speed up analysis. Arrow is similar, coming from a common need for an in-memory columnar representation. If you look at papers like MonetDB, the papers that are the beginning of vectorized execution, it's the next step in making SQL execution and all those things much faster.

For Parquet, I started prototyping something at Twitter, and the Impala team at Cloudera was prototyping a columnar representation for Impala. We started working together and merged our designs. They were coming more from the native code side, C and C++, and I was coming more from the Java side, and we put our efforts together. After that, more companies joined. Criteo, which is an ad optimization company, and Netflix joined the effort, and before we knew it, it started being integrated in many of those projects. Spark SQL made it its default format, Drill made it its default format, obviously it's Impala's default format, and it started being integrated in everything: Hive, all of those things.

So we're building on that success, on having already done the effort of building that community and of having all those projects talk to each other and agree on how we're going to represent stats and how we're going to represent things to have a standard, because it's just that much more valuable when we all use the same format, and you have a lot of options to use for all of those things.

So building on that, we created Arrow, because all of those projects were also looking at the next step, which is vectorizing execution to make it 10 times faster, and for that we need to come up with an in-memory columnar representation. So to benefit everyone, we may as well come up with the same one and agree on it. So we started Apache Arrow as a top-level project, and it involves people from all of those projects on the PMC, to make sure that we first agree on what we are doing and then start building it and making it a standard for the ecosystem.

Before, when you wanted to integrate things together: on top, we have the execution frameworks, and on the bottom we have the storage layers. This is not exhaustive, right? This is to give you an idea. You need to figure out a common representation of data for each of those things, right? They are all slightly different, and they have different memory representations. Most often now they use Parquet, but Drill or Impala each built their own optimized vectorized reader that integrates with what they have, so there is a lot of duplicated effort. Also, in many cases, for example if we look at the Spark and Python integration in PySpark, there is a lot of overhead in finding a common representation. Often it's the lowest common denominator, and there's a lot of overhead just serializing and deserializing data and converting it from one representation to the other.

So, thanks to Arrow, the goal now is that if we agree on a common representation that is very efficient for doing data processing of many sorts, then first, it's much more efficient, and second, we remove all of the overhead of converting that representation, because it's already the native representation of a lot of those things, so there's very little cost of serialization.

There's no cost of serialization and deserialization, and it's already a much better representation for doing query execution, for example. So, I'm going to talk a little bit about the columnar representation. So, easy, right? Before, we were doing rows; now we are doing columns.

Columnar representation: if you think of a table, here you have a two-dimensional representation of a table. You have columns, you have rows, but really, when we write a table to a computer, on disk or something, we end up with a linear representation, right? In a row-oriented format, we take those rows and just put every row one after the other. Right? First row, second row, and so on.

So, you end up interleaving values of different types, from different columns. Most of the time when you do a query, you actually query a subset of the columns, and so it's much more efficient to skip the data you don't need when all the values of the columns you want are stored together. In a columnar representation we put one column after the other. So, when we select A, it's much faster to scan all of the A's and skip everything else.
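A minimal sketch of the two layouts in Python. The table contents and the function names are made up for illustration; real formats store bytes, not Python lists.

```python
# Hypothetical three-column table: (a: int, b: str, c: float).
rows = [(1, "x", 0.5), (2, "y", 1.5), (3, "z", 2.5)]

def row_layout(rows):
    """Row-oriented: values from different columns are interleaved."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Columnar: all values of one column are contiguous."""
    return [list(column) for column in zip(*rows)]

def select_a_from_rows(flat, num_columns):
    # Has to stride over every row, stepping past b and c each time.
    return flat[0::num_columns]

def select_a_from_columns(columns):
    # Column a is one contiguous run; b and c are never touched.
    return columns[0]
```

Both selections return the same values; the difference is how much unrelated data sits between them in the linear layout.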

Another benefit is that we encode multiple values of the same type together. Instead of having strings, integers, dates and so on mixed, we have all the strings together, all of the integers together, all of the dates together. We can use encodings that are much more efficient and that compress much better. So, usually a columnar representation compresses much better than a row-oriented one.

Another question is, "Look, you have Parquet as a columnar representation already, why didn't you just put it in memory?" Well, the thing is, there are different trade-offs. When we store on disk, it's mainly storage: it's going to be written once and read many times, and read from different points of view. We may access different columns, with different filters, and so there's more need for compactness, and we're optimizing more for scanning, doing scans very fast on this data. So, there's mostly streaming access, and the priority is reducing IO. We still want low CPU cost, but reducing IO is more important.

In memory, it's more transient. Usually the data has been loaded in memory for one query in particular, even though you could do caching as well and keep things around. The data is more transient and the highest priority is CPU throughput. It still needs to be compact, but we favor making it fast over making it more compact. Because it's in memory, latency is lower, and you want to be able to have random access: access something from its index in constant time.

So, if we look a little bit into Parquet: both Arrow and Parquet support nested data structures, and Parquet is a compact format because it uses type-aware encodings. You know it's a string, or it's an integer; if you know it's an integer and you know the max value, you can use fewer bits. Things like that, and you get better compression than just throwing a general-purpose, brute-force compression algorithm like Gzip at it. You can still use Gzip on top. You can use LZO or Snappy, or the more modern Zstandard or Brotli. But the simple thing is, when you know the types, you can use a simpler algorithm that compresses better and much faster, because you don't have to just try to compress random bytes. The encodings understand the data better.
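To make the "fewer bits when you know the max value" point concrete, here is a minimal Python sketch of bit packing. It is not Parquet's actual byte layout; the function names are hypothetical.

```python
def bit_width(max_value):
    """Bits needed to store any integer in [0, max_value]."""
    return max(1, max_value.bit_length())

def pack(values, width):
    """Pack non-negative ints into bytes, `width` bits each, little-endian."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")

def unpack(data, width, count):
    """Inverse of pack: read `count` values of `width` bits each."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]
```

With eight values no larger than 7, `bit_width(7)` is 3, so `pack` produces 3 bytes where eight 32-bit integers would take 32.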

It optimizes IO by doing projection push down, which is reading just the columns you need, and predicate push down, which is pushing down the filters.

The nested data representation is borrowed from the Google Dremel paper, and that's not what I'm talking about today, so if you want more information go to that blog post. I'm going to share the slides as well. But basically it's a generalization: instead of using a bit, where 0 is null and 1 is defined, like you do for a flat schema, you use a number, which is 0 if it's null, 1 if it's defined at the first level, 2 if it's defined at the second level, 3 if it's defined at the third level. So, it's kind of recording where the null is in the tree of the schema, and it's the same thing for repeated values like lists.

It will record at what level you are starting a new list when you have nested lists. And the advantage is that these are still very small integers that you can store in a few bits, so it's a small overhead and it generalizes to any nested data structure.
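A small Python sketch of the definition level idea, assuming a hypothetical schema with three nested optional fields `a.b.c` and records modeled as plain dicts. It only covers definition levels, not repetition levels.

```python
def definition_level(record, path):
    """Count how many optional fields along `path` are defined."""
    level = 0
    node = record
    for name in path:
        node = node.get(name) if isinstance(node, dict) else None
        if node is None:
            break           # the null sits at this depth of the schema tree
        level += 1
    return level

path = ["a", "b", "c"]
levels = [
    definition_level({}, path),                          # a is null
    definition_level({"a": {}}, path),                   # a defined, b null
    definition_level({"a": {"b": {}}}, path),            # b defined, c null
    definition_level({"a": {"b": {"c": "x"}}}, path),    # c defined
]
```

Here `levels` comes out as `[0, 1, 2, 3]`, matching the 0, 1, 2, 3 scheme described above.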

To summarize Parquet: with projection push down you read only the columns you need. With predicate push down you can narrow down to only the rows you need, or to approximately those rows. So you minimize IO by reading as little as you can.

So, this slide gives more details about the Parquet representation. This is your Parquet file, and it's columnar per row group, because you want to be able to load one row group, which is a group of rows, in memory at once. So inside the file, it splits into row groups to make each of them a reasonable size, which needs to fit in memory, and in each row group it stores each column, one after the other. Inside a column we have multiple pages, each with some metadata before it. A page is the smallest unit of storage in Parquet; it has a certain number of values, and it is encoded and compressed.

If we look inside a page, you have a header followed by the repetition and definition levels. This is a small overhead that contains the information, if the data is nested or nullable, about where the nulls are and where the lists are starting. Then you have all of the values. It's the same thing as flattening all of the values for that column and storing them one after the other, and then you have extra information to figure out where the nulls are and which values are part of which list, if the data is repeated.

To simplify, you can consider the definition levels as just being zeros and ones that say whether a value is defined or null. To encode that, we have a hybrid encoding, and there are two modes to it.

One is very simple: it's run-length encoding. Run-length encoding is for when you have the same value many times. Let's say it's all defined or it's all null, so you would have a lot of zeros or a lot of ones. We have a very simple encoding that says we have that many ones, right? You store how many values you have and what the value is. So, you know, very small. If instead you have a lot of zeros and ones interleaved, because it's kind of random which values are null and which values are defined, then we use bit packing, to use only a single bit to store that information every time.
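A simplified Python model of the two modes. This is a sketch only: the tuple format, the function names, and the `MIN_RUN` threshold are assumptions for illustration and do not match the actual Parquet byte layout.

```python
from itertools import groupby

MIN_RUN = 8  # assumed threshold: shorter runs are bit-packed instead

def encode_levels(levels):
    """Encode 0/1 levels as ("rle", value, count) or ("packed", bits, count)."""
    out = []
    pending = []  # short runs waiting to be bit-packed together

    def flush():
        if pending:
            bits = 0
            for i, b in enumerate(pending):
                bits |= b << i          # one bit per level
            out.append(("packed", bits, len(pending)))
            pending.clear()

    for value, group in groupby(levels):
        count = len(list(group))
        if count >= MIN_RUN:
            flush()
            out.append(("rle", value, count))   # "that many ones"
        else:
            pending.extend([value] * count)
    flush()
    return out

def decode_levels(runs):
    """Expand the runs back into the original list of 0/1 levels."""
    levels = []
    for kind, payload, count in runs:
        if kind == "rle":
            levels.extend([payload] * count)
        else:
            levels.extend((payload >> i) & 1 for i in range(count))
    return levels
```

A long stretch of identical levels collapses to one small tuple, while a noisy stretch costs one bit per level.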

This is important because we'll come back to it when we look at how you integrate Parquet and Arrow to make a very fast reader by peeling away those levels of abstraction. So if we look at Arrow, it's slightly different in its representation. Like Parquet, it's well documented. The goal is to make a language-agnostic in-memory representation. Whether you access it from C++ or Java doesn't matter, and you can use it to communicate between a Java process and a C++ process. But it's designed to take advantage of modern CPUs and get the maximum throughput out of them. That's the kind of thing that you get through vectorized execution and using a columnar representation.

It's embeddable, so usually it's not visible to the end user; it's what the execution engine is using, and it's there to go faster and to be interoperable without losing speed. So it's embeddable and interoperable.

Like Parquet, it supports nested data structures, and we'll show an example of how that works. It maximizes CPU throughput, and there are three main properties here that it optimizes for. One is pipelining, because processors don't execute instructions one after the other anymore. They try to pipeline and to stagger the execution of instructions. SIMD is single instruction, multiple data. Modern processors have instructions that can execute on multiple values at a time. So you can say, "Do that same instruction on those four values in parallel," and you get four times the throughput when you do that. When the data is columnar and you put all of the values of the same type one after the other, it's really easy to use it in this manner and say, "Hey, look. Instead of doing a loop that's going to do one value at a time, it's going to do four values at a time, or eight values at a time, and go that much faster."

Cache locality is this other trick that processors use to go faster. Because the CPU can process data much faster than it can fetch it from main memory over the bus, it has some local memory inside the processor that's much faster, but much smaller. So every now and then it needs to fetch data from main memory to put it in the cache, and of course every time the CPU does that it has to wait for this to happen.

So you have this latency as the CPU waits, gets data, and keeps processing what's local. Because the data is columnar and you focus on one column at a time, you actually get much better cache locality. When the data is row-oriented, remember, you have the data for all the columns mixed together, and so you need to bring in all the data at once. But when you have a columnar representation and vectorized execution, you can focus on one column at a time, and you actually move a lot less data at the same time.

The last property is scatter-gather IO. This is the property that I alluded to earlier: Arrow's representation in memory and on the wire, to send over the network, is the same, and we'll see that in a subsequent slide. It's all relative offsets and there are no absolute pointers to calculate, so you can just take the buffers from memory directly to disk, or directly to the network, and back to memory without any transformation by the CPU, like you would have with Avro or Protocol Buffers, all those types of representation that require turning absolute pointers into a serialized representation.

So this is the CPU pipelining I was talking about. Modern CPUs split each instruction into multiple steps in a pipeline, so here imagine four steps in my pipeline. Ideally, you start executing the first instruction, and as soon as its first step is finished you start executing the second one, and so on, right?

So, as they progress through the pipeline, everything is happening almost in parallel, right? But if there's any dependency between those instructions, you actually cannot start the next instruction before you get the result of the previous one. For example, if there's branching, right? If you have an if statement, then depending on the result of the previous instruction, which is a test, you need to decide which instruction you execute next.

And so that introduces bubbles, right? Here the pipeline is four steps, but a modern pipeline is more like a dozen steps, and you lose that many cycles every time it happens. So that's where the CPU tries to be smart and has its branch prediction algorithm. If it doesn't know the result, it would have to wait anyway, so it's going to guess.

You have a branch, you have an if statement, and you do either that, or that, so it's going to pick one and start executing that, and so that's another property that will be important in the way you implement your reader to go faster.

So, this is the Arrow representation. If you have fixed-width values, it's simple: you just put them one after the other in a vector, so here for age, 18 and 37, you just put the two values one after the other. If it's variable-width values, like a name, you put all of the values one after the other, and you have an extra offset vector that points to the beginning of each value.

This is composable. For example, if I have a list of variable-width values here, I put all the values one after another, this value, this value, this value, and then there's an offset vector to point at the beginning of each value, like the first phone number, the second phone number, the third one, and so on. Then there's another offset vector to point at the beginning of each list. So it's composable, and that's how you deal with nested data structures in a simple way.
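The composition of offset vectors can be sketched in Python as follows. This is a simplified model with hypothetical function names: offsets are plain Python lists, and validity bitmaps are left out.

```python
def build_string_vector(strings):
    """Return (offsets, data): offsets[i]..offsets[i+1] delimits value i."""
    offsets = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)

def build_list_of_strings(lists):
    """Nest one more offset vector on top of a string vector."""
    list_offsets = [0]
    flat = []
    for items in lists:
        flat.extend(items)
        list_offsets.append(len(flat))
    value_offsets, data = build_string_vector(flat)
    return list_offsets, value_offsets, data

def get_list(list_offsets, value_offsets, data, i):
    """Look up list i using only relative offsets, no pointers."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return [data[value_offsets[j]:value_offsets[j + 1]].decode("utf-8")
            for j in range(start, end)]

# Made-up phone numbers: one list per person, the second person has none.
phones = [["555-1212", "555-3434"], [], ["555-9999"]]
list_offsets, value_offsets, data = build_list_of_strings(phones)
```

For this input, `list_offsets` is `[0, 2, 2, 3]` and `value_offsets` is `[0, 8, 16, 24]`; an empty list is simply two equal consecutive offsets.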

Then when it goes onto the network, you can just have a simple data header that points at each of the buffers, and then you put each vector one after the other. Because those offsets are all relative to the beginning of the buffer, it's all relocatable, so you can copy it to the network and back to memory and you don't need any transformation.

So, now let's take advantage of understanding how those things work to build a better reader from Parquet into Arrow. Take this simple example data. This would be the JSON representation: A, B, then a null value, then C, then another null value, and D.

Between Parquet and Arrow, the main difference is this. In Parquet, all the values are next to each other, and we encode and compress them together. Then we use the definition level, which for a flat representation is really as simple as 0 means null and 1 means defined, and we store that, and we try to be compact.

In Arrow, to enable random access, we actually leave empty slots for null values, right? The value is just undefined. There's a slot in this vector, and it's not used. We know, because there's a 0 here, that we shouldn't try to read it; it's null. That way we get constant-time access to any of the values, and you have a bit vector that says whether each value is defined or not.

So, a simple way to bring Parquet into Arrow is to iterate over all of the values, and if the definition level is 1, that means it's defined, and we set the value in the right slot. Then there's the bit vector, so every time we set a value we need to get the byte that contains that bit, figure out which bit to set, use a mask to set it, and then store the byte back. So that's how you set a single bit in a byte, right?
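The simple reader just described could look like this in Python. It is a sketch with a hypothetical function name, not the actual reader code; slots are a Python list and the validity bits use least-significant-bit-first numbering.

```python
def read_naive(definition_levels, parquet_values):
    """Fill Arrow-style slots one value at a time, branching on each level."""
    n = len(definition_levels)
    values = [None] * n                 # one slot per row, nulls left empty
    validity = bytearray((n + 7) // 8)  # one bit per row, 1 = defined
    next_value = 0                      # position in the dense Parquet values
    for i in range(n):
        if definition_levels[i] == 1:   # data-dependent branch
            values[i] = parquet_values[next_value]
            next_value += 1
            byte = validity[i // 8]     # get the byte holding bit i
            byte |= 1 << (i % 8)        # mask in the bit
            validity[i // 8] = byte     # store it back
    return values, bytes(validity)
```

With the example data, levels `[1, 1, 0, 1, 0, 1]` and dense values `["A", "B", "C", "D"]` give slots `["A", "B", None, "C", None, "D"]` and a validity byte of `0b00101011`.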

So there's some abstraction, and if you remember what I said about pipelining, right in the middle of that for loop we have a big data-dependent if. That's really bad for branch prediction and for pipelining, because the branch prediction will often be wrong, since it's kind of random where the nulls are, and so you are going to spend a lot of time just waiting for that pipeline to flush and start over.

It's actually a huge thing, right? Every time there's a misprediction you lose around ten cycles, so you can go ten times as slow just because of that. So, software is like ogres. It's like onions. It has layers, right? Layers of abstraction are good, because they help with understanding, but in this case we really want to peel away those abstraction layers and understand things better.

Instead of just iterating on the values, we need to understand how they are encoded, and that's why it was important to follow the other slides. I hope you did. Look, wait a minute, we actually have two cases, if you remember. One is bit-packed zeros and ones, where you need to figure out where to put those values, and the other one is run-length encoding.

So, let's look at the bit-packed case to start with. The main difference with this implementation, in two steps, is that it removes the if statement in the middle. First we look at the definition levels, remember they're 0 or 1, and figure out the index of each non-null value. You keep a simple running index for the current value. If the value is null, it's not put in a slot and there's not going to be a value in Parquet, so we can just add the definition level to the running index, and we just keep setting the index of the next non-null value.

So, the index of a non-null value we will maybe set more than once, but the advantage of this loop is that there's no if inside of it, right? It's just doing the same thing over and over, and so it's going to be very fast from the pipelining perspective. It's going to take advantage of the full throughput of the CPU. There's no if statement in there. It's a very tight loop.

The disadvantage is that we have to materialize those indices, but we're hoping that the cost of the extra instructions is negligible compared to the throughput gain of not having branch mispredictions. Then you can just decode the values and put each value at the right index, right? You figure out those indices first, and it's much faster.
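The two-step, branch-free structure can be sketched in Python as below. The function names are hypothetical, and Python itself will not show the pipelining benefit; the point is only the shape of the loop, which in native code has no data-dependent `if`.

```python
def value_indices(definition_levels):
    """Row index of each defined value, computed without an `if`."""
    indices = [0] * len(definition_levels)
    j = 0
    for i, level in enumerate(definition_levels):
        indices[j] = i      # overwritten later if row i turns out to be null
        j += level          # advances only when level == 1
    return indices[:j]

def scatter(definition_levels, parquet_values):
    """Second step: write each dense value into its precomputed slot."""
    indices = value_indices(definition_levels)
    slots = [None] * len(definition_levels)
    for k, i in enumerate(indices):
        slots[i] = parquet_values[k]
    return slots
```

For levels `[1, 1, 0, 1, 0, 1]`, `value_indices` returns `[0, 1, 3, 5]`; entries 2 and 3 of the index array are each written twice along the way, which is the "set more than once" cost mentioned above.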

The other case is run-length encoding. So, no branch. In this case it's actually even better: if you don't have any nulls in your data, then you can entirely remove the overhead of dealing with nulls, right? Because if we don't have any nulls, the encoding is going to say, hey look, here I have four ones, but it could be ten thousand ones, right?

So it's just always defined, and you can fall back to that case where, look, it's all defined, we don't deal with nulls anymore. We know all the values are next to each other, and we can just write them where they are. In the case where it's null, well, look, we have ten thousand null values, there's really nothing to do. We're just going to increment our index by ten thousand, and we're done.

That's instead of doing a loop that goes through the abstraction every time we read a value, an abstraction that returns 1 ten thousand times, and then 0 ten thousand times, for which we do nothing. So, just by peeling away that abstraction layer we can go much faster, and actually fall back to the case where we don't deal with nulls at all.
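A Python sketch of the run-length fast path, assuming the levels arrive as hypothetical `(level, count)` pairs rather than one level at a time. The only branch left is per run, not per value.

```python
def read_rle_runs(runs, parquet_values):
    """Fill slots run by run; `runs` is a list of (level, count) pairs."""
    total = sum(count for _, count in runs)
    slots = [None] * total
    row = 0     # next row in the output vector
    src = 0     # next value in the dense Parquet values
    for level, count in runs:
        if level == 1:
            # All defined: bulk copy, no per-value null handling.
            slots[row:row + count] = parquet_values[src:src + count]
            src += count
        # All null: nothing to write, just advance the row index.
        row += count
    return slots
```

A file with no nulls at all is a single `(1, n)` run, so the whole column becomes one bulk copy.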

So, that's for the simple reader. Now the next important part is predicate push down: how do you do that filter evaluation? Because you tightly integrate the top level, which is execution, with the bottom level, which is reading from the storage layer, you can actually be much faster.

So, take that example. Remember that Parquet file, and let's assume we have four columns in it: A, B, C, D. I've color coded it to make it a little easier. Let's say you select B, C, D, and you have a filter on A, right? So, you're going to keep the rows in B, C, D only when A matches the filter.

So, the simple version is: we have those pages that contain data, and we can decode the values, load them in vectors, and then apply the filter and keep what matches. Say the filter is where A equals 3. I have in bold here the rows that match this filter, and we keep only the values that we're interested in, right?

That's simple, but we can actually do better, because we understand this better. Again, we peel away the abstraction and we do what some systems call lazy evaluation. You can first evaluate the filter. So you take A, and actually each of those headers here, for each page, has stats. It knows the minimum value and the maximum value in this page.

So, since you want to filter on A equals 3, if the min and max tell you, say, that the min is 5, you know there's not going to be any matching value in there. You can skip reading that page. You can skip decompressing that page. You can skip decoding that page. Here I'm assuming I'm keeping only one page from the ones I have, based on those stats.
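A minimal Python sketch of page skipping on min/max statistics. The `Page` class and its fields are hypothetical stand-ins for a page header and its compressed body.

```python
from dataclasses import dataclass

@dataclass
class Page:
    min_value: int
    max_value: int
    values: list  # stands in for the compressed, encoded page body

def pages_to_read(pages, target):
    """Keep only pages whose [min, max] range can contain `target`."""
    return [p for p in pages if p.min_value <= target <= p.max_value]

# Made-up pages of column A, sorted so the ranges do not overlap.
pages = [
    Page(1, 4, [1, 3, 3, 4]),
    Page(5, 9, [5, 7, 9]),
    Page(10, 12, [10, 12]),
]
kept = pages_to_read(pages, 3)
```

For the filter A equals 3, only the first page survives; the other two are never read, decompressed, or decoded. Note that the stats can only rule pages out: a kept page may still contain no match.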

This helps especially if the data is sorted. If you started by sorting on column A, it's going to drop down to a lot fewer pages. Then we can decode the values that are left and evaluate the filter. And instead of evaluating the filter directly on fully decoded vectors, you can first evaluate the filter and store the deltas instead. For example, here the first value matches, so we have a 0 here, which is how many values we skipped to get there.

So, we keep the value and we store 0. Then we skip one row, right? This value doesn't match, so we put a 1 to say we skipped one row, and then we get the next one. So you have as many values in the deltas as rows that have matched, but you keep track of how many rows you skipped in between, right?

The advantage of that is you can take the other columns and reapply those deltas to select the values that match, right? Once we've evaluated the filter on the first column, we can select the values that match in the other columns. But again, because in each page we know how many values there are, if the deltas skip all the values in a page, then we can skip the entire page altogether, right?
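The delta idea can be sketched in Python like this. The function names and sample columns are made up, and the sketch works on already-decoded lists, so it shows the bookkeeping but not the page skipping.

```python
def filter_to_deltas(column, predicate):
    """For each matching row, record how many rows were skipped before it."""
    deltas = []
    skipped = 0
    for value in column:
        if predicate(value):
            deltas.append(skipped)
            skipped = 0
        else:
            skipped += 1
    return deltas

def apply_deltas(column, deltas):
    """Select the matching rows of another column by replaying the skips."""
    out = []
    position = 0
    for skip in deltas:
        position += skip            # jump over the rows that did not match
        out.append(column[position])
        position += 1
    return out

a = [3, 1, 3, 5, 5, 3]
b = ["p", "q", "r", "s", "t", "u"]
deltas = filter_to_deltas(a, lambda v: v == 3)
selected_b = apply_deltas(b, deltas)
```

Here `deltas` is `[0, 1, 2]` and `selected_b` is `["p", "r", "u"]`. With a fixed-width encoding, each skip could become a byte offset instead of a decode.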

We don't need to decode it. We don't need to decompress it. That's a lot of CPU saved. Also, depending on the encoding, if it's a fixed-width encoding, skipping values means that you don't have to decode intermediate values to get to the next one, right? You can just skip that number of bytes. So depending on the encoding you can take a lot of advantage of this.

So, first you skip the pages that didn't have anything anyway. Here we skipped the second page entirely, because there was nothing after a certain row number. We can skip all of the pages that only have rows after the last one that matched, and then when we decode we just apply the deltas to get the next value.

So, you go much faster. You skip a lot of decompressing and decoding just by understanding the abstraction layers and going through them. All this is not yet open source. It's work that we've been doing, and it was supposed to be open source for this presentation, but it's going to be over the summer, so you're going to see that soon.

Now I'm going to talk about the roadmap: things whose design is in progress, and people are welcome to join. One thing that Arrow brings, and that this first part was about, is speed, right? With vectorized execution we can do things much faster, and we can have this vectorized reader that's going to be standard for a bunch of those projects.

Now the second thing is also to make it a standard, right? For example, there's work happening in Spark that is making the PySpark integration faster. There's a lot of overhead in the existing Spark in converting between the JVM-based representation that Spark uses and the Python representation that you work on to write UDFs.

This adds a lot of overhead. Switching to an Arrow-based representation makes it much faster, because now there's no CPU spent serializing and deserializing. The representation is the same, so you can just write Arrow on the Spark side and send the record batch over to the Python process. One is running on the JVM, the other one is a native process. On the Python side, a lot of work has been done in pandas to work directly on top of Arrow, so the data doesn't need to be converted to Python objects like the existing Spark does. It's much more efficient.

So in this picture you can imagine a SQL engine. It could be Spark SQL; for now there is the Spark integration, but in the future you'll have Impala, Drill, and Presto integrations. The next step is to use shared memory to also remove the cost of copying those buffers. We don't have the cost of serializing and deserializing, because it's the same representation, but we're still copying the buffers for now.

So, in the future, you can use shared memory, and because it's the same representation you can just have a read-only pointer to this, read your input across processes, write your output, and then share it back to the SQL engine. So, let's say, you could be using Drill with the Python UDFs you have.

The advantage of this is that, because the representation is standard, your library of user-defined functions is now portable across SQL engines. You don't have to write UDFs for Hive, UDFs for Impala, UDFs for Drill, or for Spark SQL. Maybe you like Python, maybe you like Spark; now all of those UDFs are portable across engines, and there's no conversion cost. Today a lot of work is spent on this: when Impala calls UDFs defined in Java, they spend a lot of time making sure that they don't pay too much for the conversion between native code and Java code. So, this standardizes it, and we can combine all of those efforts and do it once, do it well, and do it very efficiently.

So, that's one example of communication, using user-defined functions, and the other one is interprocess RPC communication, right? You can define a generic way of retrieving data in Arrow format and a generic way to serve data, and simplify integration across the ecosystem. It's really just an Arrow pipe, and we can send data from one process to another in streaming fashion or in batches, and that would work.

So, one example is how you would retrieve data in parallel from a data storage layer. Imagine you can have either Parquet files on HDFS, or HBase, Cassandra, Kudu, or whatever, and the main thing is to be able to read this data in parallel, right? Each has its own storage. Kudu is columnar and Parquet is columnar, so Kudu can do the same fast read directly into columns, which can be really fast because it vectorizes one column at a time. Then it serves the data directly to the next layer, already in the in-memory representation. So you remove all the overhead of converting.

Today, Kudu serves a row-oriented representation. It has a columnar representation internally, and then it assembles it into rows, so there's some cost to that, because most interfaces in use today are row-oriented. Then a query engine like Drill, for example, will turn it back into columns in memory for its fast execution. So, it's kind of nonsensical. It doesn't make any sense that we turn columns into rows and back into columns just because that's the common way to do it today, right? To really solve it, all we need is to move to that standard. So that's this effort to have a standard columnar representation to avoid this.

Similarly, you can do an Arrow-based cache, and this is how you would do query execution. Drill, for example, does something similar with Arrow-based execution for reading Parquet. You have this part, which is the vectorized read done in the fast way I was presenting in the first part. Then you do partial aggregation, based on how many machines you have in the second stage that do the final aggregation, so you prepare the input for each machine of the next step, and then you send it to each machine based on consistent hashing. Then the final aggregation is done here, and you send the final result to the user.
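A rough Python sketch of the partial-then-final aggregation flow. All names here are hypothetical, the aggregation is a simple sum per key, and plain modulo hashing on a CRC32 stands in for the consistent hashing mentioned in the talk.

```python
import zlib
from collections import Counter

def partial_aggregate(rows):
    """Sum values per key on the machine that read the data."""
    totals = Counter()
    for key, value in rows:
        totals[key] += value
    return totals

def partition(totals, num_machines):
    """Route each key to one final-aggregation machine by hashing it."""
    outputs = [Counter() for _ in range(num_machines)]
    for key, value in totals.items():
        target = zlib.crc32(key.encode("utf-8")) % num_machines
        outputs[target][key] = value
    return outputs

def final_aggregate(partitions_for_one_machine):
    """Merge the partial sums that were routed to one machine."""
    total = Counter()
    for part in partitions_for_one_machine:
        total.update(part)   # Counter.update adds counts together
    return total

# Two hypothetical readers, two final-aggregation machines.
reader1 = [("a", 1), ("b", 2), ("a", 3)]
reader2 = [("b", 4), ("c", 5)]
parts1 = partition(partial_aggregate(reader1), 2)
parts2 = partition(partial_aggregate(reader2), 2)
finals = [final_aggregate([parts1[m], parts2[m]]) for m in range(2)]
```

Because every reader hashes keys the same way, each key ends up on exactly one final machine, so the per-machine results can be concatenated without further merging.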

So, you can set all of those up as streams of data that are sent from one machine to the other using the Arrow representation. If you're interested in the current results, there are a lot of performance results. Wes McKinney wrote a lot of those blog posts. He's been investing a lot on the Python side; he's the creator of pandas. So, you have a lot of numbers there on the performance gained from that.

We already have a lot of language bindings. Parquet works in Java and C++, and there's a Python pandas integration that also combines with Arrow. There are many query engine integrations today; it is faster to list the ones that do not support Parquet. Arrow is much younger, but it already has a lot of language bindings. Java and C++ are the main ones. On top of the C++ one you have Python, R, and Ruby, there's a plain JavaScript implementation, and engine integration has started, so it makes sense to keep going with the integration.

To give a sense of the activity: we started a little less than a year ago. At the beginning there was agreeing on the metadata and the format, and you can see that we have been trying to accelerate. The orange is the number of days between releases. We want to release more and more often, so now we're trying to do it once a month. There have been a lot of changes, a lot of activity, and it's settling. Now the format is pretty much finalized and we're getting to that phase where we want to integrate with a lot of things. There will be small changes more often, and all of those integrations are happening in parallel.

On current activity: if you are interested, you can follow the Spark integration, and there are going to be multiple follow-up steps to integrate it deeper and deeper. There are a lot of discussions currently happening about adding a page index in the Parquet footer to make those pushdowns more efficient, right? We want to be able to just skip reading pages altogether. We want to do as much filter evaluation as possible on the statistics, the min and max for each page, so we can skip reading a lot of stuff, and then do a deep integration with the filter evaluation so that you can make it very efficient. If you are filtering a lot of stuff and your data is sorted, for example, then you're going to be that much faster because you're going to be able to pinpoint exactly where in the files you need to read.
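
To make the page-skipping idea concrete, here is a small sketch of filtering against per-page min/max statistics. The `PageStats` class and the function names are hypothetical and do not mirror the actual Parquet metadata structures; a Python list stands in for a column chunk and the range predicate is an assumed example.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical index entry: where a page sits and its value range."""
    first_row: int
    num_rows: int
    min_value: int
    max_value: int


def build_page_index(values, page_size):
    """Split a column into fixed-size pages and record min/max for each."""
    index = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        index.append(PageStats(start, len(page), min(page), max(page)))
    return index


def pages_to_read(index, lo, hi):
    """Keep only pages whose [min, max] range overlaps the filter [lo, hi]."""
    return [p for p in index if p.max_value >= lo and p.min_value <= hi]


def scan(values, index, lo, hi):
    """Evaluate lo <= v <= hi, touching only the pages that might match."""
    matches = []
    for p in pages_to_read(index, lo, hi):
        page = values[p.first_row:p.first_row + p.num_rows]
        matches.extend(v for v in page if lo <= v <= hi)
    return matches


# Sorted data: the statistics pinpoint a single page out of ten.
column = list(range(1000))
page_index = build_page_index(column, page_size=100)
selected = pages_to_read(page_index, 250, 260)
result = scan(column, page_index, 250, 260)
```

On sorted data the page ranges do not overlap, which is why a narrow filter can be answered from one page; on unsorted data most pages would span the whole value range and little could be skipped.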

The Arrow REST API is being defined, so if you want to chime in on this ticket: all the metadata is defined, it's just about how we define the encapsulation for those messages. The bindings are making progress as well. There are some people who are very interested in having a JavaScript implementation so that you can have a back end that serves Arrow and do a lot of computation in the browser. I've seen demos where people work with tens of millions of rows of values in the browser and do computation on them. The only thing is that you don't want to represent it as JavaScript objects, because you blow up your memory. You want a much better representation like Arrow, and then you can actually do a lot of visualization and a lot of fast computation because of that.
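
The memory point can be illustrated with a rough analogy in Python rather than JavaScript: one dict per row versus one packed array per column. The helper names are hypothetical, the standard-library `array` type stands in for an Arrow buffer, and the object-side figure relies on `sys.getsizeof`, so it is an estimate that varies by interpreter.

```python
import sys
from array import array


def object_bytes(rows):
    """Estimate memory for a list of row dicts, including boxed values."""
    total = sys.getsizeof(rows)
    for row in rows:
        total += sys.getsizeof(row)
        total += sum(sys.getsizeof(v) for v in row.values())
    return total


def columnar_bytes(columns):
    """Memory for the packed value buffers: item size times item count."""
    return sum(col.itemsize * len(col) for col in columns.values())


n = 10_000
# Row-of-objects layout: every row and every value is a separate object.
rows = [{"x": float(i), "y": float(i) * 2.0} for i in range(n)]

# Columnar layout: one contiguous buffer of 8-byte doubles per column.
columns = {
    "x": array("d", (row["x"] for row in rows)),
    "y": array("d", (row["y"] for row in rows)),
}

per_object = object_bytes(rows)
per_column = columnar_bytes(columns)
```

The columnar side costs exactly 8 bytes per value, while the object side pays for a container and a boxed number on every row, which is the blow-up described above.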

So, if you want to get involved, here are the mailing lists. It's an Apache project. Contributions are always welcome. They are not required. You are welcome if you have questions, if you have suggestions, or if you want to contribute. We have a Slack channel for more interactive discussions. We have mailing lists if you have questions or if you want to be oriented. Contributions are welcome; don't feel like you don't know enough, or wonder whether you are allowed to contribute. Of course you are allowed to contribute, so don't hesitate if you want to help with something. Like I was saying, we're in that phase where integration means multiple things can happen in parallel. So if you have a pet project and you think it's moving too slowly on that end, and you want to help with this, you are very welcome.

That's it. I think... do we have time for questions?

Julien Le Dem:

Yes, so usually in Parquet and Arrow there is schema evolution. Arrow has a special type for this: a union type, to deal with the case where, you know, your type used to be a string and now it's an int. The schema type definition supports that, and you can do things like that. Sorry, that's the last question. I'm going to be outside to answer more questions if you want, but I have to leave the room for the next speaker.