
Columnar Data: Apache Arrow and Parquet

Transcript

Jeff:

Column oriented data storage allows us to access all of the entries in the database column quickly and efficiently. Columnar storage formats are mostly relevant today for performing large analytics jobs. For example, if you are a bank and you want to get the sum of all the financial transaction values that took place in your system in the last week, you don't want to iterate through every row in a database of transactions. It's much more efficient to just look at the column for the amount of money and ignore things like time stamp and user id. You don't want to look at every single aspect of a row. You just want everything in a single column.

Julien Le Dem co-created Parquet, a file format for storing columnar data on disk. Jacques Nadeau is the VP of Apache Arrow, a format for in-memory columnar representation, and they're both part of Dremio. They join the show to talk about how columnar data is stored, how it's processed, and how it's shared between systems like Spark, Hadoop, and Python. This is a topic that is only going to grow in importance in the near future as data engineering becomes a bigger and bigger component of a software company's stack. And I had a great time talking to these guys. So I hope you enjoy it too.

Jeff:

To understand how your application is performing, you need visibility into your database. VividCortex provides database monitoring for MySQL, PostgreSQL, Redis, MongoDB, and Amazon Aurora. Database uptime, efficiency, and performance can all be measured using VividCortex. Don't let your database be a black box. Drill down into the metrics of your database with one-second granularity.

Database monitoring allows engineering teams to solve problems faster and ship better code. VividCortex uses patented algorithms to analyze and surface relevant insights, so users can be proactive and fix performance problems before customers are impacted. If you have a database that you would like to monitor more closely, check out vividcortex.com/sedaily to learn more. GitHub, DigitalOcean, and Yelp all use VividCortex to understand database performance. At vividcortex.com/sedaily, you can learn more about how VividCortex works. Thanks to VividCortex for being a new sponsor of Software Engineering Daily, and check it out at vividcortex.com/sedaily.

It's a pleasure to have you on board as a new sponsor.

Jeff:

Julien Le Dem and Jacques Nadeau are with Dremio. Julien and Jacques, welcome to Software Engineering Daily.

Julien:

Thank you Jeff.

Jeff:

Today we're going to talk about the columnar data format, the Apache ecosystem, how these things are evolving. Let's start off by describing the columnar format and how it differs from other formats, why this is an important topic. Julien, could you start us off with that.

Julien:

So, traditionally, data processing started out row-oriented, and it's just a more natural way of treating data. But as processors have evolved and become more sophisticated, a column-oriented representation is a much more efficient way of representing the data for processing. And this is because of multiple optimizations that processors are doing under the hood.

Jeff:

And so in contrast to the row oriented set up, like if you think of a row, it's maybe got ... a row for a user maybe has user id and user name and user purchase id, or something like that. But if you're looking at the columnar format, you're looking at an entire column in a database. So you might only be looking at all of the entries of user id, but nothing else. Is that accurate?

Julien:

So the row-oriented layout is the natural one, like you were saying. If you have three columns and they are of different types, maybe a string, an integer, a date, then when you represent it, you're going to interleave data of different types. And that has drawbacks for how efficiently you can process it. When you do a columnar representation ... so the way we visualize data is two-dimensional, we visualize rows and columns. But when we represent it in memory or on disk, we have to flatten it into a linear representation. So you either write one row at a time or one column at a time. In the row-oriented layout the data is interleaved: you have interleaved data of different types, one row at a time. But in columnar, you put all the values for a given column together. And the advantage of this is that, first, you'll be encoding data of the same type, so you can do several optimizations based on that.

So in Parquet for example, when you encode multiple integers together, you can use tricks like run-length encoding, or bit packing, because you are collecting things of the same type and you can encode them together in a more efficient format.
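To illustrate the intuition, here is a minimal sketch of those two tricks in plain Python. This is a toy illustration of the idea, not Parquet's actual encoding format, and the values are made up.

```python
# A column of integers of the same type compresses well with
# run-length encoding, and small values can be bit-packed into
# fewer bits each.
values = [7, 7, 7, 7, 3, 3, 9]

def run_length_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([v, 1])   # start a new run
    return runs

print(run_length_encode(values))  # [[7, 4], [3, 2], [9, 1]]

# Bit packing: every value in this column fits in 4 bits instead of
# 32, so the whole column can be stored in far fewer bytes.
bits_needed = max(v.bit_length() for v in values)  # 4
```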

Jeff:

So, before we get into the specific systems like Parquet and Arrow, let's talk about the broader ecosystem a little bit more. So there's this problem in the open source big data ecosystem, where you have all these tools. You have Spark, Hadoop, Cassandra, all these different systems. And they need to share data with each other. So how does data sharing typically work among these systems? Or how has it worked in the past?

Julien:

I can use one example that currently has been a problem, or is not working as efficiently as it could. In data science, Python has been one of the popular tools for analyzing data. And you have tools like NumPy and Pandas to do fast in-memory processing in Python. And people are very happy with it. And it works well, as long as the data fits in memory on one machine. And then, a popular way of working on bigger data and splitting it across multiple machines is to use Python with PySpark. In this use case, you always have a drop in performance due to distributed processing, because you have the overhead of communicating from node to node in a distributed fashion and then joining parallel results together. But you also get the extra overhead of serialization and deserialization over the wire.

And one other problem is that Spark runs on the JVM, it's a Java-based process, while Python is native code running in a separate process, so they come from two different ecosystems. And the serialization and deserialization of objects on the wire ends up being a very inefficient mechanism to communicate.

So the main goal with Arrow is to have this common representation so that we remove the serialization and deserialization costs entirely, because the in-memory representation becomes the working format. And so we can use tricks like shared memory and make all that interaction much more efficient.

Jeff:

So we're talking about this difficulty in data sharing. We're also talking about this columnar data format. What's the connection between these two issues? Like help me understand why we need to frame the problem, before we get deeply into Arrow and Parquet, help me frame the problem a little bit better, where we're talking about both the issue of data sharing and the issue of the columnar data format.

Julien:

Yes. So those two problems are a little bit intertwined. Arrow is interesting because it's both an efficient representation for data processing and a standard. Because if you have only an efficient format, but you actually use two different efficient formats between two systems, then the interaction of the two will have to translate from one to the other. And that's where you get serialization and deserialization costs.

If you use a common format between things, but it's not a format that's efficient for your internal representation, then you still need to translate between the communication format and your internal representation. So the goal of Arrow here is to be both an efficient internal representation for data processing, and we'll go into the details later of why columnar is more efficient, and an efficient, standard representation, so that data can be passed from one system to the other without any kind of translation.

Jeff:

The performance bottleneck that you get without a system like Arrow or a system like Parquet is serialization and deserialization, because you have to serialize the data into a generally readable format and then pass it to another system, like you're passing it from Python to Spark. But with Arrow, you can just put it in this one format in a location in memory somewhere from Python, and then Spark can consume it in the same format, so you don't have that serialization/deserialization cost.

Julien:

Yes, that's correct.
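As a concrete sketch of that hand-off on the Python side, here is what the pandas-to-Arrow round trip looks like with PyArrow. The DataFrame contents are made up, and an actual Spark exchange goes through Spark's own Arrow integration rather than this exact call, so treat this as an illustration of the shared columnar format rather than a full pipeline.

```python
import pandas as pd
import pyarrow as pa

# A hypothetical DataFrame produced on the Python side.
df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.5, 3.2, 7.8]})

# Convert it to an Arrow table: the data now lives in Arrow's columnar
# buffers, the same layout any Arrow-aware engine can consume, so no
# row-by-row serialization format sits in between.
table = pa.Table.from_pandas(df)

# Going back to pandas reads those same columnar buffers.
df_again = table.to_pandas()
```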

Jeff:

So what are the other performance penalties that exist in this data sharing ecosystem? Is it just that serialization de-serialization thing, or is there anything else?

Julien:

Well, the main goal of moving to columnar is to take better advantage of modern CPUs. Things like taking better advantage of pipelining in the CPU, of cache locality, and of SIMD instructions, which is single instruction, multiple data. Those work faster when you use a columnar execution scheme, vectorized execution, which depends on a columnar representation in memory.
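A small sketch of the difference, using NumPy as a stand-in for a vectorized engine (the array and its size are arbitrary):

```python
import numpy as np

# A column of one million transaction amounts stored contiguously.
amounts = np.random.rand(1_000_000)

# Row-at-a-time style: a Python-level loop touching one value per
# iteration, with interpreter overhead on every step.
total_loop = 0.0
for v in amounts:
    total_loop += v

# Column-at-a-time style: one call over the whole column, executed as a
# tight native loop that the CPU can pipeline, keep in cache, and
# vectorize with SIMD instructions.
total_vec = amounts.sum()
```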

Jeff:

Is the majority of data analytics work that's done with these big data tools like Spark, is it primarily operating on columnar data? Is that the ideal circumstance, or are there a lot of operations where you still want to do operations over rows?

Julien:

Most operations you'd want to do on columns. And a lot of those existing big data systems either already have partially vectorized execution or they're moving towards it: Spark SQL, Drill, Impala has been planning to move to vectorized execution. They all partially do it. The thing is, there's some complexity in moving from row-oriented to column-oriented execution, because you have to flip around the way you execute things, like a transposition of the system. But this is the natural way things are going, because it goes much faster.

Jeff:

And maybe for people who are less familiar with this area, explain that in terms of an example. What would be an operation that would be useful to perform on columnar data? And why is analytics moving towards this columnar analytics fashion?

Julien:

So there are multiple ... so if we take the example of evaluating an expression. Maybe you're doing A plus B plus C divided by something, and you're evaluating that expression. In a typical Volcano execution, and that's based on the Volcano paper, one of the initial SQL execution papers, which is more row-oriented, you would take one row at a time and it would go through the expression evaluation, whatever this expression is, one row at a time. And as part of this, there are a lot of virtual calls. You do a call of eval on this expression that will evaluate its sub-expressions. And this ends up being jumps in the evaluation of the code. And the reason I'm talking about that is because modern CPUs ... sorry. Let me take a step back before-

Jeff:

Okay.

Julien:

I think I was taking that the wrong way. So let me step back.

So, when we evaluate an expression, we're going to go through a bunch of code paths. And modern CPUs are not executing one instruction after the other anymore. The first processors that were created would evaluate one instruction after the other. It was simple. You have your program and it's just going to execute each instruction after the other.

After a while came this notion of pipelining, which is an optimization that CPUs do, which is starting to execute the next instruction before the previous one is finished. So each instruction is split into a pipeline of steps. And modern CPUs have around 12 steps, that order of magnitude. A pipeline is actually a lot of steps, like a dozen steps. And the CPU starts executing the next instruction before the previous one is finished, so that the overall throughput of the CPU is higher. And in that mode, there's this trick where the CPU will try to predict what the next instruction is. So whenever you have a jump in your code, which can be an if statement, a loop, or a virtual function call, there ends up being a jump that is based on a decision. The result of an instruction will decide what is the next instruction to execute.

And so whenever there is one of those jumps, the processor is going to try to predict what the next instruction is, because it's trying to start executing that next instruction before it has the result of the previous one that determines which comes next, which is the result of that "if": whether we're doing another iteration of the loop, or the loop is finished and we're going to the next instruction.

And so because of that, the columnar execution goes faster, because it makes things that are very easy to predict for the processor.
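To make the contrast concrete, here is a toy sketch of row-at-a-time expression evaluation with virtual calls versus column-at-a-time evaluation. The class and column names are made up for illustration; this is the shape of the idea, not any engine's actual code.

```python
import numpy as np

# Volcano-style, row at a time: every row walks an expression tree of
# small eval() calls, so the CPU constantly jumps between functions and
# data-dependent branches that are hard to predict.
class Col:
    def __init__(self, name):
        self.name = name
    def eval(self, row):
        return row[self.name]

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

expr = Add(Col("a"), Add(Col("b"), Col("c")))   # a + (b + c)
rows = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]
row_at_a_time = [expr.eval(r) for r in rows]    # [6, 15]

# Vectorized, column at a time: one tight loop per operator over whole
# columns, which the CPU can pipeline and predict easily.
a = np.array([1, 4])
b = np.array([2, 5])
c = np.array([3, 6])
column_at_a_time = a + b + c                    # array([ 6, 15])
```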

Jeff:

Incapsula is a cloud service that protects applications from attackers and improves performance. Incapsula sits between customer requests and your servers, and it filters traffic, preventing it from ever reaching your servers. Botnets and denial-of-service attacks are recognized by Incapsula and blocked. This protects your API servers and your microservices from responding to unwanted requests. To try Incapsula, go to incapsula.com/sedaily and get a month free for Software Engineering Daily listeners.

Incapsula's API gives you control over the security and performance of your application, whether you have a complex microservices architecture or a WordPress site. Incapsula has a global network of 30 data centers that optimize routing and cache content. The same network of data centers that is filtering your content for attackers is operating as a CDN and speeding up your application. To try Incapsula today, go to incapsula.com/sedaily and check it out. Thanks to Incapsula for being a new sponsor of Software Engineering Daily.

Jeff:

Like, you're a company like Spotify, for example. Spotify has all these analytics jobs that they want to run over all of their data. For example, take all of the lengths of time that anybody listened to a song across Spotify and perform analytics and find the average length of time that somebody spent listening to a song over the previous day. In order to perform an operation like that, all that you need would be a single column. Just the column of how long a person spent listening to a song. So I'm just trying to emphasize, give people a better picture of this focus on columnar data, because we have all these types of analytics jobs on the backend where you don't want to have to force your query engine or your database to look at every single user row and just pluck out the relevant column in that row. You want to be able to say, "You know what, just give me this entire column of data. Just give me this one specific aspect of the entire database table so that I can perform operations across that entire column."

Jacques:

That's a good point. I think that one of the things that we may have not entirely talked about is that really there are these two classes of columnar. We just say columnar, but we're talking about it in two different contexts. So there's columnar on disk, which is a way to structure data on disk to make it efficient to pull data off of disks for analytical purposes. And then there's columnar in memory.

And so, columnar on disk, one of the really key things about that is this situation, where you might have 100 different columns in a particular table. And I think this is exactly what you were just describing. You have 100 different columns in a table, but you really just want to look at group ... look at sales by region. And so, really only two columns, which is region and the amount of sales.

And so, if you're row-wise on disk, it's going to be very expensive, because you're going to have to read all the other columns, the other 98 columns, in addition to the two that you're interested in. And so, columnar formats on disk have become very prevalent with Parquet, sort of absolutely leading the space there. And Julien, who's on the chat here, being the creator of that. And so that allows you to use your disk efficiently and only pull the things out that are relevant. And also when you're on disk, you have a bunch of benefits around putting things that are like each other next to each other, and so you can do a lot of really advanced compression.
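For the on-disk case Jacques describes, here is a minimal sketch of column projection with PyArrow's Parquet reader, assuming a hypothetical file and column names:

```python
import pyarrow.parquet as pq

# A hypothetical file with ~100 columns; only the two needed for the
# aggregation are read off disk, the rest are never touched.
table = pq.read_table("sales.parquet", columns=["region", "amount"])

# Group and sum once the projected columns are in memory.
totals = table.to_pandas().groupby("region")["amount"].sum()
```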

And so then in the second category is this columnar memory stuff, or columnar in memory stuff, like the Arrow stuff, where it's about, again, trying to make things efficient, but the concerns that you have in memory are different than the concerns you have on disk. And so in most cases, you'll be taking columnar data off of disk, moving that into memory. And you already get to the point where you're only interacting with data that is relevant to you. And from there, it's about how fast we can make, how efficient we can make the processing when the data is in memory, and how we can make it so that we can move that data around between all the different sort of loosely coupled analytics systems that you use today, make that movement very, very easy and cheap.

Jeff:

Okay. That was really useful. Because I think this might start to paint a bigger picture for people, like your big data analytics stack, probably looks like you have HDFS, which is the Hadoop file system and you're storing the file ... you have in that file system, you have Parquet files. And Parquet is this, like you said, this format for storing columnar data on disk. And I was looking at the Parquet format. I didn't go into the weeds very much when I was researching for the show, but it was pretty interesting. It's like ... I'm sure you're familiar with this Julien and probably you too Jacques, but it's this tree format basically. You've got this nested tree, where the nodes that are above the leaves, so all of the nodes that are higher than the leaves, describe the schema of the columns of your table. And then the leaves of the tree are the actual data.

And this is pretty cool, because it lets you iterate through the tree until you get to the specific column that you want to access. Is that an accurate description?

Julien:

Yeah. That's accurate. So that's the part of the format that's based on the Dremel paper from Google. The Dremel paper has two main contributions. One is the distributed execution of queries. And the other one is this columnar representation, which enables storing a nested data structure in a flat columnar representation. So the schema description supports structures and lists, which they call repeated fields and groups. And the trick is, you have a nested, repeated structure and you want to represent that in a flat columnar representation.

So to do that, the general idea ... if you use a flat schema representation, you can use a bit, a zero or a one, to represent whether a value is null or not. So you're going to have a columnar representation where you put all the non-null values together. And then, next to it, you will have a bit field that will tell you for each value whether it's null or not. The Dremel columnar representation expands on that notion. So you could imagine, if you had a nested data structure where you have multiple levels in your schema, instead of storing zero or one to say whether the value is null, you will store a small integer that says at which level it's null. So when you access a field, if it's null at the first level, then you store zero. If it's null at the second level, you store one. And if it's defined, then it will be the depth of the schema, which is what makes it defined.

So in a simple case, zero means it's not defined and one means it's defined. In the more general case of a nested data structure, anything that is smaller than the depth of the schema means it's not defined, and it tells you at which level it's null. And the maximum value means that it is actually defined.
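Here is a toy sketch of those definition levels for a field nested under two optional levels. The record shape and field names are invented for illustration; Parquet computes these levels internally rather than exposing a function like this.

```python
# Field path: address.zip, two optional levels deep.
# Level 0: address is null; level 1: address present but zip is null;
# level 2 (the depth of the schema): zip is defined.
records = [
    {"address": None},
    {"address": {"zip": None}},
    {"address": {"zip": "94107"}},
]

def definition_level(record):
    if record["address"] is None:
        return 0
    if record["address"]["zip"] is None:
        return 1
    return 2

levels = [definition_level(r) for r in records]   # [0, 1, 2]

# Only the defined values are stored in the column itself.
values = [r["address"]["zip"] for r in records
          if r["address"] is not None and r["address"]["zip"] is not None]
# values == ["94107"]
```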

Jeff:

Okay. So now you've given a brief overview of the Parquet storage format. Let's go into Arrow a little bit, and then we'll talk about pulling data from Parquet into Arrow. So Apache Arrow is this project that focuses on columnar in-memory analytics. And you're often pulling your data from Parquet files on HDFS into an Arrow format in memory. And then you can perform different operations on that in-memory representation, whether you're performing it from Pandas in ... I'm sorry, Pandas in Python, or if you're performing operations on Spark using Java, they can all operate on this one columnar in-memory analytics format.

So, give us a little bit more of a picture of what problems Arrow is solving for the in-memory analytics format.

Julien:

So the Parquet and the Arrow nested representations are slightly different, because they optimize for different trade-offs. Parquet is more for persistence, as Jacques described earlier. When you query data, you usually select a subset of the columns, and you want to access those columns fast and make sure IO is reduced to a minimum. So you access only the columns you need, and those columns get compressed using data-aware encodings.

In Arrow, you will optimize more for CPU throughput. So the cost of IO to the main memory is much smaller than the cost of IO to disk. And so here, we'll optimize more towards CPU throughput. Which is why in the in memory representation in Arrow, we keep empty slots for null values, so that we can get faster access by index on things. And you have other advantages like random access to data.

So in Parquet you have to iterate through all the values, once you've selected the subset you want. And in Arrow you can randomly access each individual value in memory by index.
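A short PyArrow sketch of that layout: nulls keep their slot, so position-based access works directly (the example values are arbitrary).

```python
import pyarrow as pa

# An Arrow array keeps a contiguous values buffer plus a validity bitmap,
# so a null still occupies a slot and any element can be fetched by index.
arr = pa.array([10, None, 30])

print(arr[2])          # 30 -- random access by position
print(arr.null_count)  # 1
```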

Jeff:

Now that you've painted this picture of these two different systems and how they are optimizing for different things, what happens in the pipeline when you're pulling data off of a Parquet file and pulling it into Arrow.

Julien:

So when you read from Parquet and pull it into Arrow, we need to convert the definition-level representation, which is the one I was describing earlier, defined by the Dremel paper. We need to convert from those definition levels to indices of where to write things: what index to write each value at in memory in the Arrow vector. This can be done pretty efficiently, as basically it's converting a marker of defined and undefined into indices. And this can be done very efficiently in a vectorized manner. And we also need to decode the values from the Parquet representation into the more bare-bones, CPU-efficient Arrow vectors.

And this again can be done very efficiently, driven by what values are defined. Whether we have very few nulls, or all the data is defined, or there are nulls throughout the data, we can switch to one strategy or another to decode this data very efficiently into Arrow.

Jeff:

We had a show that was entirely about Apache Arrow. But I want to give people a picture of how Arrow affects the overall ecosystem, because I get the sense that it's quite an important project, given how much the serialization/deserialization challenges of interoperability between different systems slow down analytics. Maybe you could tell me whether that's accurate and how Arrow affects the overall ecosystem.

Julien:

So, maybe we can use an example for that.

Jeff:

Okay.

Julien:

If for example you have a storage layer which is using Parquet, or Kudu, which is a columnar database open source project. They both have a columnar representation on disk. And then when you query that data, they both support projection pushdown, which is selecting only a subset of the columns, and predicate pushdown, which is pushing the filter down to the storage layer. And they both are able to preprocess the data to reduce how much you read from disk, and do as much as they can at the storage layer before pushing the data to the engine, which is going to evaluate the query, evaluate expressions, and so on.

And they both can do that, but they will each provide an API in a certain representation. And the default API is a row-oriented one. Because most of the existing systems, whether you do plain Spark jobs or Pig jobs, or things like Hive, naturally were row-oriented. If you write a Cascading job or a Spark job, which are the typical tools for doing distributed processing, they all use a row-oriented API by default. So those default APIs are row-oriented. But if you use a vectorized execution engine, like Spark SQL is starting to do, like Drill is doing, then you end up reading from a columnar representation, converting into a row-oriented representation for interoperability, and then converting back to a columnar in-memory representation for fast vectorized execution.

So there's a lot going on when each system is actually columnar. Having Arrow in the middle means that when a storage layer like Kudu or Parquet reads the data into Arrow directly, it produces the working representation for the execution engine directly, and it's going to be much faster to send it over and process it. So it skips entirely the conversion from columnar to row-oriented and then from row-oriented back to columnar. You remove that overhead completely and get a much faster execution, when each storage layer can produce columnar data in Arrow and the execution engine can use that directly for processing.

Jeff:

Is this what you call Arrow based storage interchange?

Julien:

Exactly.

Jeff:

Okay.

Julien:

So that's the Arrow-based storage interchange, and things like Kudu and Parquet can use it. And you can also use it for data caching, directly in a format that is used for processing.

Jacques:

So one of the things we've seen ... and this kind of came out of this, this is a classic case of a good place for people to collaborate in open source, is that we saw people interacting with these different systems. And if you want to read Parquet or some of these other technologies, the traditional way is that each of the systems builds its own driver library, some way to connect to that source. But what happens is, if you want to do processing that is as high-speed as possible, people start to have to build their own custom interfaces into these different types of storage to get the highest level of performance. And the reason, at the end of the day, is that most of these systems expose a row-level API, which is kind of sucking through a straw, if you will.

And so, one of the opportunities with Arrow is that rather than everybody having to write their own way to integrate with all these different technologies, there's actually a common format which is high enough performance that they don't need to write custom code to interact with all these lower-level things. And so, we're actually seeing this start to happen, which is that we saw activity in the Parquet community to build an Arrow reader which reads from Parquet into the Arrow representation, and now the Pandas community is working on integration to expose those Arrow structures to Pandas, and by transitivity also Parquet access for Pandas.
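That Parquet-to-Arrow-to-Pandas path Jacques mentions is what the PyArrow reader exposes; a minimal sketch, with a hypothetical file name:

```python
import pyarrow.parquet as pq

# The Parquet file is decoded into Arrow's columnar in-memory representation...
table = pq.read_table("events.parquet")

# ...and pandas builds its DataFrame from those Arrow buffers, so no
# row-oriented intermediate format sits between the two. Current pandas
# wraps the same path as pd.read_parquet("events.parquet", engine="pyarrow").
df = table.to_pandas()
```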

Jeff:

We're talking about a overall model where you've got much of your data stored on disk in Parquet or Kudu and you pull that data from disk into memory. There are also these other systems that are basically, say, the future is everything is in memory. Like I've done a show about Alluxio, which is basically just everything is in memory. Is that ... could you do that with Arrow, just use, put everything in memory, or have either of you looked at the Alluxio system or another entirely in memory system?

Jacques:

Absolutely. We've talked with the Alluxio guys a number of times about the opportunities to allow Arrow and Alluxio to work very well together, so they are very complementary approaches. Alluxio provides this in-memory grid at the file system API layer, which is very powerful. It's a very common interface, and so people know how to write to it and use it very efficiently. But it's the same situation, which is that the on-disk representation is not necessarily the best representation to do fast processing. And so Arrow is this modern API specifically for how to move data around when the purpose is doing analytical processing.

And so absolutely, you could cache huge amounts of data in memory in an Arrow representation and then allow people to interact with that from different systems. It's actually one of the exciting opportunities that we see with Arrow, the ability to use the data set across multiple systems. So the thing to keep in mind is that if you have an on-disk representation, generally speaking you're going to have to transform the data before you can process it. That transformation can be inexpensive to expensive, depending on the representation on disk. JSON is more expensive than, say, Parquet, for example.

And when you do that transformation, you then have to hold the data in memory in that transformed version. If you maintain your data in memory in an Arrow representation, that is data that is ready to be processed as is, and that allows you to actually ... one of the things we're working on, on-system and cross-system, using things like RDMA, is the ability to do sort of shared-memory processing.

And so what we see with most organizations is a set of hot data sets that are critical to the organization. And if you want to work with that today, you actually hit this challenge, which is: you either have a situation where you have to keep a lowest-common-denominator version of the data in memory, say a disk-based representation, or you have to keep multiple copies in memory that are more efficient for processing, because each system has its own representation.

And so, if you only keep that common, lowest-common-denominator representation in memory, it still means that each system needs to do its own transformation and actually use more memory to hold its alternative representation when it's actually doing its processing. But with a common representation that can also be in shared memory, that actually allows each of the tools to do their processing directly on the stuff that you're holding. And so the opportunity is, if you today use Spark and Pandas and technologies X, Y and Z to do processing on a cluster, you're typically going to hold ... if you're using for example Alluxio, you might hold the disk representation in memory. Then, when each of those things is doing processing, you may also hold their internal representation in memory. And what happens is you actually get into the situation where you're paying for the memory many, many times.

And so many customers that we work with today, they can't ... it's not realistic for them to hold most of their hot data sets in memory. They can hold a few of them in memory, but not lots of them. But the problem is sort of compounded by the fact that different tools have to have their own in memory representation, which means that you actually may be paying five times to hold the data in memory to satisfy all the different users of that data.

And so there's actually this huge opportunity with Arrow, which is, if you can have a shared, in memory representation, which is available via IPC to the different tools, then you can actually hold substantially larger data sets in memory with the same hardware that you're using today.

Jeff:

Saagie is an end-to-end data platform that lets you focus on deriving business value from data. With Saagie Present, you can turn your data into a story, easily sharing a presentation from your data lake. Explore data and drill down into details, creating rich visualizations. With Saagie Manager, you can easily build data pipelines of jobs mixing Spark, R, Python and more. Saagie is great for predictive analytics, whether you're predicting the failure of a machine in your factory or predicting which sales leads will be the most valuable. Saagie helps you take control of your wide variety of data sources and gets them in one place. Check it out at saagie.com. Thanks to Saagie for being a sponsor of Software Engineering Daily.

So if I understand you correctly, what you're saying is even in the Alluxio world, where you have replaced your file system with a memory-based file system, the file format is still something like Parquet or Kudu, but it sits lower down in your memory hierarchy. Because you have to do a conversion from Parquet to Arrow, and you would only do that for your hot data sets, and then for your less hot data sets, you might just keep them in the Parquet format, in memory.

Julien:

Yeah. So the trade-offs of accessing data on disk, or in main memory, or on solid state drives are different. The latency is different. And so the trade-off between the time you spend retrieving the data versus the time you spend decoding the data will be different. And so you will pick a different representation, depending on all of those variables.

Jeff:

Is there a richer cache hierarchy that we can talk about here? Because we're talking about in somewhat binary terms, like you've got these data sets that you don't really care about right now that are in Parquet and then you've got these data sets that are operationally relevant right now that are in Arrow. Presumably, there is a gradient between those two things. So maybe you could paint me more of a picture for how that gradient exists.

Julien:

So it used to be that it was either on disk or in memory. And you would have an in-memory cache, or you would spill data to disk if it didn't fit in memory, or things like that. Now, instead of that split, there's a larger gradient of options that is starting to appear. So first there are these tiers: spinning disks, which have a longer latency for getting to a piece of data and then a higher throughput when accessing a continuous stream of data. Then you have solid state drives, which have a lower latency to get to something. And it's going further in memory with non-volatile memory, which is basically using flash memory in the sockets that attach like the DIMMs of main memory RAM. And so this gives a spectrum of different latencies: spinning disk has a higher latency, then flash drives are faster, then non-volatile memory is faster, and then main memory is the fastest, but it's the most expensive. So you have a gradient of storage cost versus latency of accessing the data.

And so you already have these four levels that are starting to appear, and the trade-off in the representation of the data will be different. You will make a different trade-off depending on how much time it takes to pull the data through. So you will want better compression and denser encodings, versus something that will be faster to process. Arrow is more at the in-memory end of the spectrum, while Parquet is more at the on-disk end of the spectrum, and you have different trade-offs from one to the other for when you use one or the other to get the best performance, whether IO is your bottleneck or CPU throughput is your bottleneck.

Jacques:

To that point, the other thing I'd note there is that we expect that spectrum to become more and more rich. Arrow is the place where we're focused on the in-memory representation, and we're starting with a representation that's very useful for a lot of different situations. It's highly efficient for processing, it's reasonably compact. But there are actually alternative memory representations that can be beneficial for certain kinds of data. And so we already have some capabilities inside of Arrow to support a sparse versus a dense representation for certain types of sub data structures.

But what we expect to see happen ... we actually have a specification already for dictionary encoding in the context of Arrow. And so what we expect to see is that whole spectrum of representations. It's not going to be Parquet or Arrow, it's going to be several different representations, and potentially data sets where, for portions of the data set, you might say, "Hey, I'm going to maintain these six columns in an Arrow dictionary representation, these four columns in an Arrow sparse representation, these 12 columns in a Parquet representation in memory, and these 25 columns in a Parquet representation on disk."

So the representation and the location of the data, what media, will both be configured based on the requirements of that data in relationship to your workload.
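For the dictionary-encoded case Jacques mentions, here is a minimal PyArrow sketch (the column values are invented):

```python
import pyarrow as pa

# Dictionary encoding in Arrow: repeated values are stored once in a
# dictionary, and the column itself becomes small integer indices.
regions = pa.array(["west", "east", "west", "west", "east"])
encoded = regions.dictionary_encode()

print(encoded.dictionary)  # ["west", "east"]
print(encoded.indices)     # [0, 1, 0, 0, 1]
```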

Jeff:

You two both work at Dremio. I interviewed your CEO Tomer, I think last year, and we talked mostly about Drill in that conversation. And I think at that time, Dremio was ... I know Dremio is still in stealth and we don't need to talk much about Dremio today. I think you're in beta right now, and when Dremio is on the market, we'll do a show about it. I'm sure I would love to, at least.

But I'm curious about what you can talk about in terms of why the conversation around the Dremio product has shifted from being focused on Drill, to a focus on Arrow.

Jacques:

Well, I think it's all about this sort of loose coupling. We have people here who are involved in a variety of different open source Apache projects. And we make a lot of contributions there, and care very much about doing so. What we've recognized or what we see as the interesting direction we're going, is that we are moving from a situation where we have monolithic systems. We think of Oracle as this very traditional, monolithic data system where everything has to happen inside of Oracle, to what is a very, very loosely coupled analytics ecosystem. Where it's not one tool I'm using, it's 12 that I'm combining together and it's best of breed for the particular use case I have. So today it might be Spark and tomorrow it might be Drill and the next day it might be Impala that solves a particular use case very, very well.

And so, what open source has been especially good at is building the common components that connect those things. And so that's really where we focus a lot of our time. And so, Parquet is an example of a common component that's used in many, many technologies. We're working on Arrow being incorporated into a number of these technologies as well, as a common component. We're also involved in a project called Apache Calcite, which is a query optimization framework used in a bunch of different projects. And so, really where Dremio is focusing a lot of its time is on contributing to these common components to make a better ... to improve the ability for people to operate in this loosely coupled ecosystem. Because we all know that loosely coupled can be very powerful, but it can also be very challenging, especially if you start to say, "Hey. Well, I can use these tools together, but all of a sudden, it's very, very expensive to use them together."

And so what we're really driving towards is this, how do we make it so that we help to better achieve the promise of loosely coupled, while reducing the pain of loosely coupled.

Jeff:

I was interviewing Tomasz Tunguz, who works at Redpoint. And we had a slight conversation about Dremio, because he worked on the investment from Redpoint in Dremio. And we were talking about other technologies and ... we were talking about GraphQL and, I don't know if you know about GraphQL or a similar project from Netflix called Falcor, but these are projects that are also working on this data interoperability problem, except it's kind of at a different layer of the stack. It's more at the request/response layer. Like if I'm a user logging on to Facebook, Facebook has to fetch all this data from disparate data sources. And they've got a bunch of databases and they're going to pull the likes from one database, the photos from one database, the news articles from one database. It's such a mess of different things, but you can represent all those requests in one big blob of a JSON-like structure, a GraphQL query, and then that request hits a GraphQL server. And it gets federated to all these different databases.

And we had a brief conversation about that. And the more I thought about it, I was like, "Well, you know, right now we have this paradigm where there's this idea of offline data and online data. There's the offline data, the stuff that your Hadoop or Spark or whatever is going to be accessing, and then you have the online data, which is like I log onto Facebook and it fetches my photos and stuff and it's easily accessible." But that is kind of a false dichotomy, because you can imagine a situation where over time, your request to a GraphQL server gets federated to systems that could just access Arrow data structures. And then it's in memory and then it's fast. And then it's like, "Oh my gosh. That's a world of possibility. Because if you take all the offline data and basically make it online data that's accessible to users, I mean, that's a big shift."

Jacques:

Yeah, absolutely. Although I would say that one of ... there's two sort of, it's probably a two by two on this. Which is that there is the concept of offline data and online data. And I think that anybody who's worked in this space, those two worlds get very blurred very, very quickly. And then I think there's a separate sort of set of things to look at, which is analytical purpose versus what I would describe as more operational purpose. So people have the need, for example, to respond to a user request and give them the right information for that request. Which is what I describe as a very operational thing. I need to tell you how many likes this post has. Or I need to share this message with you from your friend, or whatever.

And it's a very operational purpose you're dealing with. Small amounts of data on each individual sort of transaction if you will. But you have a large quantity of these types of operations. I've got a hundred billion, or a billion users are hitting the servers all the time. And then you've got a second scenario, which is a lower quantity of interactions of data, but typically on a wider breadth of data. So that is why we call it an analytical purpose, where I might be wanting to understand how customers are working, or how many users are using a particular feature and I want to understand which ones give up on using the feature, versus those that like to use it a lot and how long do they spend using it.

And that's sort of an analytical purpose. And so I think that what's happening is ... traditionally it's been very much like analytical is offline and operational is online, that's kind of the way that chart was laid out. But I think that you're absolutely right that these lines are getting blurred. On the analytical side, using streaming systems to be able to do what is effectively very short-term or very present-time analytical work is very, very common now. And these operational workloads are starting to look very analytical in their nature, in that you may want to not only see, "Hey, this message just appeared from you," but also how many other people are excited about this or interacted with this message, which is a little bit more of an analytical type of metric.

And so I think that you do see this sort of vision where these things are coming closer and closer together and having common ways to interact with them, is very powerful. That being said, I think that one of the things that I still see a lot is a split in the types of users for each of these things. And you've got what I would say more frequently developers who are doing operational types of uses, where as you see more analysts, business owners, those types of people, who are much more interested in the analytical types of answers.

So you still see some of those separations in terms of the users, but the technologies are getting very, very close together, for sure.

Jeff:

Now, I know we're nearing the end of our time, but just to put things in context, Jacques, you ran the distributed systems team at MapR, and Julien, you led the data analytics pipeline at Twitter, in your lives before joining Dremio. So these are big companies where you get a lot of exposure to the biggest challenges that exist in the big data ecosystem.

Maybe you could each give your perspective on what kinds of things you saw, what challenges you were exposed to, whether it's related to the challenges of columnar formats or not?

Jacques:

Yeah, so I'll go first. So I think that what I see is an extraordinary amount of technical innovation happening in a lot of these places, in data in general. What's happened in the last 10 years in data has been phenomenal, in terms of not only the rise of all of these big data systems and the rise of NoSQL, but the ability for developers to be a lot more agile in the creation of applications compared to what they were able to do 10 years ago. And part of that is the democratization of data control, from centralized IT and DBAs to an engineer being able to sit down, spin up a Mongo instance, build an application and start working on it.

So you get this explosion of capability and that explosion of capability has given people huge amounts of powerful capabilities. I think it's also made developers lives substantially better.

What I've also seen, though, is some level of degradation of a huge portion of analysts' work lives. And the reason is that you've got this data all over the place in all these different systems. So if you think about it from an analyst's perspective, the monolithic world was kind of pretty nice and pretty friendly and pretty clean. The DBA made sure all my data was nice and organized, and I knew exactly where all my data was and I could get access to it. And the process to start new projects was very structured, so that before we started a project we always made sure that the data that needed to be tracked for an analyst to be able to do his job was handled very well.

And so what we've moved towards is this world where things are much more dynamic, much more agile, and people can build things much more quickly. And you get these business applications, which can get into the world in one tenth the time that they used to take. However, what's happened is that the analyst is kind of not always thought of immediately as part of that. And so they often have to deal with the ramifications of move fast, break things on the analysis side. So I need to deal with 20 different versions of the data, because we keep changing how the UI is storing things or how we're tracking things or whatever.

And so, what I see is a huge amount of technology that is trying to help this, but that technology is often complicated to use and requires very, very talented resources to be able to use it. And so you get a lot of success using that technology in the more sophisticated organizations. For example, a lot of people in the Silicon Valley area here have these amazing data engineering and infrastructure teams which can do all sorts of amazing things with all these technologies. But when you get into a lot of other organizations, it's very hard for them to find that level of talent and to be able to stitch these things together to solve real business problems.

And so I think that there's a huge amount of capacity in the technology. I think that sometimes the challenge is that the pieces don't work well together, that it isn't easy to get them to solve your actual business problem. And so that's really one of the driving forces behind what we're doing at Dremio. And obviously we can say more about this as we launch, but it's: how do I make it so that the common user, the millions of SQL analysts that are out there, or the millions of people who are non-technical who are sitting there using Excel or Tableau, have the ability to work and achieve their business goals with the power of all this technology, even without having the massive data engineering force that someone like Facebook or Uber or Twitter can staff.

So how do we give the organizations that don't have that staff the same set of capabilities, the same sort of power? How do we get that to other people in the world?

Jeff:

Julien, do you want to share a little bit of your experience at Twitter?

Julien:

Sure. So what I've seen in a company like Twitter, and that's true of many of those big companies, is that one of the harder problems is to build an ecosystem for people to interact with data. Because pretty much the entire company, every team in the company, interacts with data, and they're not just consuming. They're consuming data, producing data, producing derived data sets, and they all start to depend on each other. And instead of the old world of the data warehouse, where you have a source of truth for all the data and people use it, where there's very much a producer-consumer relationship, we've moved to a world where you have a huge graph of dependencies between different teams inside of companies, and the data keeps evolving and the structure of the company keeps evolving, and you need to build tools to allow different teams to depend on each other reliably. And there are many different aspects to this. It starts from data collection, knowing what's happening, and then you will have a spam detection team that will consume data to try to figure out what is spam versus what's a real user of the system.

Some people will work on categorizing users so that the service can be better catered to their specific needs, and they will all start using those different sources of information. And once users have been categorized, that can be used for spam detection or for relevance. And so you have this complex set of tools, which can be data collection, scheduling, tools for fast interactive analysis, or dashboards. And with all those systems, you have this big living entity, which is the company, that is doing a lot of things.

And one of the challenges is enabling everyone to be independent and agile and do their own work, while still depending on each other. Because the more links you have between teams, the more chance there is for things to change and break everything. Once you build all those dependencies between all those teams, something breaking will impact a lot of people in the company. And so that's where you need to build tools that will make everything more reliable and easier to work with in a decentralized manner, where people can build those relationships and dependencies in a way that's going to be reliable, so they can build trust and be efficient in what they're doing.

Jeff:

Alright. Well that's a great way to close off the show. Julien and Jacques, thank you for giving me your time. This was a really interesting conversation and I'm excited to see what develops in the world of Arrow and particularly with Dremio. Thanks a lot, guys.

Jacques:

Appreciate it. It was good to be talking to you.

Julien:

Thank you.