Data Access for Data Science




My name is Jacques. I am a co-founder and CTO at Dremio. And this talk is going to be about Arrow and Dremio. Dremio is an open source technology. And then I work actively in a bunch of open source projects that are beyond what Dremio is. One of the main ones that I work on is I'm the PMC Chair of Apache Arrow. I was one of the founding members of that project and sort of really got that started. I'm also heavily involved in Apache Calcite, another Apache project around data.

So I'm gonna go through a lot of content. Hopefully that'll make it valuable to you guys. I'm gonna start by talking a little bit about Apache Arrow, then I'll talk a little bit about Dremio and how we approach the world, just for a couple slides, and then talk about data access specifically. So this talk's really about our goal to try to make it easier for people to get at data and make it a less painful part of the process. So we'll go through some examples with Dremio and notebooks. We'll then talk a little bit about caching. One of the things that Dremio focuses a lot on is trying to figure out different ways to cache data, to make things more performant and reduce the duplication of work that people sometimes do, and we'll give a quick notebook example of that.

Really, at the core of this is our perspective that getting data ready for analysis is hard, right? When you talk to a lot of analysts, they spend a lot of time trying to get to data, trying to get it somewhat cleansed so they can actually start analysis. They start doing analysis, and then they actually see there are more problems with it and they have to go back and cleanse some more before they can actually get to the meat of analysis and science. And so that's really what we're trying to focus on and solve. I think if you were at the talk just before, he was talking about many different systems storing data. It's not just an Oracle database anymore, obviously. And so really, that makes it more complicated. Each system has its own internal patterns as to how you get to data. Some of the ones like relational databases make it easier to get to the data; for things that are more NoSQL, sometimes it's harder to get to the data in a standard pattern.


Data access is frequently slow. Some of these systems are designed more for point lookups, and if you're trying to do analysis, you're trying to do bulk reads, and that can be problematic. And really, at the core of it, we believe there are some types of issues that currently can only be solved by filing IT tickets but really shouldn't require IT tickets. The other thing is that when you look at data and you're working with data, there are some parts of data preparation, data cleansing, and data curation that shouldn't be repeated over and over again, but there aren't necessarily the systems in place for people to collaborate around these things and share them. And so that's why we said, "Hey, there should be this new sort of self-service access tier." So I'm gonna start by talking about Arrow because Arrow's really the foundation of all the stuff that we've built.

So if you're not familiar with Arrow, maybe you heard some of the other talks on Arrow. Arrow, at its core, is a standard common in-memory representation of data. Okay. And it was a bit of a chicken and an egg when we sat down. We said, "Hey, we want to make it easier to move between different systems, and the way to do that is to try to get rid of the serialization and deserialization that we typically have to pay for." The problem with doing that is that as long as each system has its own internal representation of data, you're always going to have to pay that serialization/deserialization cost. So rather than looking at memory transport as the problem, you have to look at processing as the problem. And so we basically said, "Hey, how do you structure data in the most efficient way to process it? Let's build some tools around that, and then people will start to adopt it for processing purposes, and now when you want to share information, all of a sudden you have this same representation."

So it's a bit of a chicken and egg. People think about Arrow in terms of "I wanna share data," but really, at its core, Arrow was designed to be a very, very efficient way to process data. Arrow is an Apache Software Foundation project, driven by a lot of different people from different projects contributing to it together. And so, as I said, there are two pieces to this problem. One is, on the left-hand side we've got: how do you move data between systems efficiently? If you look at most systems today, each system has its own internal representation of data, and when it wants to hand something to something else, it has some kind of serialized format that it sends over to the other system, and the other system consumes that. And in many cases, that representation is actually a cell-by-cell interface. So it's not actually a data representation but rather an API, and if you're moving lots and lots of data, working cell by cell is very, very inefficient.


And so Arrow was designed to say, "Hey, let's make it so that representation can be shared and passed between different systems." And this ultimately will allow, and this is what we're driving towards, multiple systems to actually share the same memory and the same data. So if several people are working with the same hot data set, there's no reason that they each have to have their own copy in memory. We can actually use shared memory to solve that. That's a little bit longer term. Right now we're just trying to say, "Hey, even if I am not on the same node, I still wanna make it as fast as possible to move data between systems." And on the other side of this, the processing side, the right-hand side of the world, Arrow was designed to take very good advantage of modern hardware, and that's both CPUs and GPUs. You see that by the different projects that have been adopting it.

And so the concepts: many of you are probably familiar with columnar on disk. Columnar on disk is very efficient for analytical workloads because you might have a data set that has hundreds of columns, but for any particular operation you might only want to read a few of those. So there are benefits to columnar on disk. Columnar in memory also has benefits, but the dynamics are a little bit different. The main thing that happens with columnar in memory is that all of a sudden you can start to line up data in memory that is of similar types. You can organize it in a way that gives you very quick access to things without a lot of overhead structures, and then you can actually pipe that into CPUs and do a lot of things much more efficiently. So many CPUs today have capabilities to do multiple instructions simultaneously.
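To make the row-versus-column contrast concrete, here's a toy sketch using only the Python standard library; the field names and values are made up, and real Arrow buffers are far more sophisticated than this:

```python
from array import array

# Row-oriented: each record is a dict, so the values of any one field are
# scattered across many small objects in memory.
rows = [{"id": 1, "stars": 4.5},
        {"id": 2, "stars": 3.0},
        {"id": 3, "stars": 5.0}]

# Column-oriented (the Arrow idea): each field is one contiguous typed buffer,
# so scanning a single column touches far less memory and stays cache-friendly.
ids = array("q", [1, 2, 3])          # 64-bit integers, contiguous
stars = array("d", [4.5, 3.0, 5.0])  # 64-bit floats, contiguous

# Same answer either way; the columnar scan only ever reads the 'stars' buffer.
row_total = sum(r["stars"] for r in rows)
col_total = sum(stars)
```

The point is the access pattern: summing `stars` in the columnar layout reads one dense buffer instead of hopping across every record.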


They wanna have a pipeline that moves through and works through stuff very quickly, and you wanna be able to keep stuff as cache-local as possible. And so by structuring Arrow memory so that data of a particular field or column sits next to each other, you have the ability to take advantage of all those things. So when I talk about Arrow, there are really three main components. The first is the core libraries: basically dealing with the data structures, moving those things around, and serializing to other formats. For example, one of the Arrow components that exists right now is the ability to move back and forth between Parquet and Arrow. The second piece is what I call building blocks, and the third piece is major integrations, places where Arrow's started to be plugged in and used for more complex scenarios.

So at the core are the core libraries, and there's been a lot of progress there. As the previous speaker was talking about, one of the key pain points that we see is that a lot of big data systems today have data inside of a JVM. It's very hard to get it out of the JVM, and it doesn't play well with other kinds of applications, so a lot of people are using Arrow for that. The Java library is very mature, and the C++ library is very mature, and those two work very closely together to move data between different systems. Then there are things like the Python library, which is layered on top of the C++ library and also very mature. There are also a bunch of other libraries that are a little bit less mature but are getting a lot of engagement and adoption. There's a C layer on the C++ library and a Ruby library on top of that, there's native JavaScript, and now there's one that people have just started work on, a Rust library to interact with Arrow.

So the goal is that whatever language you're working with, you can still interact with this data representation, so that when you're wanting to move from one system to another you can do that very efficiently. The second part of Arrow that I talked about is what I call building blocks, and these are projects within Arrow that are trying to solve specific sub-components of the problem that are much broader than Arrow as a core spec. Plasma is one, which is basically the ability to maintain Arrow memory in shared memory and use it across different applications. Feather is an ephemeral file format that's used to very quickly drop Arrow to disk and bring it back into memory, if you have something where you wanna move data quickly through a disk. And then there are two additional ones which we're working on and hope to have available soon.

One is Arrow RPC. Right now Arrow is an in-memory representation, but there isn't a formal communication mechanism that sits around it, and so people are solving it in their own ways right now. What we're trying to do is come to a consensus in the community on a standard way to do this. It'll start with a single-stream model, but it ultimately will support a parallel-stream model, so you can actually open up a parallel Arrow stream from one system to another if you wanna move data very, very quickly. The other thing is what we're calling Arrow Kernels, and Arrow Kernels are components of processing technology that'll allow you to do something very quickly. So imagine that you want a very fast sort: you would implement an Arrow Kernel for a very fast sort. Or if you wanted to find a distinct set of values very quickly, you would implement an Arrow Kernel to do that.

And so these are also building blocks that someone can incorporate into their application. If I have a problem and I don't even know about Arrow, and my problem is just that I want a faster sort, then maybe what I would do is say, "Hey, if I move my data into an Arrow representation, all of a sudden I can have a faster sort." And that gives a processing engine that may not be as interested in the interchange side of things a reason to incorporate Arrow into their technology, because they get a direct benefit through these Arrow Kernels. So those are the building blocks, and then lastly there are some major integrations, and there are more than this; these are just some of the ones I listed here. Two of the big ones are Pandas and Spark. Pandas is incorporating Arrow into a bunch of its components as a primary representation for data for newer versions of Pandas' operations.


And the other is that Spark has adopted Arrow as a communication mechanism for interacting with Pandas today, but we believe there will be more in the future. I'm gonna skip over Dremio because I'll come back to that in a second, but as I mentioned before with Parquet, there's also a lot of work in terms of moving back and forth. Parquet is what I'd consider the canonical on-disk format for columnar data, and Arrow the canonical in-memory one. There are a lot of synergies there, so those work very well together. And then lastly, one of the nice things that happened recently is that the GPU Open Analytics Initiative also adopted Arrow as the representation of data for GPU analytics. So what you see is it's basically getting adopted in different places. I've heard of some older technologies that have been around 20 and 30 years that are starting to adopt Arrow as a representation because it has a number of benefits.

And then the last one that I've got on the slide here is Dremio, and that's what I'll talk about next. Basically we said, "Okay, Arrow is this building block for a loosely coupled data system, where you're gonna have lots of different things interacting together." What we want to do is build on top of that a data access layer, a way to get to different parts of data and curate and collaborate around those things, and that's what Dremio is. Dremio's actually built entirely on top of Arrow, and all the data manipulated internally in Dremio is in Arrow. So before I get into Dremio, just for a couple of slides: quickly, at a high level, Arrow's seen great adoption across a bunch of different projects. It's growing quickly, and so hopefully it'll add value to you. And one of the things to keep in mind is that some people may work with Arrow directly; other people may work with technologies that use Arrow, right?

It's not necessarily true that everybody's going to be using Arrow directly. That doesn't necessarily make sense. But the goal is that Arrow will make it so that whatever you do use works better together. So on top of Arrow we said, "Let's build a self-service data access platform." Okay. So Dremio: we're a fairly new company, we're about three years old. We launched the product publicly a little less than a year ago. We call it a self-service data platform. It's a new tier; it's not something that existed before. The name of the narwhal is Gnarly, and I thought I had stickers but it turns out I don't, so I can't give you any stickers. Dremio is an Apache-licensed open source project. It's built on three main Apache projects, Arrow, Calcite and Parquet, and it's really designed to extend and interact with different systems as much as possible. As I mentioned, all the execution as well as the input and output is in the Arrow representation.

So Dremio basically looks at the world as: everything has to be Arrow, and any edge of Dremio is Arrow until it has to be something else. So, for example, if you interact with Dremio using ODBC or JDBC, the data actually moves all the way to the client as Arrow, right up to the edge of the client where we have to expose that API. If you create a new source for Dremio, then the moment we bring in data from that source into Dremio, it's Arrow. And so the hope is that the edges become more mature and support Arrow as well, so that everything is Arrow throughout the entire system. The way we talk about Dremio is Google Docs for your data, and really it's about having a UI to manage all the different parts of your data and to live-curate your data based on the different needs that you have.

So the four main components of it are data caching, which I'll talk about in a little bit, data access, data cataloging, and data curation. I wanna find what I want in a data catalog, I wanna maybe clean it up a little bit and share that with other people so they don't have to do the same work again. And so I need to be able to access the data no matter where it is, so that I can find it and do these things. And then lastly, I shouldn't have to repeat the work over and over again, and that's where data caching comes in. So what I'm gonna do is jump over and just show you quickly. Let's go to the browser. So this is Dremio. I don't have anything in there right now, but this is Dremio. It is a UI and a distributed system as well. Basically everybody logs into it. I'm in my home space right now, logged in as Jana Doe, and I have different assets that I might wanna create for myself, but the goal at the end of the day is to be able to create assets and share them among different people.


And so Dremio can access many sources. You can see a sample set of sources that we have here, and we're always adding sources, the goal being that no matter where the data is, we can expose it in a common structure. So, for example, here I've got Elasticsearch, and I've got some Yelp data in a couple of these systems to show you. If you know Elasticsearch, it actually has indexes, aliases, and types, right? If you don't know Elasticsearch, that's basically the way it organizes data. We try to map that to what people might understand as databases and tables, or data sets and databases. And so we do that with Elasticsearch. For Mongo, it's a little bit closer: Mongo has databases, but then they have these things called collections, so we map those databases and collections to something in Dremio. Something like HDFS or a NAS or S3 is a file system, right?

And so you can just traverse the file system, look at different parts of data and see the different things that are in all these different folders. Now before I get into that: basically we try to say, wherever the data is, we're gonna start by exposing it in a way that makes sense, so you don't have to think about what system you're interacting with. In this example, I've actually named the sources based on what type of system each is, but realistically that's not something people should have to worry about, right? The names should be based on the business concepts that are associated: one might be the marketing system, one might be the inventory system. By naming things this way, we can abstract away the complexity of what particular type of data source you're interacting with and just let you focus on your task at hand.

And so we do that in different ways for different systems. So, for example, for HDFS. One of the things that we see here, and I'm gonna go back into one of these directories, so I'm actually gonna go over here and I'm going to remove the format. This has already been set up, so I'm gonna start here. If you came in, what you would see initially would be just another folder, and if you go into the folder, well, if my network is gonna be friendly, oh, there you go, okay, it's a bunch of CSV files or text files, okay? And so this is an example of something where you might say, "Hey look, I just dumped all these files someplace, maybe it's on a NAS someplace. I want someone to analyze them and figure it out." And so what Dremio does is try to say, "Hey, let's see how quickly you can turn this into a data set." So if there's a folder of files, or a set of folders that all have files in them, you can actually say, "Okay, I'm gonna promote this into a data set."

Dremio's gonna say, "Well, here are some example files." And I can see that this looks right to me; it doesn't have headers. I'm gonna say okay, and so now this is a data set inside of Dremio. Okay. And it means that whatever tool you're interacting with, this is now an endpoint. And so if you actually look here, you can grab this name. I'm just gonna paste it in there for you to see. Basically, every data set inside of Dremio has a path against a root. So whatever you're looking for, whatever you're trying to work with, you always have a path that's a known location for this kind of thing, and you can access that from whatever system you want. So now that's a data set. That was the effort to take a directory of files and turn it into a data set that is now sharable. I can set permissions around it so that different people have permission; I can set all sorts of different things on this.


So that, at its core, is a goal of Dremio. Now there are different places where the data is different shapes, right? For Mongo, for example, if you know Mongo well enough, you know that Mongo has no formal schema associated with it, right? Every record can have a completely different schema, and so Dremio also tries to solve that for you. So when we go and interact with sources, we start by sampling the schema of that data and presenting to you what we see, okay? And that can be a simple schema, it can be rows and columns, it can be complex data types, it can be mixed types. We actually support a concept of heterogeneous types, so maybe in one row a field is an integer, and in the next row that field is an array of strings. And we actually give you tools inside the product to interact with those different things. What happens, though, as you might know, is that sampling is not necessarily sufficient, and so Dremio also has a model of schema learning.

So as we're interacting with the data, if we later find records that have different schemas associated with them, we can actually reattempt the operations that you were trying to do, record that information, and not have you worry about it. The goal is to abstract that away from your concerns.


I'm gonna drop over here to the notebook. And so I have a quick example here of working with a notebook against Dremio. This is if you're thinking about it from more of a REPL experience. So just to go quickly through this, assuming that the demo gods are good to me: I just imported a couple of things here and set some options to make things look a little better. I'm using ODBC today. As I mentioned, we're working on something called Arrow RPC, and that will be the last step. So, for example, Pandas works with Arrow today and Dremio speaks Arrow today, but we don't have a standard yet. We're working on that standard so that it actually goes Arrow all the way through. And so this example is running with pyodbc.

So I'm gonna connect up here and it looks like ... let me just restart. Let me reload here and see if the demo gods are more kind to me. Hey, that worked better. Okay. So I'm going through here. I just defined a couple of utility methods here. And as I mentioned, just like it is in the UI, everything is exposed as a single namespace. Now this is where you all of a sudden struggle with different concepts. Dremio provides an arbitrarily deep nested namespace, so you can have folders of things and you can collaborate around different datasets. But when you get to ODBC as a standard, ODBC only has two levels.

It has schema and table. And so we basically expose the nested structure up through the last non-leaf node as the schema, and then the leaf node of the name as the table. Once we have a native Arrow interface, this will be more standard. But anyway, you can see here's a bunch of schemas, and I can actually just say I'm going to use what's called the Mongo Yelp schema. That's dropping me into that particular subspace. And I can see that I have some tables: business, checkin, and review.
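A minimal sketch of that setup might look like the following, assuming a DSN named "Dremio" has been configured for the Dremio ODBC driver; the DSN name, the dataset path, and both helper names are assumptions for illustration, not part of the product:

```python
def quote_path(parts):
    """Build a fully qualified Dremio dataset path, quoting each segment."""
    return ".".join('"{}"'.format(p.replace('"', '""')) for p in parts)

def dremio_query(sql, dsn="Dremio"):
    """Run a query against Dremio over ODBC and return a pandas DataFrame.

    Requires the Dremio ODBC driver plus the pyodbc and pandas packages,
    which is why they are imported lazily here.
    """
    import pandas as pd
    import pyodbc
    with pyodbc.connect("DSN={}".format(dsn), autocommit=True) as conn:
        return pd.read_sql(sql, conn)

# Example (needs a live Dremio server):
# df = dremio_query("SELECT * FROM " + quote_path(["mongo", "yelp", "business"]))
```

The quoting helper matters because Dremio paths can contain spaces and mixed case, so each segment gets double-quoted in the SQL.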

Oh yeah, sure. Sorry, I should have asked that question. How's that going? Okay. Sorry about that. So obviously that's not very exciting: okay, I get some data from some place. Then you start to ask questions like, how do you deal with complex data? When we expose complex data to ODBC or JDBC, we always expose it as JSON. So whatever the underlying system is, maybe it's Parquet, maybe I'm reading Avro files, maybe I'm reading from MongoDB, maybe I'm reading from some relational store: if the data structures get complex, then we expose them as JSON, because we feel that's the easiest way for systems to interact with them. So, for example, here ... I actually didn't show you over here, but this business dataset, if you actually look at it, you can see that there are actually complex values in this field.

This is "hours": it's a complex structure, the map here is a struct. And then the categories field over here is an array of strings. Okay? So some of the data types are complex. And so if you go into the notebook here and we run this dataset, you'll actually see that these come back as JSON strings. And so the easiest thing to do here, and I just created a little helper method that basically does this, is to extract out individual values and say, "Okay, now I've got a structure inside of Python and I can interact with that and potentially pull a single value out of it, like what's the close for Friday?"
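That client-side step might look roughly like this; the sample record and the helper are made up for illustration, since complex columns arrive over ODBC/JDBC as JSON strings:

```python
import json

# A complex 'hours' column as it might arrive from ODBC: one JSON string per cell.
cell = '{"Friday": {"open": "11:00", "close": "22:00"}}'

def json_field(cell, *path):
    """Parse a JSON cell and walk down a sequence of keys."""
    value = json.loads(cell)
    for key in path:
        value = value[key]
    return value

friday_close = json_field(cell, "Friday", "close")  # the close time for Friday
```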

Right? So I can interact with it on this side. If it makes sense, if it's something that's commonly done, maybe I want to do it inside of Dremio so I can share it more easily with other people, so that they can also use it as a data endpoint. And so here's the same thing: I want to get the Friday close. In Dremio, I can actually also de-reference that directly if I want to, because while we return it as JSON, we actually know that it's complex inside of our system. The reason for JSON is that right now the APIs are not really mature for these complex, arbitrarily-schemed things. And so I run that, and then all of a sudden I'm like, "Okay, now Dremio is giving me back the Friday close directly rather than me getting the JSON blob, reading it and parsing it inside of Python and doing it there."

But either's fine. So, that's basic data access. Okay? Now, I just created this; I just went to this Mongo Yelp business data. I may look at this data set and realize, "You know what? What I really want is the zip code out of this data, and I want to be able to use that zip code on a regular basis." And so Dremio tries to help you with these things. I can see here that the street address actually has the zip code in it, so I can say, "Hey Dremio, try to figure out how to extract the pattern I just highlighted." And it gives me a couple of samples, gives me more samples down here, and says, "Hey, here are some different ways that I think might make sense to extract the data."

And we're always trying to add new algorithms for this. And then we score these things based on how often they match. So here we can say, "Hey, you know what? It worked most of the time. Maybe it's not perfect, but I'm going to just go with this one and call it zip code." So I apply that, and Dremio's gonna say, "Okay, I'm gonna apply that," and now that's a zip column. So now I've created a derived dataset. Okay? At this point, I haven't copied any data, but I want to make this available to other people in my organization. And so I've actually saved this into a little folder down here, and I call this "My Business". Okay?
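Client-side, a rough equivalent of that extraction might be a regular expression like this; the sample address is made up, and Dremio's own suggested patterns may of course differ:

```python
import re

# Pull a 5-digit US ZIP code (optionally ZIP+4) off the end of an address.
ZIP_RE = re.compile(r"\b(\d{5})(?:-\d{4})?\s*$")

def extract_zip(address):
    match = ZIP_RE.search(address)
    return match.group(1) if match else None

extract_zip("1912 Main St, Las Vegas, NV 89104")  # matches the trailing ZIP
```

Like Dremio's scored suggestions, this pattern works most of the time but not always, which is exactly the trade-off described above.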

So at that point I'm like, hey, here's this new access point. I can grab this, I can go into my notebook here, and I can say, okay, I'm going to just ... I think I actually typed this already, but I'll type it again. I can say, "Okay, select star from access." And now I get access to that data, and there's the zip code field. As an endpoint, this is something that I can share with others. Possibly I secure the original dataset and only expose this version of it, because maybe people are only supposed to see this version. Maybe I want to constrain it and take some fields out that people shouldn't be able to see. And so that's great. But you know, what I actually realized is I also want to analyze by categories, and right now that's a complex structure.

So I'm just gonna un-nest that. And I'm actually going to rename it, but I misspelled it; let me redo it with the right spelling here. And so I save this, and okay, great, I've saved that now. And now when I go back in here, I can rerun this thing, and hopefully if I scroll over here, I see category there. So now category is here, right? And the idea, and this is actually the model that we're building for, is that it makes sense to do some things in your analysis tool, but some things are actually endpoints that you want to expose to other people.
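That un-nest is the same flattening you could do by hand in Python; here's a tiny sketch with made-up records:

```python
# Each business row carries an array of categories; un-nesting produces one
# output row per (business, category) pair, like Dremio's un-nest operation.
rows = [{"name": "Cafe A", "categories": ["Coffee", "Breakfast"]},
        {"name": "Bar B", "categories": ["Bars"]}]

flattened = [{"name": row["name"], "category": category}
             for row in rows
             for category in row["categories"]]
```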

So you should do those inside of another layer, so that they can actually be a shareable thing that isn't code, right? It's very hard to share code. It's very hard to share a part of a notebook so that other people use the exact same steps. And so that's something that we want to be able to expose through here. So this is just another endpoint now that you can interact with. And a user that comes in doesn't even need to know whether a data endpoint is a complex data endpoint, a physical data endpoint, or a virtual data endpoint. And the goal is that Dremio, at a bare minimum, will not persist anything that is unnecessary to persist. But then when we talk about caching, we'll actually say, "Hey, we can make choices about what to persist to make things go faster, independent of these business concepts that people are interacting with."

So a user can interact with My Business access. Today, that might be a virtual version of the data. Tomorrow, we might decide that it's such an expensive set of operations that we're going to persist it. But the user continues to interact with it without having to worry about those details. So I think that's what I've got in this little notebook. Oh, one other thing: if you are collaborating around datasets, you may want to go over here and upload a data set yourself, and we also allow that. So if you want to collaborate with other people, you can upload something like an Excel document here. And assuming that we're good here, extract the field names, because I know that there are field names in this doc.

And, let me see ... yeah, I think it's getting behind me, sorry about that. So I can extract out the field names, save that, and now that's another dataset, right? This happens to be a physical data set in my home space, but I can access it just like any other path. Everything inside the system is always a path that people can interact with. So it's not trying to replace the tool you're using to do analysis; it's just trying to make it so that the way you get to the datasets is faster. And that also means we have search. So if I search for Mongo, which is not a very interesting search, I can see all the different datasets that are Mongo, scored by how commonly they're used and that kind of thing.

So on top of that, you may also want to understand where data came from. Here's another example of a data set; I'm going to go in and edit this guy. This is a dataset which someone else built for me, and I don't really know where it came from. But I can click right here and I can learn. We always show the lineage of all the datasets in relationship to each other, so you can always see where things are coming from. In fact, this is actually a data set which is a combination of three different datasets, two of which are from Elasticsearch and one from Mongo. So I can always understand where my data is coming from, but I don't necessarily have to worry about it. If I just want to interact with that data endpoint, I can. So that's a quick overview of access.

So the second part of this is about caching. So yeah, it's great that I can access these datasets and I can find them. But what we see over and over again is people doing the same operations again and again, and that becomes very, very expensive. And so the way I think about this is in terms of distance to data. So when you're trying to get to what you want, the data starts as something else, right? And you're going to do some work to get to what you want. And distance to data can be thought of as the work that needs to be done, and the resources required for doing that work. It may be that the work takes 10 seconds, but I need a thousand nodes to do it in 10 seconds. So it may be a pain point of cost or it may be a pain point of latency.


So there are different dynamics for different parts of the organization, but it's really about how do you reduce that distance, right? And the obvious thing to do is to create an intermediate version of the data that's closer to what you want. And actually, everybody does this already. You make summary versions of datasets and you start working with those instead. Okay? The goal being that, "Hey, guess what? All of a sudden I've turned the original distance to data into some new, reduced distance to data." So if 10 different people do this, I get a substantial reduction in the overall amount of work that gets done, and hopefully we get better performance for everybody. And so the concept of reflections, which is the other piece of Dremio's capabilities, is about allowing that to happen without the user having to worry as much about it.

And so, these relationships can actually be complex as well. It might be that certain representations of data satisfy certain questions, but for other questions, maybe you need some portion of the raw data along with cached versions of the data to get the answer. So the relationships between these things can be very, very complex. And as I said, you already do this today: everybody here probably creates summarization data on a regular basis and then starts working against that data. You partition it differently. You clean it. And you summarize it for some purpose. The challenge that I have seen comes when you ask, "How does a user know what the right dataset to use is?" Right? As organizations get bigger and you have lots and lots of these summary datasets, sometimes what happens is people start managing all their own summary datasets, and so they don't know about other people's datasets.

Maybe it's all being shared in a common space, but no one knows what any of them mean. And so people are building the same ones over again. And even if you could get to the perfect world, where there's perfect information and everybody knows about every dataset and which summarized version of the dataset is the best place to start. Okay? Even in that situation, the reality is that the dynamics of what you need change as you're doing the analysis. So even if the first question you ask is good for the raw data, because it's such a broad question that you really need to go to the raw data, as you're drilling down into what you're trying to understand with your analysis, it might make sense to switch to another summarized version of the dataset.

But usually once you've built up an analysis, you really don't want to rework the first half of it with a different dataset. Right? And so people are not inclined to change even if they picked the right one to begin with. Dremio's goal is basically to do that automatically. So if today, on the left hand side, you copy different versions of the data, summarize them in different ways, and then have the users pick, Dremio's goal is that instead you interact with the same data endpoint, or with logical endpoints that people are interested in, and those logical endpoints are completely independent of the question of what should be physically persisted to improve performance.

And so, then Dremio can say, "Hey, I'm going to pick the right version of this data to make things go faster." This can get very complicated, but to give you a simple view of what's going on here: Dremio is choosing to save different versions of data that are alternative versions of the original data. Right? So let me go through these examples a little bit. Imagine that you want to get some data. I'm scanning some data, say table one. I'm projecting two of those columns: I want columns A and C, I don't want all the columns. And then I actually want to do an aggregation where I'm grouping by, or rolling up, A and looking at the sum of C.


Okay? And then after that, I really only want to look at rows where that sum, call it C prime, is less than 10. Okay? So this might be an operation that you want to do as you're composing an analysis. Dremio looks at all of its reflections and tries to identify reflections that might be able to satisfy some portion of this question. Okay? And imagine that inside the system, Dremio may have 100 reflections associated with the dataset or set of datasets that you're interacting with. As you ask for different answers, Dremio will go and look at the different options and say, "Which versions support this question, or parts of this question?"

And so, in this example, maybe we have a materialization that we call reflection one. Okay? It's scanning that same target table, table one, but it's aggregating by a roll up of A and B and then doing a sum of C. Okay? Well, if you think about it a little bit, you'll realize, "Hey, you know what? Group by A and B with sum of C is actually sufficient to answer the question that's being asked on the left, which is group by A, sum of C."

Okay? It's not the answer yet, it's part way to the answer. Okay? And so Dremio will recognize that pattern and say, "You know what? This materialization, reflection one, can actually be substituted for the original operation." And then we realize we need to do one additional roll up, right? We can't just give you back that data, because it's not summarized all the way. And so we actually roll things up, so that we roll up to just the grouping by A. Right? And then we also realize that, guess what, the original data wasn't filtered, but I can apply this filter on top of the new version of the data. So I'm still gonna apply that filter. Right? And so a user comes in, they ask the question on the left, Dremio recognizes that this thing that was already saved is part of the answer. And so it will replace it and say, "Use this part of the answer," and then it will do the rest of the work that it needs to do.
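To make that substitution concrete, here is a toy sketch in plain Python of the rewrite being described: a persisted reflection grouped by (A, B) with SUM(C) gets rolled up one more level to answer GROUP BY A, SUM(C), and the remaining filter (sum less than 10) is applied on top. The data values are invented for illustration; the real engine does this as relational algebra, not dictionaries.

```python
from collections import defaultdict

# Pretend this was persisted earlier as reflection one: GROUP BY A, B -> SUM(C).
reflection = {("a1", "b1"): 4, ("a1", "b2"): 3, ("a2", "b1"): 20}

# Step 1: roll the reflection up one more level: GROUP BY A -> SUM(C).
rollup = defaultdict(int)
for (a, _b), sum_c in reflection.items():
    rollup[a] += sum_c

# Step 2: apply the remaining filter (sum of C less than 10) on top.
answer = {a: s for a, s in rollup.items() if s < 10}
print(answer)  # → {'a1': 7}
```

The point is that the original table is never scanned; only the (much smaller) reflection is.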

And so, if you imagine this dataset being a large dataset, odds are this aggregation is gonna substantially contract the dataset and therefore substantially reduce the amount of time that people have to wait and the amount of resources required. And there are lots of different patterns, right? We have over 10,000 tasks that are specifically focused on how you identify different patterns and replace and match these things. Okay? Two other examples that I have here, just quickly. One is when you're doing joins between different datasets. We may recognize that if you're doing an aggregation on a join, the properties of the aggregation are such that if we have an aggregation on one of those tables, we can substitute that in and still do the join after the aggregation, right?

We also look at different things like sorting and partitioning. In some cases, you may have datasets where you do needle-in-a-haystack operations, right? An example that we always see is that you have a bunch of support reps sitting on the phone who need to look at some portion of the data very quickly, but it's a different portion for each user, and that may be one workload. But you may also have another workload where you're trying to look at large chunks of data from a different perspective. Maybe I want to look at everything for the last 24 hours very frequently to see what's going on there.

And so, you can actually create different versions of the data that are sorted by different things and partitioned by different things. And then as those operations come in, Dremio will say, "Hey, you know what? I can get to this data quicker if we go with this version of the data versus that version of the data." Even if we're interacting with full-granularity, full-resolution data. So reflections, basically: a reflection is a materialization designed to accelerate operations. It's transparent to the data consumers, so those endpoints don't change. You start interacting with an endpoint, and if someone in the organization realizes that, hey, we can benefit by caching this endpoint or some portion of it, then all of a sudden that endpoint is faster.

But the user doesn't have to constantly change what they're doing inside their workbooks or whatnot. And the other part is that they're not required. So you can start using Dremio to interact with different systems and not even have to worry about the reflections. But as you see different opportunities for caching, you can take advantage of them. And the one other thing to understand about reflections is that we focus very much on using memory, but we don't use memory to persist. We actually choose to persist reflections, but then we have a very fast way of turning them back into Arrow. So we basically get back to Arrow very, very quickly and get them back into memory when people need them.

But what we found is that most people want to save their memory for working space; they can't afford to pin large datasets in memory. So we actually persist the reflections. I actually realized that I didn't do all of my data access stuff, so before I go into the reflections overview, I want to do a little bit more on data access. The other thing that I didn't talk about is that we focus very much on leveraging underlying source capabilities. It's very nice that, hey, we create a simple facade for things, right? But if we're not intelligent about how we use the resources below, it's not going to help you much, and actually things are going to be kind of crappy, right?


And so we try to use things as effectively as possible. So for example, Elasticsearch, one of the sources that we interact with: we expose it through a SQL interface, just like anything else. In this case, one of the things we support is that you can use what's called CONTAINS. CONTAINS, as a concept, is a standard SQL relational database concept. We extended it so that we could support using the Lucene query language to interact with Elasticsearch through those CONTAINS operations. And so, if you know the Lucene query language, what we have right here is: we want to see the records that have "awkward" and "date" as two words, and we want them proximity-close to each other. So we want them within two words of each other, okay? Now, if you think about it, this is an operation which search engines are great for, so Elasticsearch can do this, but most systems can't. We've exposed this through this capability, but obviously if we tried to do this in our system without indexing, it would be very, very expensive. Even if you try to pull it back into Python to do this operation, if you've got millions of records, it's going to be very expensive. But if you've got indexes that are designed specifically for that, then it goes very well. So, I think I actually need to reload this, because these seem to die after I suspend.
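As a sketch of what the proximity search looks like, here is the Lucene clause being described, composed in Python. The Lucene proximity syntax ("two words"~N) is standard; the CONTAINS(...) wrapper and the table path are assumptions about Dremio's syntax based on the talk, not an exact reproduction of its grammar.

```python
def lucene_proximity(words, distance):
    """Build a Lucene proximity clause: the words within `distance` of each other."""
    return '"{}"~{}'.format(" ".join(words), distance)

clause = lucene_proximity(["awkward", "date"], 2)
print(clause)  # → "awkward date"~2

# Hypothetical Dremio-style SQL pushing the clause down to Elasticsearch.
sql = "SELECT * FROM elastic.yelp.review WHERE CONTAINS(text:{})".format(clause)
```

The key idea is that the clause runs against the Elasticsearch index rather than being evaluated row by row on the client.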

Let's go down here. This first operation, we got back some different values, and I'm going to go look at the first of these and see: hey, here's the text. So here's an example where I'm pulling data from Elasticsearch and I want to do a proximity search. I'm getting back the data that I need, and now I can dereference it and interact with it as a data frame like anything else. Okay.

Behind the scenes this is actually fairly complicated, because Elasticsearch doesn't expose this kind of interface. They're working on something now, but it's still very early on. So how do we do this? Here's another example which is also kind of interesting. This one's very simple: I want to look at the number of reviews per year, so this is looking at the Yelp dataset again, for a particular time period. This is standard SQL; I can do this stuff and it all makes sense. What I want to do, though, is run this query and hopefully in a second ... So now I get back all the data for what the counts were for different years, by stars.

To give you some sense, and I pasted it in here to give you some sense of this: what Dremio is actually doing behind the scenes to answer that question is submitting work to a cluster of Elasticsearch nodes. We actually interact with individual shards to try to get as much performance as possible out of this operation. This is the Elasticsearch operation that we're sending to Elasticsearch. It's saying: hey, I want to do a range, the from period is this, I want to include the lower value inside the range, there's no upper part of the range, and then I want to do an aggregation with a term, then another aggregation nested inside with another term, and inside of that I'm actually going to execute a Groovy script. This is because we're hooked up to Elasticsearch 2; with Elasticsearch 5 we write Painless scripts instead.

I'm going to run a script which follows SQL null semantics so that we do things correctly, because Elasticsearch doesn't index things that aren't there, so we have to handle missing values, and then return that data. The goal is that Dremio has figured out a way to encapsulate all of those details so you don't have to care at all that this is Elasticsearch, but we can still answer back as quickly as possible, right?
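The pushdown being described has roughly this shape in the Elasticsearch query DSL, written here as a Python dict. The field names (date, year, stars) are assumptions about the Yelp mapping, and this sketch omits the generated Groovy/Painless script that handles SQL null semantics.

```python
# A range filter with a lower bound and no upper bound, then a terms
# aggregation nested inside another terms aggregation (year, then stars).
es_query = {
    "query": {
        "range": {"date": {"from": "2014-01-01", "include_lower": True}}
    },
    "aggregations": {
        "per_year": {
            "terms": {"field": "year"},
            "aggregations": {
                "per_stars": {"terms": {"field": "stars"}}
            },
        }
    },
}
```

Because the grouping runs inside Elasticsearch, only the small aggregated result travels back to Dremio, not the raw reviews.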

Elasticsearch is a perfect example of a system that is very good at certain operations but not as good at others. If you want to talk about raw read throughput, Elasticsearch is generally two orders of magnitude slower per thread than something like a Parquet file. If you wanted to read all this data back into a system, pull it into whatever client you're using, and start trying to analyze it there, you're going to be waiting weeks. But the indexing is really good, so if you can take advantage of the indexes inside of Elasticsearch and get down to the portion that you're interested in, that can really benefit you.

Mongo, another example. Mongo has its own way of exposing things, called the aggregation pipeline. And so here's another query where we're saying: hey, let's do some counts. We're doing an aggregation over how many reviews there are for each number of stars. How many four-star reviews are there? How many three-star reviews? And we're also doing an operation where we only want reviews that have the word "amazing" in them. In this case I can't use the Elasticsearch capabilities because we're not inside a search engine, so I have to use something that's more like a LIKE clause, which you can actually convert into a regex inside of Mongo.

If I run this operation, what you'd expect actually does happen; we get the answer back here, let's see. Review gods, be nice to me. Review gods aren't going to be nice to me, all right. There it is, okay. And so, as we might expect, the people who say "amazing" are likely to rank high, of course. They said they had an amazing experience; people don't usually use "amazing" in the negative, so that makes sense. And so again, what does Dremio do? It converts the query into a different version of the syntax that makes sense to that system. This is the aggregation pipeline it has generated: a projection, a grouping, some more projections, then a match, which is the filter, and then some more projections out. And it's actually ordered this way.
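A hand-written pipeline in that spirit might look like the following, using PyMongo-style dicts. This is a simplified sketch, not Dremio's generated output: the stage order and field names (text, stars) are assumptions, and the LIKE on "amazing" is shown as the $match regex it would be converted into.

```python
# Filter reviews mentioning "amazing", then count reviews per star rating.
pipeline = [
    {"$project": {"stars": 1, "text": 1}},
    {"$match": {"text": {"$regex": ".*amazing.*"}}},  # the LIKE, pushed down as a regex
    {"$group": {"_id": "$stars", "review_count": {"$sum": 1}}},
    {"$project": {"stars": "$_id", "review_count": 1, "_id": 0}},
]
# With a live connection this would run as: db.review.aggregate(pipeline)
```

As with the Elasticsearch case, the filtering and grouping happen inside Mongo, so only the per-star counts come back.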

So, you're like, okay, what happens if my data's already in a relational database? Well, odds are it's not in one relational database, it's in many. And so we actually try to make that consistent as well. Every relational database has some basics that are the same, like FROM and SELECT, and WHERE and GROUP BY, that stuff's all the same, but then you start getting into things like date functions or how you're going to limit results and that kind of thing. And so, for each different system, we basically take a canonical representation of SQL and then turn it into the appropriate representation for that system without you having to worry about it.

And so, here we've got an operation which is just a select with a limit of one. But the actual Oracle query, because we're hitting an Oracle database in this case, uses ROWNUM, because that system doesn't support the LIMIT syntax. Or say you're looking at SQL Server and you don't know this: I did an analysis where I'm extracting the year from some dates, and I used the EXTRACT syntax, which is the standard one, but it turns out SQL Server doesn't support the standard one. And, by the way, SQL Server uses brackets instead of quotes for quoted identifiers. Dremio then says, okay, I'm going to write this as the right operation, so I don't have to figure out that I need an inner query converting hire date into a year using the YEAR function, because that's how SQL Server does it.
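The two rewrites just mentioned can be sketched as a toy dialect translator. This is only illustrative of the idea (Dremio's actual dialect handling, via Apache Calcite, is far richer); the function names and the "default" dialect behavior are inventions for the example.

```python
def limit_one(table, dialect):
    """Render 'first row only' for a given dialect."""
    if dialect == "oracle":
        # Older Oracle has no LIMIT clause; ROWNUM restricts the rows instead.
        return "SELECT * FROM {} WHERE ROWNUM <= 1".format(table)
    return "SELECT * FROM {} LIMIT 1".format(table)

def extract_year(column, dialect):
    """Render the standard EXTRACT(YEAR FROM col) for a given dialect."""
    if dialect == "sqlserver":
        # SQL Server lacks standard EXTRACT; it uses the YEAR() function.
        return "YEAR({})".format(column)
    return "EXTRACT(YEAR FROM {})".format(column)

print(limit_one("emp", "oracle"))              # → SELECT * FROM emp WHERE ROWNUM <= 1
print(extract_year("hire_date", "sqlserver"))  # → YEAR(hire_date)
```

The user writes the canonical form once; the per-system rendering is the engine's problem.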

The goal is that you can worry less about these details, because Dremio's going to worry about them for you. And this is one of the things to really understand: it's very easy to create a system which connects to sources and allows you to interact with them. Most systems, even if they try to provide an abstraction, are not very good at leveraging the underlying source's capabilities. So that's one of the places where Dremio focuses: taking advantage of each source's capabilities so that things go quickly.

That's a little bit more about data access, but then, going lastly through the reflections: I just have some more setup up at the top here. We have this larger dataset inside of Dremio right now. This is a one billion record dataset. I'm doing a count on it right now; counts are always optimized, so that's not an impressive time for the count, but it's a one billion record dataset. Just so you know, this is a moderately sized machine, not very big, 16 GB, four cores, on Google's cloud, I think, something like that. This dataset is inside of Hadoop. It's actually a set of CSVs, I don't remember how many. 500 GB of CSVs, something like that, I don't remember the exact size. Anyway, big in records.

I can interact with this endpoint and see the data. Here I'm running a query, not a very exciting query, I'm just getting three records, and it takes 0.28 seconds. Okay, whatever, I don't care. So I can look at the granular data; I'm interacting with the granular data. But at some point I might want to do an analysis that's more aggregated in some way.

So here I'm going to do an analysis where I look at the number of trips by year, and I'm just going to plot that in a simple chart. If I run that, I just got the answer back in 0.12 seconds. This was an operation that went into Dremio and said: hey, for a billion records, give me back the answer for how this stuff aggregates together. I didn't change who I'm interacting with, so I'm interacting with the same dataset here that I was interacting with up above. I can see the granular data when I'm interacting with this, but Dremio has recognized that I have reflections for this data, something it can take and turn into this representation. It's not exactly that, and so if you jump over here to the product, you'll actually see ... Just so you know, every one of these operations that was just happening is live against this system, so you can see there's probably one that took a longer time. Yes, I think it was the Mongo one; that one took a little bit longer, a few seconds.

Anyway, if you look, these last couple of operations actually have a little flame next to them, and what that means is that we considered and actually decided to use one of our reflections. In this case we decided to use this reflection, which is one hour and two minutes old. So we basically said: hey, we persisted something an hour ago, and that's going to solve a large portion of the question you're asking, so we're going to use that instead of the underlying data.

Of course this brings up the question of staleness, and so there are a bunch of tools inside of Dremio to control how much you're willing to look at stale data. You might say, for this dataset I am not willing to look at any stale data, but for that dataset, one day old is completely fine. So you can control that at a dataset level and create those reflections. As a user I can then do these operations and take advantage of different representations.

In that first example I'm probably taking advantage of some partially pre-aggregated data. Here's another example where I actually want to interact with raw data. In this example, let's say I'm doing an analysis where I want to understand what people are doing on holidays, so I'm looking at the Fourth of July and I want to see the raw detail, bring that back to Python, and then do a bunch of different work on top of it, because it may be that I want to bring all that data back. I'm actually going to run this operation. A billion records ... Okay, so I did a count; that's not that exciting because I did it inside of Dremio. I'm going to bring back 100,000 records ... So it took me four seconds for the 100,000 records. Let's see what it takes for the whole dataset here.

Let's see, I don't know exactly how long this takes. I think it's ... Yes, it's just under two million records. There we go, we got the two million records back. If you imagine that today you're sitting there with a billion records and you need to slice that down to the portion of the data that's relevant to the particular analysis you're doing, and you want to go from a billion records to a million records quickly, this is an example of how you can do it. What's happening here is we're actually taking advantage of a reflection, but it's not an aggregation-based reflection; it's one based on a different partitioning. In this case we have a reflection that's partitioned by date, and so we can very quickly jump to the portion of the dataset that is relevant to the data you're interacting with.
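Why a date-partitioned reflection makes the Fourth of July query fast can be sketched like this: if the data is laid out by date, a filter on date only has to read the matching partition. The records and field names here are invented for illustration; a real reflection stores partitions as Parquet files, not in-memory lists.

```python
from collections import defaultdict

def partition_by_date(records):
    """Group records into per-date 'partitions', the way the reflection lays data out."""
    parts = defaultdict(list)
    for rec in records:
        parts[rec["pickup_date"]].append(rec)
    return parts

trips = [
    {"pickup_date": "2016-07-04", "fare": 12.5},
    {"pickup_date": "2016-07-05", "fare": 8.0},
    {"pickup_date": "2016-07-04", "fare": 22.0},
]
parts = partition_by_date(trips)

# A filter on pickup_date now reads one partition instead of scanning every row.
july_4th = parts["2016-07-04"]
print(len(july_4th))  # → 2
```

At a billion records, that difference between scanning everything and jumping straight to one partition is what turns minutes into seconds.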

Again, if you look back, we didn't actually change what you're referencing, but behind the scenes we did something different so that you can get the answer back more quickly.

In conclusion, the way that we model the world is that there's some stuff that you really should be collaborating more about, and you're probably doing it somewhat in notebooks today, somewhat in just scripts, and you should probably be sharing those as data endpoints and data assets that other people can work with. You're still going to do all your analysis where you do it today, but you have to worry less about how to get to data and less about the performance of working with it.


It's on GitHub, check it out! Go to our website, you can download it pre-compiled, and if you're interested in Arrow, please join the community there as well. That's all I've got, thanks guys.

Speaker 1:

Great, thank you so much, great talk. Time for a couple of quick questions.

Speaker 2:

You showed an example where you went to Elasticsearch to do a query and to deal with the distance between two words. In Dremio, suppose my data's currently stored in NoSQL but I really would like to have it indexed. Is there a construct in Dremio at the moment that would push data into one of those systems? All of the examples were selecting data from a source and bringing it into a notebook.


If I understand your question correctly, you're asking how you can take advantage of different systems in terms of performance. So for example, I have some data that's in one system today, but I realize that some portion of that data would be better used for analysis purposes in something like Elasticsearch. How does Dremio work with that? What we do today, and what we just released today, ironically, is something we call external reflections. Dremio's initial focus is very much on making it so that you don't have to choose which system to use to interact with the data, but what we actually have support for in the release that you should be able to download now, or in the next couple of hours, is the ability to tell us how data that you may have created externally relates to other data.

Dremio doesn't yet push data itself: the only way that Dremio persists data today for materialization purposes is as Parquet data. In the future we may support writing to other systems, but if you have a flow that can handle moving the data to Elasticsearch, you can actually tell Dremio: my data is in Elasticsearch, relational database X, and my Hadoop cluster, and this is how the three relate. You provide one endpoint, and then Dremio will correctly use the right system based on performance; we'll look at which reflections are available. This is what we call external reflections. The normal reflections inside of Dremio are Dremio-maintained: Dremio persists the data, knows all about it, maintains it, updates it. But if you have a second system and you want to do it externally, you can do that now: describe what that data means in relationship to the original dataset, and we will automatically pick the right one for you.

Speaker 1:

One more quick question in the back.

Speaker 3:

Architecturally, what is Dremio ... What type of hardware can it run on? Does it run in Docker containers? From a deployment perspective, what's required?


Deployment-wise, we actually make it easy to try out. For a non-production deployment there are downloadables for you if you're on Linux, but you can also download a Windows or Mac app, so you can just try it on your machine. Now, obviously, if you're trying to do large amounts of data processing that might not be the right thing, so Dremio basically has a concept of executors, and then a coordinator or set of coordinators. Coordinators are the thing that runs the UI and maintains all the metadata you're working with. Usually you'll have one or a small number of those, and they might be deployed inside of Docker, or on an edge node in a Hadoop environment, or on a more long-lived node inside of Amazon EC2 or something like that. The executors are scaled based on the amount of work that Dremio itself needs to do, and they can be ephemeral, so you might spin up 50 of them to do a piece of work and then bring it down to three, which is what you normally keep running, or something like that.

We also have an integration specifically with YARN, because a lot of the people who are using this are Hadoop users. You can actually have Dremio interact directly with your YARN resource manager to expand and contract, provisioning Dremio executors against YARN resources.

One more piece on that: for a production deployment, typically the smallest node we would do for executors would be four cores and 16 GB, but some customers run up into the hundreds of GB of memory per machine.