
Using a Data Lake Engine to Create a Scalable and Lightning Fast Data Pipeline

Transcript

Justin Dunham:

Great. Well, hello, everyone. Welcome to the Using a Data Lake Engine to Create a Scalable and Lightning Fast Data Pipeline webinar. We are really excited to chat with you today. We're going to be demoing Dremio a little bit at the end, and we'll be talking about modern data pipelines and where Dremio fits into that.

Justin Dunham:

I'll start by introducing our two speakers for today. We'll go to the next slide here. Today, you'll be talking with me, Justin Dunham. I'm Senior Director of Strategy here at Dremio. And we also have on the line Doctor Ryan Murray, who is a principal consulting engineer here. For anyone who's curious, Doctor Murray's thesis was on the interaction of atoms in intense laser beams. If you have interest in that topic, feel free to email R-Y-M-U-R-R at dremio dotcom after this webinar. But today, we will be talking about data lake engines and data pipelines.

Justin Dunham:

On the next slide, I'm going to tell you quickly about how to ask questions on this webinar. There are a few buttons you'll see at the bottom of your Zoom window. The easiest thing, and the best way to make sure we see your question in a timely manner and get to it, is to use that Q&A button, and we will be answering questions at the end of the webinar. Feel free to ask as we go along, however.

Justin Dunham:


So with that out of the way, I'm going to spend just a couple of minutes talking about Dremio the company and our product, the data lake engine, and then I'm going to hand it over to Ryan. As a quick overview on us, Dremio has been around for a few years, and we are the data lake engine. We're based in California. Customers in all industries use our data lake engine, from Microsoft to TransUnion, UBS, NCR, and Software AG, who we'll touch on a little bit toward the end of this presentation; lots of big names all over the world are using the data lake engine to radically simplify their data architecture.

Justin Dunham:

We're also the co-creators of Apache Arrow, if you're familiar with that project, and Apache Arrow is the new standard for columnar in-memory analytics. We're seeing about four to five million-plus downloads a month on the Arrow project. So a very exciting thing for us to be a part of.

Justin Dunham:

On the next slide here I'm going to talk to you real quick before I hand it over to Ryan about exactly what a data lake engine is. So Dremio provides an opportunity to take advantage of your data lake storage directly, which opens up all kinds of possibilities for your data architecture. And the first thing that Dremio provides is lightning fast queries directly on that data lake storage.

Justin Dunham:

Ryan will talk a little bit more about this in a bit. But we obviate the need for a lot of ETL, data warehousing, all of those things, because what we've seen in the past is that folks need to add a lot of layers and a lot of machinations just to make data in their data lake storage useful. So we actually let people interface directly with that data lake storage, and we make it super, super fast. And we also add a self-service semantic layer, so we make it really easy for people using Tableau, Power BI, Python, R, and a whole range of your favorite data science and BI tools to access that data directly in your data lake, and for that to be performant, without having to worry about all the implementation details and where things are kept and so on.

Justin Dunham:

There are a couple other things that we provide as well that I'll talk about before I hand things over to Ryan. One thing that's exciting for a lot of our customers is that we also do live joins between your data lake storage and lots and lots of other database and storage services, to further reduce the need for ETL. So if you are interested in a multi-cloud strategy, use Dremio to do joins between S3 and Azure Data Lake Storage. If you have data stored in Oracle or SQL Server, use Dremio to join those databases to your data lake storage and still get lightning fast query speed.

Justin Dunham:

The same goes for non-relational databases, storage services, and data warehouses. So if you've ETLed some things into Redshift or Snowflake but you still have a lot of data in your data lake, Dremio can still help you with that architecture and provide a lot of these benefits to you.
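
To make that cross-source story concrete, here is a minimal sketch of what querying one of those joins through Dremio could look like from Python over ODBC. The DSN, credentials, and dataset names are hypothetical placeholders, and it assumes you have Dremio's ODBC driver and pyodbc installed.

```python
# Minimal sketch: running a cross-source join through Dremio from Python over ODBC.
# Assumes the Dremio ODBC driver is installed and a DSN named "Dremio" is configured;
# all dataset and column names below are hypothetical placeholders.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=Dremio;UID=myuser;PWD=mypassword", autocommit=True)

# One SQL statement joining a data lake source (S3) with a relational source
# (SQL Server), both registered as sources in Dremio.
query = """
SELECT o.order_id, o.amount, c.customer_name
FROM s3_lake.sales."orders.parquet" AS o
JOIN sqlserver_crm.dbo.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= '2019-01-01'
"""

df = pd.read_sql(query, conn)
print(df.head())
```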

Justin Dunham:


And lastly, one of the things that we really believe in at Dremio is openness and flexibility, and one of the things that we love about using Dremio directly on your data lake storage is that you continue to control the data. It's in your storage. You deploy Dremio on your infrastructure, and you continue to keep data in open source formats in your account or your data center.

Justin Dunham:

So that's a quick overview of what the data lake engine is, and I'm going to go ahead and hand it over to Ryan to talk about what we wanted to talk about in today's webinar. Ryan, over to you.

Ryan Murray:

Hey, everyone. Thanks a lot, Justin. For those of you who are planning to ask me about my thesis, it's been 10 years, so I'm not going to be able to help you very much with that. Anyway, before we go any further, let's start out with what we mean by a data lake, or a data pipeline.

Ryan Murray:

So a data pipeline, at least for the context of this talk, is simply the process that starts when data's created and ends when we start to extract business value from that data. That could be a lot of different things. One example is, say, a machine learning pipeline. You're going to be collecting a lot of pieces of data from a lot of different data sources, pulling all that together, transforming it, and munging it all together before putting it through the machine learning model. The end of that data pipeline is the calibrated machine learning model.

Ryan Murray:

For something like an IoT setup, you have millions or tens of millions of devices, and they're all creating measurements. Those measurements have to be collected, collated, possibly pivoted, and then eventually presented to downstream end users. So there, your data pipeline starts at the devices and ends at your downstream users.

Ryan Murray:

There are two things that these pipelines have in common, and that any data pipeline should have: they need to be automated and they need to be timely. The faster and the more automatically you can get this data, the better, more accurate, and faster you can make your business decisions.

Ryan Murray:

Unfortunately, there are two things that these data pipelines have in common that aren't as good: they're both complex, and they cost a lot of money. We can see that from a few of these diagrams. When you start looking at a data pipeline, you need to have knowledge of hundreds or even thousands of different services and technologies, you need to understand how all these technologies work together, and cater for how they fail together, and you need to spend a lot of time, a lot of effort, and a lot of money making sure all of these are orchestrated together.

Ryan Murray:


For me, that's really unfortunate. For everyone that's really unfortunate, because at the end of the day, data analysts and data scientists just want to get their data and start doing their work on it. And what they end up with is this complex and inefficient stack. We see this a lot. We have a data lake; this data lake could be a cluster on-prem, and nowadays we see a lot more data lakes off-prem, with blob storage like S3. The common pattern for these data lakes is that people just start pouring data into them. They start accumulating a lot of stuff, and no one really understands what's in them, or how to use them, or even necessarily what the data means. And you end up with what's commonly called a data swamp.

Ryan Murray:

So what do people do? The first thing they do is start putting data warehouses on top of this data lake. The data warehouse is a 30-odd-year-old technology, and it has a lot of baggage with it. We now have to copy our data into this data warehouse, which creates a lot of fragile jobs to get the data in, and then we end up paying twice for storage. And we have to pay for and maintain this middle tier.

Ryan Murray:

When that's done, our data scientists or analysts can at least access their data using SQL, but it's still not very fast. They still don't have a really clear idea of what's in the data warehouse, and they still need to wait to get new data into it. So what people usually end up doing is creating this third tier. This third tier is going to be full of data cubes, usually a lot of BI extracts, something like a Tableau extract or the like. And in the worst case, you start getting a lot of CSVs and spreadsheets all around.

Ryan Murray:

So the world you end up in is one where you don't have control of your data. You don't know what your data is. You have dozens of people crawling all over this architecture, trying to make sure it keeps running. And your business users still aren't very happy. Some of the data might be fast, but it still takes weeks or even months to provision new data. And there are still a lot of situations where the data pipeline falls over and the data isn't accessible.

Ryan Murray:


So not a very good picture. What can we do to make it better? Simply put, we just take all of that out and put Dremio in its place. It sounds a bit extreme, but Dremio can talk directly to a data lake, so you don't need the data warehouses and cubes and so on. You can write SQL in Dremio and have it go directly against your data lake files. You can then join data inside of Dremio with other databases, NoSQL or SQL, whatever you have. Then you can present this self-service semantic data layer to your users.

Ryan Murray:

So a machine learning engineer, data scientist, analyst, BI user, or someone like that gets to poke around the Dremio UI, which we'll look at in a little bit, search for datasets, and start pulling those datasets together. They're writing their own SQL against those datasets and then using them in their end-user applications. So it becomes a lot simpler. We can get rid of a lot of the ETL jobs and a lot of the extra technologies, and in most cases it will be faster.
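
As a rough sketch of what that self-service access can look like programmatically, here's how a user might submit their own SQL against a dataset in the semantic layer from Python over Dremio's REST API. The hostname, credentials, and dataset path are made up, and the /apiv2/login and /api/v3/sql endpoints are assumptions based on Dremio's documented REST API, so check the docs for your version.

```python
# Rough sketch: submitting SQL to Dremio over its REST API.
# Hostname, credentials, and dataset paths are hypothetical; the /apiv2/login and
# /api/v3/sql endpoints are assumed from Dremio's documented REST API and may
# differ between versions.
import time
import requests

BASE = "http://dremio.example.com:9047"

# Log in and grab an auth token.
login = requests.post(f"{BASE}/apiv2/login",
                      json={"userName": "myuser", "password": "mypassword"})
token = login.json()["token"]
headers = {"Authorization": f"_dremio{token}", "Content-Type": "application/json"}

# Submit a query against a virtual dataset in the semantic layer.
job = requests.post(f"{BASE}/api/v3/sql", headers=headers,
                    json={"sql": "SELECT * FROM marketing.clickstream_cleaned LIMIT 10"})
job_id = job.json()["id"]

# Poll until the job finishes, then fetch results.
while True:
    status = requests.get(f"{BASE}/api/v3/job/{job_id}", headers=headers).json()
    if status["jobState"] in ("COMPLETED", "FAILED", "CANCELED"):
        break
    time.sleep(1)

results = requests.get(f"{BASE}/api/v3/job/{job_id}/results", headers=headers).json()
print(results["rows"][:3])
```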

Ryan Murray:

So this sounds really great. How can we take this beyond theory and start proving that it actually works? I thought we would do a couple of use cases today. One of them is something that we've actually done at Dremio, so it's running in production as part of our daily processes. The other is a project I did a couple of months back.

Ryan Murray:

So for the first one, we're going to take a look at website analytics. This is a really common problem, especially in newer companies and startups, where you have your clickstream data. This clickstream data is data that's coming in from your website: you're tracking the actions, what users click on, how long they spend on the website, and how they interact with all the content on your website. There are a lot of reasons this is important. First and foremost, it allows us to make data-driven decisions about our website, so we can choose the content and the design based on how people interact with it best.

Ryan Murray:


We can also use this to assess the health of our business. We can see how often people are coming to the Dremio website, how they're interacting with Dremio, and how we can make Dremio more attractive to them from the initial interaction. This hopefully leads to lead generation, where we can use the data on how they interact with our website to tailor custom messages to them to help them engage with Dremio better.

Ryan Murray:

So this is a problem that we have at Dremio, and as I said, most people have it. Our clickstream data gets dumped into an S3 bucket. When we were looking around for how to get the data out of that S3 bucket and present it to our executives as an overview dashboard of how people are using the website, we started looking at how we could do this inside of AWS. So this is a common pattern for how it'd be done today without Dremio.

Ryan Murray:

So you have your initial S3 bucket on the left. You're usually going to have to do a bunch of transformations to it. You might do this in AWS Glue. For those of you who haven't used Glue before, it's a managed service from Amazon which helps you do ETL, and it's basically Spark underneath, so it's essentially a managed Spark service. You'll do a bunch of Glue jobs, and then you'll end up putting the result into another S3 bucket.

Ryan Murray:

At this stage, you've doubled your storage costs, and you've paid for this managed service for a certain amount of time to create this data. And all you've really done is enrich it: maybe you've flattened your data a little bit, maybe you've cleaned it up and filtered out some of the bad data. So what you're going to need to do now is pull up something like EMR or some other heavy-lifting data conversion tool. And what's going to happen there is more data crunching, more managed services, and at the end of it, you still get another S3 bucket.

Ryan Murray:

So at this stage, you now have at least three copies of your data. You need to start worrying about security concerns: who can see that data and who has permission to see it. You have a lot of copies of your data, and you have a lot of managed services to worry about. You have to make sure that all of these things are working together and not tripping over each other. Worst of all, you're not even actually able to query your data; it's still sitting inside an S3 bucket.

Ryan Murray:

So the most common answer to this problem is to use something like Amazon's Redshift. So that's another ETL job to put your data into Redshift, and now another copy of your data. Now you have this managed, proprietary tool where you pay for queries. Over the past couple of years we've helped people migrate off of Redshift, and in the process we've seen some of their bills and how they've changed. And I can tell you, if you don't know already, your Redshift bill can quickly spiral completely out of control.

Ryan Murray:


So you have this extra concern to worry about, aside from the number of services and the number of copies of your data: you have this really expensive query engine sitting in front of all of it. But let's hook this up to Tableau. We can now generate our dashboards and everything's great, except the performance isn't very good. We still have to pull data out of Redshift and transform it in Tableau.

Ryan Murray:

The common answer to that is to use something like a Tableau extract. You've now added another moving part, another thing that can break, and once again another copy of your data. But at least you're now able to get queries in Tableau. It might cost you a fortune and two full-time data engineers, but it's working. So what's a better way? What can we do differently here?

Ryan Murray:

Well, let's start by just putting Dremio in the mix. This is something that is actually live today at Dremio. Again, we start out with our clickstream data on the left. Now we're transforming it once, and this is a relatively lightweight ETL job. At this stage we're just extracting some fields from the JSON objects, flattening it a little bit, and then dumping it into another S3 bucket.
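
For a sense of what that lightweight step involves, it's the kind of thing you could express in a few lines of PySpark (or Glue, which is Spark underneath). This is only a sketch with made-up bucket paths and field names, not our actual job.

```python
# Sketch of a lightweight flatten/extract ETL step for clickstream JSON.
# Bucket paths and field names are hypothetical; the real job depends on your
# clickstream schema. Runs on plain Spark or on AWS Glue (managed Spark).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("clickstream-flatten").getOrCreate()

raw = spark.read.json("s3a://my-raw-clickstream-bucket/2019/")

# Pull a handful of fields out of the nested JSON and flatten them.
flat = raw.select(
    col("anonymousId").alias("visitor_id"),
    col("context.page.url").alias("page_url"),
    col("context.userAgent").alias("user_agent"),
    to_timestamp(col("timestamp")).alias("event_time"),
    col("event").alias("event_name"),
)

# Write back to a second bucket as Parquet for Dremio to read directly.
flat.write.mode("overwrite").partitionBy("event_name").parquet(
    "s3a://my-clean-clickstream-bucket/flattened/")
```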

Ryan Murray:

So we have done a relatively light ETL job, and we've created a second copy of the data. But at this stage we can point Dremio at the data. It's important to note we're not importing the data into Dremio; we're just showing Dremio where the data lives, and Dremio takes care of querying S3 directly. Then inside of Dremio we do all of our transformations and our cleaning and all the other stuff we need to do to get this data ready to display to our users. And then we point Tableau at it.

Ryan Murray:

Now here, the connection between Dremio and Tableau is fast enough that we don't need to do extracts. Dremio can query S3 so fast that we can use a live connection in Tableau. So this is great. Now we have one ETL job, Dremio, and Tableau, and we're done. This is an example of a dashboard that's being used every day by our executives; it's just the output of this data. It's not really important how it looks, but it is to say that this is working and used daily at Dremio.

Ryan Murray:


So I'd just like to pause for a second, because what I've said so far is that using Dremio, we can take all of these services away. You can cut your data stack in half, and we're going to be faster. So we're saying we can decommission half your stack by using Dremio, but how the heck are we going to be faster than this complex, very long ETL job? Well, there are a couple of things that we do here, and we'll walk through some of the components of what makes a Dremio query so fast.

Ryan Murray:

The first one, the easy one, is what's called the Columnar Cloud Cache. It's a simple idea, but it saves a lot of time. Basically, this is doing caching and prefetching: it's doing predictive pipelining, trying to anticipate what queries you're going to run on the data lake, pulling that data from the data lake, and putting it onto local fast storage like NVMe or SSDs.

Ryan Murray:

So once it's there, Dremio has effectively instantaneous access to that data, and it doesn't need to wait for Azure or S3 to come back. This is particularly important because some of these cloud blob storages can have extremely high latency; you can be waiting seconds to get some requests back from S3 buckets. So by doing this we can see an immediate ten times speed-up.

Ryan Murray:

Next we add in data reflections. I'll go deeper into data reflections later, but for now think of them as a really interesting mix of traditional database indexes, something sort of like a materialized view, and something sort of like data cubes. What you get out of them is transparent access to fast pre-computed data. Your users are still querying the underlying datasets, but the Dremio query engine is rerouting them to this pre-computed data that's living on fast storage.

Ryan Murray:

In most cases we're going to see a good 100-times increase; I've seen significantly more for the right use cases. What's quite nice about these is we also get rid of the extra complexity of managing your data cubes. The next big speed increase comes from Apache Arrow, which Justin mentioned before. The reason this is fast is that it's a columnar data format in memory.

Ryan Murray:

So we're able to take advantage of all the things in modern CPU architectures. We're able to do vectorization and take advantage of CPU caches and all kinds of other really cool, really interesting things to get another five-times speed-up on top of that. And finally, the newest kid on the block is Arrow Flight, which was officially released by the Arrow community, I think, maybe less than a week ago now.
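
If you haven't touched Arrow directly, here's a tiny example of what that columnar in-memory representation looks like from Python; the column names and values are just placeholders.

```python
# Tiny example of Arrow's columnar in-memory format from Python.
# Column names and values are placeholders.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    "room": ["kitchen", "office", "kitchen"],
    "temp_c": [21.5, 19.8, 22.1],
})

# Convert to an Arrow Table: each column becomes a contiguous, typed buffer,
# which is what lets engines vectorize work and share data without copying.
table = pa.Table.from_pandas(df)
print(table.schema)
print(table.column("temp_c"))  # a chunked array backed by contiguous memory
```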

Ryan Murray:

This is our replacement for ODBC and JDBC. Those are fairly old technologies, they're row-based, and they create a lot of copying. What Arrow Flight allows us to do is move an Arrow buffer between, say, Dremio and the client with zero copies, straight over the network. When done in parallel, this can give us well over a hundred-times speed-up. Currently this is only supported in Java clients and in Python, for example in Jupyter notebooks.
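
To give a feel for it, here's a rough Python sketch of pulling a query result over Arrow Flight with pyarrow. The endpoint, port, credentials, and the simple username/password handshake are assumptions about how the Dremio Flight endpoint is set up, so treat it as a sketch rather than the exact client code.

```python
# Rough sketch of an Arrow Flight client in Python (pyarrow.flight).
# Host, port, credentials, and the basic-auth handshake are assumptions; check
# the Dremio Flight examples for the exact handshake your version uses.
import pyarrow.flight as flight


class BasicAuthHandler(flight.ClientAuthHandler):
    """Minimal username/password handshake for illustration."""

    def __init__(self, username, password):
        super().__init__()
        self._basic_auth = flight.BasicAuth(username, password)
        self._token = None

    def authenticate(self, outgoing, incoming):
        outgoing.write(self._basic_auth.serialize())
        self._token = incoming.read()

    def get_token(self):
        return self._token


client = flight.FlightClient("grpc+tcp://dremio.example.com:47470")
client.authenticate(BasicAuthHandler("myuser", "mypassword"))

# The ticket carries the SQL; results stream back as Arrow record batches.
reader = client.do_get(flight.Ticket(b"SELECT * FROM iot.most_recent"))
df = reader.read_pandas()
print(df.head())
```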

Ryan Murray:

We expect to see a lot of other people start to adopt this as Arrow Flight becomes more mature. So that's a brief overview of how we can be so fast on the data lake. These four technologies are what allow us to do effectively sub-second queries on the data lake, where before you couldn't without massive stacks of complex pieces working together.

Ryan Murray:

So with that, I'd like to talk about our second use case. This is a fun little project that I did in the springtime. I had a few months off between jobs before I started at Dremio. I've been a huge IoT fan for quite some time; I've always been into the home automation thing, and for years I've been buying anything that looks exciting on Kickstarter. So for the past couple of years I've collected literally buckets full of sensors and Arduino boards and all kinds of other goodies. I said, "Now that I have some time off, it's finally time to put all these things into action."

Ryan Murray:

So I spent some time remembering how to solder and put all these things together. I now have sensors scattered all over the house. My girlfriend loves it, and I'm now able to tell you really fun things. I can tell you the temperature in every room in my house. I can tell you how many organic compounds get put into the kitchen every time I cook a meal. And I live in a flat in the UK, so I can tell you that yes, it is still raining. So I can tell you all these things and gather all this data, but what am I going to do with it?

Ryan Murray:

So the traditional way of doing these things in, say, Azure is something like this. This pattern looks very familiar from the last example: here we have our devices creating data, which gets pushed into Azure IoT Hub, and then IoT Hub writes it down to Azure's ADLS Gen2 for us. From there we need to transform our data. Again, this is very similar to before: we have some big Databricks jobs doing ETL pipelines for us, creating a copy of our data again. I've shown HDInsight in this example; it could very well be something like Data Factory instead. The point here is that, again, we're pulling data out of Azure Data Lake Storage, doing a lot of transformations on it, and then writing it right back down.

Ryan Murray:

So after a few steps we've managed to get it into something like a SQL Server data warehouse. At this point, again, we've created a lot of copies. We've opened ourselves up to exposing all of this data in a lot of places. For me, this is a hobby project, and I've had to pull in all of these things, and suddenly I have a few-hundred-pound bill when I just wanted to look at my IoT data. But at least now I can expose it from SQL Server into Power BI.

Ryan Murray:


Now, for my small project, that's fine; I'm done. But if we wanted to scale this up to thousands or even millions of devices, there's no way the connection between Power BI and SQL Server would be enough. So we'd immediately reach for something like Azure Analysis Services, and now we're back to the old problem of having to create and manage cubes. So we have a lot of moving parts, a lot of data flying around, a lot of services, a lot of cash, a lot of time. Same as before: how can we do this in a better way?

Ryan Murray:

Well, here's how we do it with Dremio; this is running in my personal Azure account. I'm creating data and sending it to Azure Event Hubs. Using Stream Analytics, I'm actually doing on-the-fly transformations before writing down to Azure Data Lake Storage. Once the data lands in Azure Data Lake Storage, I only have one copy of it, but it's clean and it's ready to be picked up by Dremio. Inside of Dremio, we'll have a bit of a look in a minute with the demo, but the idea is Dremio's going to do some transformations. We're going to create a few layers of transformations and then serve the data out to our dashboard again.
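
On the device side, the part that sends readings into Event Hubs is only a few lines. Here's a hedged sketch using the azure-eventhub Python SDK, with a made-up connection string, hub name, and payload.

```python
# Sketch of a sensor publishing readings to Azure Event Hubs.
# Connection string, hub name, and payload fields are made up; this uses the
# azure-eventhub (v5) client API.
import json
import time

from azure.eventhub import EventData, EventHubProducerClient

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=send;SharedAccessKey=REPLACE_ME",
    eventhub_name="iot-readings",
)

reading = {
    "device_id": "kitchen-temp-01",
    "sensor": "temperature",
    "value": 21.7,
    "ts": int(time.time()),
}

# Batch one or more readings and send them; Stream Analytics picks them up on
# the other side and writes the cleaned records down to ADLS.
batch = producer.create_batch()
batch.add(EventData(json.dumps(reading)))
producer.send_batch(batch)
producer.close()
```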

Ryan Murray:

For my dashboard, I've chosen Dash by Plotly, the open source Python framework. I chose this mostly because it's really fun: I can use the Flight connector from Dremio and really pull in all of our speed benefits. For me, a single-person app, this is going to run at or just above the free tier. So this is something that I myself can run. It's fairly cheap, it's fairly easy to run, and I don't have to do much.
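
The dashboard itself is a pretty small Dash app. Here's a stripped-down sketch of the shape of it, assuming you already have a helper that pulls a pandas DataFrame out of Dremio (for example over Flight, as sketched earlier); the helper module and column names are made up.

```python
# Stripped-down sketch of a Dash dashboard fed from Dremio.
# Assumes a get_readings() helper that returns a pandas DataFrame (e.g. via the
# Flight client sketched earlier); the helper module and columns are placeholders.
import plotly.express as px
from dash import Dash, dcc, html

from my_dremio_client import get_readings  # hypothetical helper

df = get_readings("SELECT room, event_time, temp_c FROM iot.weather_clean")

app = Dash(__name__)
app.layout = html.Div([
    html.H2("Home sensor dashboard"),
    dcc.Graph(figure=px.line(df, x="event_time", y="temp_c", color="room")),
])

if __name__ == "__main__":
    app.run(debug=True)
```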

Ryan Murray:

Of course, this is a relatively simple architecture, and it could scale almost infinitely; we'll take a look at that in a few minutes. Here are the results: this is what the dashboard looks like. I can hook that up from anywhere in the world, and here are some of the throughput stats for my event hub and the stream analytics and so on.

Ryan Murray:

So as I said, this is a really good way for a single person to play with their own IoT. What's really interesting is that we've partnered with Software AG. Software AG is a big player in IoT, and they have effectively the same architecture I showed earlier running in their software stack. They're pushing hundreds of thousands or millions of metrics a second from all over the world into their operational store, which is transformed and served to Dremio by their data hub.

Ryan Murray:

So this is something that was just recently announced by Software AG, and it's a really good partnership for us. If you're into the IoT stuff, let us know later and we can talk to you. So with that, I'm going to start the demo. Let's move over to Dremio. I'll start with just a brief look at how Dremio looks when you open it up the first time. You can see we have our sources and our spaces, we have a wiki page over here for each source and space, and we can see our queries and jobs.

Ryan Murray:

This IoT source is ADLS Gen2, and it's set up to point at the Azure data lake that we set up earlier. For now, I'll take a look at ... these are all the other sources that we can set up. We can connect directly to S3, or Elasticsearch, or on-disk storage if you prefer, and if you were to connect to something, this is the kind of form you'd have. You just fill this out, and it'll connect to Azure.

Ryan Murray:

Then you'll get a set of things like this. This purple means it's a PDS, or physical dataset, which means Dremio now understands how this dataset is structured and how to query it. This is just a folder; if we were to go in here, we'd see, well, nothing, but you'd normally see a pile of JSON documents or something like that. We can just take a look at this, and we immediately open up a SQL editor and a preview of our dataset. So we can start writing SQL against this, and we can run a preview to get a short preview of our documents.

Ryan Murray:

With spaces, we have something similar. In the spaces, we're going to have these green ones, and these are VDSs, or virtual datasets. What we have here is effectively a chain of datasets, which are just SQL statements built on top of other VDSs to sequentially clean and prepare the data. Take a look at the raw weather, for example; this is the final cleaned output, where we have the room, the sensor name, the value it saw, and the time it saw it.

Ryan Murray:

If we were to take a look at our graph, we'd be able to see how this dataset was constructed. We can see our weather app, we can see the columns in this VDS, and we can move back through our chain of VDSs to get all the way back to our PDS and then eventually our data source. And you can see here we've also done a join against this physical dataset living in my home directory, which is a small CSV file that translates sensor device IDs. Along with that, we also have this most-recent dataset, which is a very small one; it's just a pre-aggregation of the current live values.
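
To give a feel for what those virtual datasets contain, each step in that chain is just SQL. Below is a hedged Python snippet with the sort of statement one of those cleaning VDSs might wrap, with placeholder dataset and column names; the CREATE VDS syntax is assumed from Dremio's SQL support, so adjust for your version.

```python
# Sketch of the kind of SQL one cleaning VDS in that chain wraps: joining raw
# sensor readings against a small CSV of device names. Dataset paths and column
# names are placeholders, and the CREATE VDS syntax is an assumption about
# Dremio's SQL support.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=myuser;PWD=mypassword", autocommit=True)

create_vds = """
CREATE OR REPLACE VDS iot.weather_clean AS
SELECT d.room,
       d.sensor_name,
       r.reading_value AS "value",
       r.event_time
FROM iot_raw.readings AS r
JOIN "@rymurr"."devices.csv" AS d
  ON r.device_id = d.device_id
WHERE r.reading_value IS NOT NULL
"""

conn.cursor().execute(create_vds)
```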

Ryan Murray:

So let's take a look at our dashboard. This is the live dashboard running. When we refresh it, we've actually gone back to Dremio, executed the query, pulled back the results, and rendered them, and we can see how quick that is. We're able to get that speed through reflections, and you can see here this is a query that was running against that particular dataset, and it's running in less than a second because it's been accelerated.

Ryan Murray:

Now, I'd just like to take a second to fully appreciate what happened there. The raw dataset, this Pi IoT data living on Azure Data Lake, has something in the range of probably 10 million to 15 million records, and these are stored in hundreds of thousands of JSON documents. And because of Dremio, we've been able to create a SQL layer over top of that and execute queries on that particular dataset in less than a second.

Ryan Murray:

So that's going through 15 million rows, and we're able to pull the important data out of it, which in this case is only 29 records, but we can still do that in less than a second. And this is mostly because of the magic of our reflections. If we look at this reflection, we can see we've created a raw reflection on this dataset. All it's doing is taking all of these columns and writing them down to a compact Parquet file that we can understand and read very quickly, usually on local storage.

Ryan Murray:

So this is very much like a materialized view in a traditional database; however, you can use the materialized view without having to reference it specifically. Similarly, we can have aggregate reflections, and these are more like BI cubes, where we have dimensions and measures, and on those dimensions and measures we form all of the cube values we need to be able to quickly compute aggregates. So here we're getting all the power of something like Azure Analysis Services without having to manage the cubes ourselves.

Ryan Murray:

We just have this UI, and then Dremio takes care of refreshing and managing those reflections. These are some of the reasons why we can do such fast queries on such large datasets. So I thought I'd go back to the presentation ... Justin, you probably want to finish it off, and then we can start some questions.

Justin Dunham:

Yeah, sounds great. Well, before we get into Q&A: thank you, Ryan, so much; that was a wonderful overview, and we've got some great questions here that we're going to answer. We probably won't be able to get to all of them, but there's some good stuff here. So I want to talk a little bit about some things that you'll be able to do right after the webinar. You can actually deploy Dremio right from our site at Dremio.com/deploy, and we also have Dremio University for learning Dremio, and the Dremio community, which is staffed by Dremio engineers who you can ask all of your questions.

Justin Dunham:

I'll revisit these again just before we close out. But for now, I want to switch over to Q&A for a little bit. We've got some really, really good questions here. And one question we've gotten a couple of times, by the way, is about the recording. So I want to make sure that everybody on this call knows that you will be receiving a link to the recording and the transcript very shortly after the webinar is complete. So let me start off, Ryan; we'll tackle these together.

Justin Dunham:

Let's talk a little bit about how Dremio manages security and access control. I'll say a few things about that. Dremio has a ton of features related to this, from integration with AWS, to single sign-on with Azure, to role-based access control and data masking and all sorts of other things like that. That's one of the really helpful things about using Dremio as a layer here: Dremio lets you get very, very granular control over exactly who has access to exactly what data.

Justin Dunham:

Ryan, I don't know if you want to add anything there about sort of things you're seeing in the field and with customers.

Ryan Murray:

Yeah, I think we can briefly talk about what Dremio can do, and I'll guide you through some of it as well. There are a few stages of security that are interesting. First is the sign-on stage, where we can support LDAP or a lot of the other SSO paradigms; basically anything can be handled. We also take care of pushing down security. So if we're connecting to something like Hive, we can integrate with Kerberos so that we can do things like impersonation on a lot of the underlying datasets.

Ryan Murray:

Then inside of Dremio there are three layers of security. On our folders, workspaces, and datasets, we can control sharing. This can be limited to specific users or to specific user groups, depending on whether you have LDAP set up. So you're able to control who can see which spaces, which VDSs, which PDSs, and who can edit them. The other two features we have are row-level security and column-level security, and these are both done via the SQL interface.

Ryan Murray:

So you can tell Dremio that only this group is allowed to see column X, and all the other groups should have column X masked out; say it's a credit card number, and they'll only see X's. Or we can filter all rows for users and user groups who are part of such-and-such team. From that, we can control exactly which cells are shown and which data is shown in those cells, and we can also control access directly to the individual datasets and spaces.
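
To make that concrete, the masking and filtering Ryan describes are expressed as SQL in a dataset definition. Here's a hedged Python snippet showing the general pattern, with placeholder table, column, and group names; it assumes Dremio's is_member() group-check function and the CREATE VDS syntax, so treat it as a sketch.

```python
# Sketch of column masking and row filtering expressed as a VDS definition.
# Table, column, and group names are placeholders; is_member() and CREATE VDS
# are assumed from Dremio's SQL support.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=myuser;PWD=mypassword", autocommit=True)

secure_view = """
CREATE OR REPLACE VDS finance.payments_secure AS
SELECT
  payment_id,
  -- Column-level security: only the finance group sees the real card number.
  CASE WHEN is_member('finance') THEN card_number
       ELSE 'XXXX-XXXX-XXXX-XXXX' END AS card_number,
  amount,
  region
FROM warehouse.payments
-- Row-level security: non-finance users only see their own region's rows.
WHERE is_member('finance') OR region = 'EMEA'
"""

conn.cursor().execute(secure_view)
```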

Justin Dunham:

Great. Thanks, Ryan. One other question that we're getting here is: is it possible to add connectors that are not on that list that you showed, Ryan? I'll hand it to you in a second to add more, but I think one thing we're going to talk about to answer this question is Dremio Hub. Dremio Hub, which is at Dremio.com/hub, is both a marketplace for connectors, where we have, I think, about five or six up there now. We launched Dremio Hub a few weeks ago, and we have sources like Snowflake and Salesforce and Vertica and some other ones as well.

Justin Dunham:

On top of that, Dremio Hub is also a set of capabilities in Dremio that make it very, very easy to build your own connectors for your own data sources. I know that Ryan, for example, has done some work on a kdb+ connector, and I know that other folks are working on things. Creating a new connector, for a relational source especially, is literally just a matter of editing a YAML file and a couple of other things as well.

Justin Dunham:

Ryan, I don't know if you want to talk a little bit more about that piece.

Ryan Murray:

Yeah, I think you're absolutely right. With Dremio Hub, we have an SDK which allows you to build a connector from a JDBC driver. So if your source has a JDBC driver, then it's a matter of a few dozen lines of YAML for a config file, and that will fire up a connector. If you don't have a JDBC driver, it's fairly complicated. The driver that Justin mentioned for kdb+ is quite a hefty one, and it changes with every release. So I think it's quite a bit of work, but it's certainly doable, and I'm happy to talk more if people are interested in the non-JDBC drivers. But we've seen people at Dremio put these JDBC-based SDK connectors together in literally minutes and hours.

Justin Dunham:

Great. We've got a bunch of questions here Ryan, related to Apache Arrow and you mentioned that a few times. Can you just take a minute and talk a little bit more about the story there and some of the capabilities and what Arrow is really doing for us?

Ryan Murray:

Yeah, Arrow started a couple of years back, and as Justin mentioned, it's growing exponentially. I think because it has such broad application to data science and data scientists, it's really starting to take off. We can see here the kinds of people and the types of groups that are starting to use Arrow as the core of their product. We have huge support across programming languages. Personally, I'm a committer on Arrow, and I've been involved in some of the C++, Java, and Python stuff. But we have hundreds of people committing.

Ryan Murray:

As for some of the things that make it so fast for us: one thing that we released maybe six months or so ago is something called Gandiva. This is part of the Apache Arrow project. What it's doing is taking a SQL expression, a filter or a project or whatever it happens to be, and compiling it down into LLVM bytecode, and it's doing that just in time, on the fly. So it takes a simple expression, which in Dremio would be a job, but it can also be a Pandas operation or something.

Ryan Murray:

It will then compile that down to machine language before executing it. With that, you're immediately taking advantage of all the power of LLVM-based compilation, and you're leveraging vectorization and SIMD instructions. This could even potentially be applied to GPUs and all kinds of other really interesting projects. So these are some of the things that get the real bump out of Apache Arrow.

Ryan Murray:

The other thing is Arrow Flight. We're trying to sell this as ... not sell this, but this is a viable replacement for ODBC and JDBC. As Justin and I were talking about earlier, those were created around the time of Clinton and George Bush the first, so they're rather old standards by now, and we think there's a much better way. With something like Arrow Flight, we're able to transfer data frames from inside of Dremio to Jupyter notebooks, for example. We're able to do that in parallel, without having to copy memory around.

Ryan Murray:

There's no marshaling or data translation. It's just directly copied between one computer and another, and we're immediately able to leverage that power. Especially with something like Spark, you can find some work that's been done on a Spark adapter for Dremio online, and something like that is able to achieve hundreds-of-times speed-ups because it's able to talk to all the Dremio executors at the same time.

Justin Dunham:

Great. Yeah, Arrow Flight is especially exciting. As Ryan said, ODBC and JDBC are just from a kind of different universe; today we're using distributed storage, and there are all kinds of changes that have happened that really make Arrow Flight very, very exciting. You can see that even today with Jupyter connections to Dremio. So that's pretty cool. One thing we didn't talk very much about, and we have some questions on it here, is deployment options.

Justin Dunham:

One specific question we got was about running Dremio in Docker. Ryan, do you want to talk a little bit about some of the ways that people are actually deploying Dremio, and the scalability options there as well?

Ryan Murray:

Yeah, I think ... this is a really fun story. I personally have deployed Dremio on everything from large Kubernetes clusters down to a Raspberry Pi. There are tons of different options to deploy Dremio. My favorite option is using the image on Docker Hub. When I need to test something for a client, or if I'm working on site, I use Docker to immediately spin up an image, and I can immediately start testing.

Ryan Murray:

A lot of our clients nowadays are starting to use Kubernetes. We have a Helm chart for that already, and we have a lot of improvements coming on that Helm chart in the near future. I've seen people get a production-quality Dremio install set up and running using the Helm chart in a matter of hours, certainly less than a day. The first time I installed Dremio, I used the tarball, way back just after version 1.0. I just pulled down the tarball, extracted it, and had Dremio running within minutes. So I think there's a wealth of options to choose from.

Justin Dunham:

Yeah, and on those deploy pages on our site, you will also see links to marketplace templates; it's very easy and quick to deploy on AWS as well. And as Ryan said, we've got Helm charts and Docker images on Docker Hub and all kinds of stuff like that. So super easy to do and very, very flexible. The last question I'll pull here ... and we got a lot of questions, which is wonderful. For folks whose questions we didn't get to, shoot us an email, and you should also expect someone to reach out to you, because we'd love to answer all these questions.

Justin Dunham:

But the last question we'll field here real quick is about the impact of creating reflections. You might have a deployment where you've got a billion JSON documents in ADLS. Are those all being stored in RAM? Where are reflections actually being stored? What is the data format? What's the burden as far as storage and things like that?

Ryan Murray:

That's a really good question; thanks for asking it. In short, the reflections themselves are stored as Parquet files on disk, and which disk depends on the Dremio configuration. The recommended configuration would be to back them onto, say, Azure Data Lake or S3 or something like that. Then, using the Columnar Cloud Cache, you would actually start to cache those locally on disk, on NVMe drives or the local disks as well. So you have a couple of options for where and how those are stored.
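
Since reflections land as Parquet, which is an open columnar format, anything that speaks Parquet can read files like that. Here's a small pyarrow sketch with a made-up path, just to show what reading the format looks like in general.

```python
# Small sketch of reading a Parquet file with pyarrow; the path is made up.
# Parquet is the same open columnar format Dremio writes reflection data into.
import pyarrow.parquet as pq

table = pq.read_table("/data/example/reflection_sample.parquet")
print(table.schema)
print(table.to_pandas().head())
```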

Ryan Murray:

With the Columnar Cloud Cache, we make sure that the disks don't fill up and that the hottest data's ready to go. The impact is a really interesting question, because we've seen a number of places that have turned on hundreds or even thousands of reflections and haven't really seen a performance impact. What's really important is that you treat reflections sort of like you treat indices in an Oracle database.

Ryan Murray:

The intelligent placement of a single reflection can give you a 100- or 1,000-times speed-up, but putting a reflection on every single dataset may not help at all. It might actually make things worse, especially depending on the size of your cluster. You can get into a situation where your cluster is actually spending more time calculating reflections than it is answering queries. So it's a question that's really close to my heart, and if you reach out to me by email, I'd happily talk you through some more of it.

Justin Dunham:

Great. Well, thank you, Ryan. I think we're going to go ahead and close things up for now. Again, we have dozens of questions, which is wonderful; we'd love to answer them all, so please reach back out to us. You can reply to any email you've received from us, and you'll also get a follow-up email from somebody at Dremio. We would love to answer all the questions that you have.

Justin Dunham:

We really appreciate everybody being here, and as I said a few minutes ago, there are a few options for those folks who are interested in engaging more. In the meantime, go ahead and deploy Dremio; the community edition is open source, totally free to deploy, and very, very easy to deploy on AWS or on Azure, or on-prem for that matter. You can go to the deploy page for that.

Justin Dunham:

Dremio University is our online university that gives you free courses introducing how Dremio works, some of its features, and how to use all those things. It's a wonderful resource to check out; go sign up at university.Dremio.com. The last place I'll point folks to, if they're interested, is our community site. The community site is used for everything: support, questions about the product, questions that we weren't able to answer here. It's staffed by Dremio engineers and other folks here, and you'll get a quick response.

Justin Dunham:

So go check out the Dremio community as well. I think that's it. So we'll thank everybody for their time. Thank you so much for joining us today, and we look forward to continuing the conversation. Have a good one. Thanks, all. Bye.