Are we there yet? New options for moving your analytics to the cloud

 

Transcript

Tomer Shiran:

Let's start by talking about different approaches to migrating to the cloud. Many companies these days are looking at how they start leveraging the flexibility and elasticity, and some of the cost advantages, of cloud resources. And of course, with that, especially for more mature companies, come many challenges, because you may have different applications and data infrastructure that already exist in the organization. And so there are different approaches when it comes to how you move your analytics to the cloud, and analytics is one of the workloads that we see moving to the cloud faster than others.

And so, at a high level, there are three fundamental strategies when it comes to moving a workload like analytics to the cloud. The first one being a lift and shift strategy, the second is a cloud native strategy and the third one is what we call a hybrid cloud strategy.

cloud analytics options with Dremio

Lift and shift is basically the most obvious one: essentially running the same workload on the same infrastructure that you may run in a colo or in your own data center, but running that on one of the public cloud providers. Cloud native is more of an approach of starting natively from the ground up on the cloud and not leveraging the existing investments. And then hybrid cloud is kind of a combination of those two things and is most suitable for companies that already have an investment on premise today.

So we'll talk about each of these strategies and then talk about different approaches from a technical standpoint that can help you take advantage of these.

So what is lift and shift? That's the first of these three approaches. Well, basically what this means from an analytics standpoint can be, for example, re-deploying your Hadoop cluster on Amazon Web Services, on Azure, or on Google Cloud Platform. And really, what this involves is running the same software stack on the cloud infrastructure.

So if you think about it from a Hadoop standpoint, Hadoop consists of maybe 20 different open source projects that are running in harmony, or at least working with the goal of running in harmony, and this basically means running that same stack on EC2 instances, for example.

And the advantage of this lift and shift approach really, it's the least disruptive transition because you're already perhaps using that Hadoop technology stack, you know how to operate it, your users are familiar with that stack, they use the user interfaces that are there, you're using the data governance tools that are part of the Hadoop distribution and being able to run that in the cloud really doesn't change any of that except maybe some of the kind of the monitoring and the deployment approach. And this is, if you will, the devil you know, you're already familiar with the challenges that come with this type of environment.

The challenge here is that this is really the highest cost approach, because you are now running your Hadoop instances on a 24/7 basis in the cloud, and of course compute is what's most expensive when it comes to running in the cloud, as opposed to some of the cloud storage services, which tend to be very inexpensive.

And this approach really doesn't benefit from everything that the cloud has to offer, right, beyond of course the basic hardware, server infrastructure that you're now getting from the cloud provider.

And then there is additional complexity when it comes to monitoring and provisioning, and how you do that versus what you're used to in your existing data center. So by and large, this is the least disruptive approach but also the most expensive.

On the other end of the spectrum, we have the cloud native approach which involves kind of rebuilding your data lake using cloud services that are provided by the cloud provider, by Amazon or Microsoft or Google. And so what does that look like? Well, it's running for example Amazon Elastic MapReduce and leveraging S3 for the storage layer and you know really provides similar functionality to what you may have on premise but different APIs of course.

And the advantage here is that you're really taking advantage of all the benefits of the cloud: the elasticity, the fact that you can spin things up and down and pay for only what you use, down to the minute level of granularity these days. Another benefit here is that these services are constantly improving, so you're benefiting from improvements to EMR, and from services like Athena, Redshift and Spectrum and all these different services.

But the problem, the challenge here for most companies is that this is of course the most disruptive transition that you can do because you're now not leveraging your existing ... any of your existing investments or tooling that you have around your Hadoop, your on premise data lake and you're kind of rebuilding everything from scratch and that involves perhaps a different approach to ETLing the data, certainly different user interfaces for dealing with things. It's very different.

And so often this works great when you're kind of starting from scratch, maybe you're a startup that's just kind of building things from the beginning and it can make sense to use the cloud for these different services for your architecture exclusively.

Another challenge here is cloud lock-in. Of course we all know that there are a few different cloud providers these days that we can choose from, and each of them has its strengths and weaknesses. One of the biggest challenges, if you've ever tried to move between them, and I actually experienced it once myself, is that moving from one cloud provider to another is extremely challenging. Once you start using these proprietary services from a specific cloud provider, that really makes it almost impossible to then transition to a different cloud, so that becomes a big challenge.

And then, finally, this approach depends on having kind of a strong team of developers that can focus exclusively on building this architecture for the cloud. So if you were traveling recently in various airports, you may have seen some of the ads that Amazon is running for example where you see that kind of whiteboard and I think it says that builders build something like that, which really shows you kind of what the approach is to these types of cloud native architectures where you're kind of basically gluing together a variety of different services, provided by the cloud provider. And so that can be challenging in some cases.

Hybrid cloud is kind of the ... probably the best approach for most companies. And maybe the only realistic one for most situations where it's not a startup that's starting from zero. So what is hybrid cloud? It's a mix of on-prem and cloud services, so you're looking to kind of extend and start to benefit from the advantages of the cloud and leveraging kind of the elasticity and the capabilities of the cloud.

But realizing that you of course have already invested many years and have a lot of infrastructure running on premise. So how do you take advantage of the cloud while still living in the world that you kind of have to live in?

And this kind of approach allows you to evolve the balance of workloads between on premise and cloud over time, so you can gradually migrate more workloads to the cloud, or maybe keep some kind of balance where you're running things where it makes sense to run them. Maybe the workloads that run 24/7 and don't have as much spikiness or elasticity to them, you can run those in the on-premise infrastructure where it's less expensive, and the more spiky and elastic workloads, you can run in the cloud.
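
The placement argument above is at bottom an arithmetic one, and a rough sketch can make it concrete. Every number below is a made-up placeholder (the hourly rates, node counts and hours are assumptions for illustration, not real prices from any provider); the only point is the shape of the comparison between always-on capacity and capacity paid for only while it is in use.

```python
# Rough cost sketch. ALL figures are hypothetical placeholders, not real prices.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float, nodes: int, hours_running: float) -> float:
    """Cost of running `nodes` machines for `hours_running` hours at a flat hourly rate."""
    return hourly_rate * nodes * hours_running

# Assumed rates: cloud compute costs more per node-hour than amortized on-prem hardware.
CLOUD_RATE = 1.00
ONPREM_RATE = 0.60

# A steady workload keeps 10 nodes busy all month in either environment.
steady_cloud = monthly_cost(CLOUD_RATE, 10, HOURS_PER_MONTH)
steady_onprem = monthly_cost(ONPREM_RATE, 10, HOURS_PER_MONTH)

# A spiky workload needs 40 nodes, but only for 50 hours a month.
# In the cloud you pay for those 50 hours; on premise the 40 nodes
# have to exist (and be paid for) the whole month.
spiky_cloud = monthly_cost(CLOUD_RATE, 40, 50)
spiky_onprem = monthly_cost(ONPREM_RATE, 40, HOURS_PER_MONTH)

print("steady:", "on-prem cheaper" if steady_onprem < steady_cloud else "cloud cheaper")
print("spiky: ", "cloud cheaper" if spiky_cloud < spiky_onprem else "on-prem cheaper")
```

With different assumed rates the crossover point moves, which is why the balance between the two environments is something to revisit over time rather than decide once.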

And so the advantage here is you have the maximum flexibility, there is no big bang event where you have to transition from something that exists to something completely different with of course all the challenges that come with that. You know, you can adapt as these cloud services continue to mature and have more and more capabilities, you can start to leverage more and more things there.

And the disadvantage here, and this first one here can be seen as an advantage as well, but there is less of a forcing function for moving to the cloud, so it doesn't force you ... you know, sometimes a forcing function can actually make you do something, so of course this doesn't cause that. And then you kind of have to operate in two paradigms simultaneously. But like I said, for most companies this is probably the only realistic or the best realistic approach that exists.

So let's talk a little bit about how we're approaching this at Dremio in terms of the open source project that we've created and how that can help companies when it comes to balancing their analytics workloads across multiple environments.

So our perspective at Dremio on the market is that ... you know, there are two trends happening right now in the industry. The first one is that analytics on modern data is incredibly hard, just because it's no longer a world where you can have all your data in one relational database. You know, these days we have data lakes and we have NoSQL databases, and just the size and complexity of the data makes it very hard to deal with and makes it very hard to make this data accessible to the users who want to consume it and want to derive insights from the data.

And then you combine that challenge with the fact that the users these days, in the business, the business analyst, the data scientist, they have an expectation of being able to do things on their own, with their own hands. They don't wanna wait. They've grown up with Google and being able to get instantaneous responses to questions, and with smartphone applications where they can book travel within two minutes on their phone without waiting on anybody.

And when they come to work, of course they have that same expectation, and they want and demand more and more self service and access to data on their own, without having to be dependent on somebody else to do that. And when you combine these two forces, it introduces a real challenge for companies, because the technology stack in the world of analytics infrastructure really has not evolved very much over the last 20 years. So we have data in a bunch of places, which we have here at the bottom, and you have the tools that the users want to use, and for most companies, when I ask them which BI tools they use, they respond with, "Yeah, all of them. Many of them. We have five different tools." Et cetera.

And the approach here, really, is that you have no choice in this world but to first ETL the data, and most companies will ETL the data into kind of a data lake, using S3 or Azure Blob Storage or Hadoop. And that's a lot of custom development, because the schemas here at the bottom are changing and evolving over time, especially with some of these newer data sources that may not even have schemas, like MongoDB or flat files or JSON files in Hadoop.

And you know, doing analysis directly on a data lake is usually too slow for an end user, so you end up ETLing a subset of that data into a data warehouse like Redshift or Teradata, Vertica, Oracle, SQL Server, et cetera. And then that's not fast enough for most people who want to consume the data, so then the company goes and creates cubes, or pre-aggregates the data into aggregation tables in the data warehouse, or creates these BI extracts inside of a BI server, for example.

And then finally, you get the performance that you want here at the top, but this comes with many challenges. For one, you now have over a dozen different copies of the data, different pieces of the data floating around the organization, many of them ungoverned, which introduces lots of governance and security risks.

You also have a challenge here where just keeping this pipeline working can be extremely hard, and keeping it reliable can be hard. And then of course, once your stack looks like this, it becomes impossible for the users at the top here to be self-sufficient, right? Anything beyond, "Okay, I want to do a slice and dice of a specific cube," or, "I have a question on a specific table," anything more complex, or that they didn't plan for, that IT did not plan for in advance, cannot be answered without basically an engineering project to go ETL data into a single source and expose that to the user at the top.

So that's kind of what the world looks like for most companies, and it's not something that they like. And so the reason we actually started Dremio and built this open source technology was that we thought that there has to be a better way. If most companies now realize that data is one of their most significant assets, you know, there has to be a way to allow that asset to be utilized by the broader audience within these organizations, right? It can't be something that only a Scala developer can leverage, you know, the way we've seen over the last few years with some of these big data deployments.

So what do I mean by a better way? Well, for one, it would have to work with any data source. You know, we have a variety of data sources where data lives today, and that's continuing to evolve, and five years from now we'll have a new set of data sources that are popular. It has to work with any BI tool, whether that's something like Tableau or Qlik or Looker or Power BI, or any one of these data science tools like R and Python.

We think that there has to be a better alternative to ETL, data warehouses and cubes. We think that self service and collaboration are the key to empowering the data consumer to be independent. And then, in the world that we live in today, data is often very big, it's sometimes terabytes or even petabytes of data, and the user who's in a BI tool like Tableau, when they drag and drop with the mouse, they don't care how big the data is. They need it to come back in a second or two, at most, because otherwise the experience is extremely frustrating and they don't wanna use it. And then finally, like I said, open source is something that we're big believers in.

So this is what Dremio is. We sit in between where your data lives today, whether that's a single data lake or a combination of different data sources, and the tools that your users use, and we provide the capabilities and the functionality that's needed to make that data available to the users of these various tools. That functionality includes the ability to find the data that you're looking for, and it includes the ability to curate the data and create new virtual data sets without creating copies of the data, so unlike, say, a data prep tool, there are no copies involved here.

This includes the ability to accelerate queries, so one of our unique technologies is what we call data reflections, and those help us provide orders of magnitude of speedup on these queries. And then finally, there's that data virtualization aspect of being able to connect to multiple sources, being able to join data across those sources, and being able to push down as much processing as possible into the underlying source when we're not leveraging Dremio's integrated cache and indexing technology.

And we call this tier self-service data, which is really a new tier in this world of analytics. It's, you know, a much more agile approach to making data accessible and available to the data consumer than having that entire stack of different technologies that have to be glued together.

So let's look at what new options are introduced here. First of all, I want to emphasize a few technology-related aspects that we'll be talking about in a second, and explore how they help us with our migration to the cloud. The two I wanna emphasize here are ... well, we'll start with native push-down. So on the bottom left here, one of the things Dremio does is push down the processing into the underlying source, and that includes sources that don't support SQL: you know, MongoDB's query language and aggregation pipeline, or the Elasticsearch JSON query language and the Painless scripts that it supports.
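
To make the push-down idea a little more tangible, here is a minimal sketch of what translating a SQL-style predicate into a MongoDB query could look like. This is not Dremio's translator; the function names and the `city` / `name` fields are hypothetical, and only the MongoDB operators themselves (`$in`, `$match`, `$project`) are real. The point is that the filter is expressed in the source's own language, so the source does the filtering instead of shipping every document across the network.

```python
from typing import Any

def to_mongo_filter(column: str, values: list[Any]) -> dict:
    """Translate `column IN (values)` into a MongoDB filter document.

    Hypothetical helper for illustration; a single value becomes a plain
    equality match, several values become an `$in` match.
    """
    if len(values) == 1:
        return {column: values[0]}
    return {column: {"$in": list(values)}}

def to_mongo_pipeline(column: str, values: list[Any], projected: list[str]) -> list[dict]:
    """Build an aggregation pipeline that filters first, then keeps only the needed fields."""
    return [
        {"$match": to_mongo_filter(column, values)},
        {"$project": {name: 1 for name in projected}},
    ]

# Roughly: SELECT name, city FROM businesses WHERE city IN ('Las Vegas', 'Phoenix')
pipeline = to_mongo_pipeline("city", ["Las Vegas", "Phoenix"], ["name", "city"])
print(pipeline)
```

A real translator also has to decide which parts of a query a given source can execute at all, and run the remainder itself.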

And then of course there are different dialects of SQL. So Dremio is able to push down the processing into the underlying source, and of course that helps make sure that as much processing as possible happens close to the source rather than remotely. And then the second aspect is data reflections. That's actually something that Dremio pioneered. You could think of it logically as similar to indexing in a database: when people have an index in a database, the user who's querying that database does not need to think about it, they just query the database and the database automatically leverages that index.

Of course, in our world of analytics, and especially analytics across very large data sets, on distributed systems and across your sources, indexes are not the right physical representation, but the concept is similar. And what these data reflections are, basically, is ... we can maintain different perspectives of the data, different representations of the data, on S3 for example, or on Hadoop. And those could be various aggregations of the data, with different dimensions and measures; they can be the raw data sorted by specific columns and partitioned, maybe by different sets of columns, maybe distributed by some of the columns.

So different representations of the data. The optimizer is able to leverage these data reflections automatically for an end user without the end user needing to think about what exists, what's materialized, what's not. The user can just think about the logical world where you know, all the data is exposed to me, I can do whatever I want with it.
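
As a loose illustration of an optimizer choosing a materialization on the user's behalf, the toy sketch below picks the smallest pre-aggregated representation that covers a query's dimensions and measures, and returns nothing when the query has to go to the raw source. It is an assumption-laden simplification and not Dremio's algorithm: the `Reflection` class, its fields, and the sample names and row counts are all hypothetical, and it assumes every measure can be re-aggregated (sums and counts, for instance).

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class Reflection:
    """A hypothetical pre-aggregated copy of a data set."""
    name: str
    dimensions: frozenset  # columns the data is grouped by
    measures: frozenset    # columns that were aggregated
    row_count: int         # rough size, used as a cost proxy

def choose_reflection(reflections: Iterable[Reflection],
                      query_dims: Iterable[str],
                      query_measures: Iterable[str]) -> Optional[Reflection]:
    """Return the smallest reflection that covers the query, or None to fall back to the source."""
    dims, measures = set(query_dims), set(query_measures)
    candidates = [
        r for r in reflections
        if dims <= r.dimensions and measures <= r.measures
    ]
    return min(candidates, key=lambda r: r.row_count, default=None)

REFLECTIONS = [
    Reflection("by_year", frozenset({"year"}), frozenset({"tip_amount", "trip_count"}), 10),
    Reflection("by_year_month", frozenset({"year", "month"}), frozenset({"tip_amount", "trip_count"}), 120),
]

# The user only states the logical question; the choice of reflection is automatic.
chosen = choose_reflection(REFLECTIONS, {"year"}, {"tip_amount"})
print(chosen.name if chosen else "no reflection covers this query; query the source")
```

Both sample reflections can answer a per-year question, and the sketch prefers the one with fewer rows to scan. A question grouped by a column that no reflection contains falls through to the source.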

So all this is very different from what we see today, where the user is responsible for picking the best optimization. Think about the world as a logical model, a metadata layer if you will, and then physical optimizations where data's pre-aggregated or sorted in different ways. In the technology stack that we have today, when we're ETLing or using data warehouses and cubes and BI extracts, it's really up to the user to pick the best optimization of the data to get the performance that they want.

And of course, you know, in any real world size organization we have maybe thousands of data sets and many different systems, it's impossible for the user to pick the right optimization of that data.

At Dremio, we believe that the user should not have to do that. The user should just have to think about the logical kind of model, what they want to do with the data. And the system can automatically, in a much more efficient way, pick the right optimization of that data in order to accelerate the query that they're doing.

So what does that look like in the context of moving your analytics to the cloud? Well, on the left hand side here you see users here, in this case let's say they may be using Qlik, Tableau, Power BI, Looker, one of these BI tools. And they're connecting to these data sources on premise and what happens in the context of the cloud is, they could be used ... you can have Dremio sitting in between, running in the cloud and sitting in between these data sources and what the users want to do. So that's what you see here where Dremio is basically offloading, accelerating and enabling data curation and lineage and governance.

I should have mentioned that one of the things we believe in strongly is that because of how things work today, where you have all these different point solutions that have to be glued together, there is really no governance in most companies, because users are downloading data, they're sending it to others in spreadsheets, they're extracting it into disconnected systems. And we believe that by having a single layer where the users can achieve what they want, in an IT-governed, IT-sponsored system, you can then have the visibility into what people are really doing with the data.

That was a sidetrack here but this is what we call kind of the accelerated hybrid approach. The mature hybrid approach is really when you have these users now that are using let's say Tableau on their desktop or maybe a Tableau server or Tableau cloud or one of the other BI tools you know, like Qlik and Power BI, connecting to Dremio which is running in the cloud. Dremio is now connecting to both data sources that are on premise as well as data sources that are running in the cloud. And so by cloud I mean for example let's say S3 as an example of that. You may be running other sources as well in the cloud. But you may also have data sources that are on prem and so Dremio can run in the cloud and connect to these various data sources and provide that offloading and acceleration layer so that not every query has to hit the on premise data source.

So this is really where Dremio's reflections come into play, where the SQL query that Dremio receives doesn't necessarily have to get pushed down into the underlying source, because when we have reflections, which may for example be persisted on S3, we can leverage those to satisfy the query as opposed to going back to the source system for every query. So it's a big speedup, rather than going and fetching the data over a slower network every time.

So with that, I want to jump into a demo and show you what the system looks like, and again, you can go to the website and download this and play around with it. So what I have here is ... this is a small cluster, it's running in the cloud, and I have my data sources on the bottom left, so you can see that we're connected to various Elasticsearch clusters, we are connected to a MongoDB database, there is actually an Oracle database here, Postgres, and you can see the S3 buckets. These are actually S3 buckets that various people at Dremio created for their own use, that we've connected to. We have some SQL Server systems and some things that I'm actually not sure what they are.

Adding a new source is as simple as clicking this plus button and looking at the list of sources that we support today, and we're constantly adding more sources, so if there's something that you have and you don't see it as already supported in Dremio, please feel free to reach out to us and we can consider adding that.

What you see here at the top left, these are the data sources ... sorry, data spaces. Spaces are a place where users can collaborate and create new virtual data sets. And so that's kind of how people interact with data in Dremio: they create new virtual data sets that can then be shared with their colleagues, and they can build on top of each other. And then every user in the system has their own virtual data set. Oh sorry, their own space. And within their own space they can even upload spreadsheets and files, so you can imagine a scenario where the organization has a very large data set, maybe with a lot of events related to customers, and as an individual rep I maybe have an Excel spreadsheet with a list of my 20 customers, and so I can then create a new virtual data set that basically is a join between my spreadsheet and the multi-terabyte data set with all the customer events.

We can look at a few simple examples here of what's possible. So let me create a new space here. And just for the purpose of this webinar, let's call this space the webinar space, and we're gonna allow all the users to access this. Of course, it is possible to restrict that. So we see here, we have this new space called webinar and I'll pin it to the top. It was already pinned. And you can see here that there are no virtual data sets right now inside of this space. And so let's go and access some of these data sets.

So here I have this MongoDB database and inside of that I have a MongoDB cluster, inside of that I have a yelp database and various collections. This is data related to businesses. And what you can see here is I have for every business the business ID column, the address, some kind of JSON structure related to the hours in which it's open. See there, it's a map, based on the icon, whether it's open right now, the categories of the business and so forth.

And if you look at the top left, you'll see that this is actually the name of the dataset. So this one is called Mongo.yelp.business. It's a physical data set, meaning it's in one of the source systems that we're connected to, that's why it's purple. And I can start playing with this data, I can say, "You know, I don't need this business ID column, I'm gonna drop that." And maybe I look at it and say, "You know what, the city here, I only really wanna look at a subset of the cities, so let's see what I have here."

Right now we're looking at the cities by values, I can also kind of filter it out by patterns or just custom conditions. So let's say I'm interested in cities that are in the desert, okay. So Las Vegas and Phoenix, I'm gonna click on apply and you can see the live preview here actually at the bottom, you'll see that you've now filtered it to Las Vegas and Phoenix. You can also see that categories here, every business is belonging to ... or every business has potentially more than one category and I can, if I wanted to do a BI analysis, that's not very helpful because BI tools don't deal with lists very well so I can flatten the list by clicking on the unnest function and I can rename this column.

And what's really nice is, while there are various things that you can do from a UI, you know, in terms of visual interface, sometimes you'll run into something that's not possible there and you can actually leverage the SQL editor here and define the data set in Dremio using the full range of ... the full spectrum of SQL. So here you can see ... you know, what we've done is we've filtered with city in Las Vegas and Phoenix, we've aliased categories as category after flattening it. So we've done all these transformations visually but that's all represented at the end of the day in the SQL statement.
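
For readers who want to see what those visual steps amount to, here is a small sketch that applies the same three transformations to a list of records in plain Python: drop the business ID, keep only the two cities, and flatten the categories list into one row per category. It is only an illustration of the logic, not the SQL that Dremio generates, and the field names (`business_id`, `city`, `categories`, `name`) and sample records are assumptions rather than the real Yelp schema.

```python
def filter_and_flatten(businesses, cities):
    """Keep businesses in `cities`, drop business_id, and emit one row per category."""
    rows = []
    for business in businesses:
        if business["city"] not in cities:
            continue  # the city filter
        for category in business.get("categories", []):
            # Copy every column except the dropped ID and the list being flattened.
            row = {k: v for k, v in business.items()
                   if k not in ("business_id", "categories")}
            row["category"] = category  # the flattened value, renamed to `category`
            rows.append(row)
    return rows

# Made-up sample records for illustration only.
sample = [
    {"business_id": "b1", "name": "Cafe", "city": "Phoenix",
     "categories": ["Restaurants", "Coffee"]},
    {"business_id": "b2", "name": "Shop", "city": "Toronto",
     "categories": ["Retail"]},
]

for row in filter_and_flatten(sample, {"Las Vegas", "Phoenix"}):
    print(row)
```

A business with two categories becomes two rows, which is the shape a BI tool needs in order to count businesses per category.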

And I can save this new data set in our webinar space, so I can call this the categories data set. So now we're looking at webinar.categories as the name of this data set, okay? You can also open this up in a BI tool, and we make it really easy with tools like Tableau and Qlik, let me see here, Qlik Sense, Power BI. With a single click, you can open the BI tool on your desktop, on your laptop in this case, with a live connection to the Dremio cluster. So you can use any tool, you can go in with MicroStrategy, connect to Dremio, and just establish a live connection. We make everything look like it's part of one giant relational database.

So what I'm doing here is I'm connecting Tableau to the Dremio cluster and you can see for example categories is that flattened ... that was that array that we flattened, using the unnest menu option. And I can see what is the most common business category, sort that. I can see here that restaurants are the most common business category in this data set.

So it's a simple example, I didn't have to export data from Mongo into some other relational data warehouse and talk to the engineering team to do that, I was able to do that all by myself and I now have a virtual data set webinar.categories that I can share with other people and they can access as well.

Actually, if I click on webinar here, you can see that we have this new data set called categories inside of this webinar space. Okay? So let's look at another example here where I can join data across disparate data sources. Here I'm going to look at this data set of all the reviews on Yelp. And in this data set, you can see here in the text field, for example, "I'm very disappointed in the customer service. We ordered Reuben's and wanted coleslaw instead." Okay. So this column called text has the actual text of the review. Again, you can see elastic5.yelp.reviews is the name of this data set.

If I open up the SQL editor, you'll see that this is of course just a select star from elastic5.yelp.review. And we don't actually pull everything into the web browser, that would be pretty painful, but we show you the first x number of records, and if you scroll we'll fetch more.

But if you know SQL, you can actually use, for example, contains, which we support. It's a free text search function in SQL, and any Lucene expression is allowed here. So this is actually pushing down an expression that filters the reviews based on the word amazing in the text field. So now these are all reviews that have the word amazing in the text field.

Now let's say I wanted to join this. Unfortunately I can't see the name of the business. You know, there's a business ID but that's not very helpful, there is a user ID, there is a review ID, but as a user I can't see which businesses these things belong to. But if I click the join button here, I can now join with other data sets. And in this case, the system is automatically recommending things I might wanna join it on. So here, we're joining it on the business data set in MongoDB. I'm gonna accept this recommendation. This is all based on user behavior, so as BI users are joining things, or maybe people in the Dremio curation interface are doing things, we are learning more and more about the relationships between different data sets and what people are doing with them.

So I can save this new data set in our webinar space. Scroll to the bottom here and we will call this the amazing reviews. And so now I have a new data set which is a join between data in Elasticsearch and data in MongoDB, and I can query this data set inside of my BI tools. You can also see the history here of the transformations, and you can click on the graph and see the lineage between the different data sets. Here I have elastic5.yelp.review and mongo.yelp.business. Those are the parent data sets for this data set. If I click on one of them, that moves to the middle and I can see all the things that are built on top of that.
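
The cross-source join in this part of the demo can be sketched in a few lines of plain Python. This is a simplification for illustration, not how Dremio executes the query: the two input lists stand in for rows fetched from Elasticsearch and MongoDB, the field names are assumptions, and the substring check is only a crude stand-in for a Lucene-style text search (which matches on tokens, not substrings).

```python
def amazing_reviews(reviews, businesses):
    """Inner-join reviews mentioning 'amazing' to their business by business_id."""
    # Build a lookup table on the join key from the (smaller) business side.
    by_id = {b["business_id"]: b for b in businesses}
    joined = []
    for review in reviews:
        if "amazing" not in review["text"].lower():
            continue  # filter first, so fewer rows reach the join
        business = by_id.get(review["business_id"])
        if business is None:
            continue  # inner join: skip reviews with no matching business
        joined.append({**review, "name": business["name"], "city": business["city"]})
    return joined

# Made-up sample rows standing in for the two sources.
reviews = [
    {"review_id": "r1", "business_id": "b1", "text": "Amazing food"},
    {"review_id": "r2", "business_id": "b1", "text": "Slow service"},
    {"review_id": "r3", "business_id": "b9", "text": "amazing view"},
]
businesses = [{"business_id": "b1", "name": "Cafe", "city": "Phoenix"}]

for row in amazing_reviews(reviews, businesses):
    print(row)
```

The result carries the business name next to each review, which is what was missing when looking at the reviews data set on its own.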

So here you can see this positive reviews data set is built off from the reviews and if I click on the ... you can see next to the jobs, there is always a link so I can see for example what are the 10 jobs or queries that people have done on this data set and you can see that there are actually ajay and myself and Dremio have been using this dataset. So you can see who is doing what with the data.

And that makes it easy to see kind of what's going on, you know when people are accessing data that I wasn't expecting them to access, maybe as an IT team we should double check that that data is correct and clean.

And so finally, let me show you what data reflections allow us to do. Data reflections, again, are a way to accelerate queries on data sets, and they also serve as a caching layer. So for example, if Dremio is running in the cloud and you have some on-premise data sources, and the network maybe has a higher latency, or maybe you don't have as much throughput as you'd like between the cloud and your data center, you can create a reflection on the Dremio cluster, and we actually store those inside of S3 or Hadoop, whatever store you select.

And then these reflections allow us to offload the need to go back to the data source for every query. So as an example here, let's say I'm an analyst and I remember there's some column somewhere that has the word MTA in it, so I can use the built-in catalog and find that this New York City Taxi data set is what I was looking for. So this is New York City Taxi dot trips, a data set that has over a billion records. And it has a drop-off time and a pickup time for the taxi trips, the number of passengers and so forth. And actually, just using a SQL engine on this data set, like a SQL-on-Hadoop engine, you're looking at somewhere between six and 10 minutes for every query.

So let's look ... let's jump into this data set, we're gonna connect with a live connection again from a BI tool and again, it doesn't matter which BI tool, I just happen to have this one installed, and you can connect to the Dremio cluster with a live connection, the New York City Taxi trips data set is the one I'm connected to. And I can start playing with this data and one of the things you'll notice as I start to drag stuff here is that it doesn't take 10 minutes with Dremio to analyze this data. And I just counted over a billion records, I can aggregate on the drop off time and see how many taxi trips there were per year. And so you see that it's about 170 million taxi trips per year and that's kind of been going down in the last few years, maybe that's Uber and Lyft that have been causing more people to not use yellow cab taxis, which is what this data set is.

I can look into tips, so how much are people tipping. So if I drag that here, make the color more clear. Red and green. It's good for the holidays. So I'm looking at the ... actually I wanna look at the average. So it's looking at the average here and you can see that the average tip amount has been going up from 2009 to 2014. So you may recall, 2008 was kind of the great recession and so the economy's been improving over the years and people have been more generous with their tips.

And we can also slice this by months and see which months have higher tips. So actually interestingly, the end of the year tends to be the months where people tip more when they take a taxi ride. So we're kind of in a good spot this month with November and you can see, beginning of the year, people maybe spent all their money on the holiday season and they're kind of back to work, they're not as generous anymore. And in July, maybe we have a bunch of families in New York, they're not tipping as much as business riders.

I can look at other slices, maybe month and year combined and see kind of what changes have been going on over time and look at it by month. So I can do all these different things and again, this webinar would have taken us a few hours if I was just using kind of a SQL engine here. But fortunately, we have this acceleration layer, the data reflections, and as a result of that, all these queries took less than one second and I was able to get that very interactive speed.
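To make the idea behind this speed-up concrete, here is a minimal Python sketch of pre-aggregation in general. It is not Dremio's implementation: the sample rows, the function names and the choice of (year, month) as dimensions are all assumptions made for illustration. The point is only that once counts and tip sums are stored per dimension combination, questions like "trips per year" or "average tip per year" are answered from the small summary rather than from the billion raw rows.

```python
from collections import defaultdict

# Hypothetical sample of trip rows: (pickup_year, pickup_month, tip_amount).
trips = [
    (2009, 1, 1.00),
    (2009, 12, 2.00),
    (2010, 7, 1.50),
    (2010, 12, 2.50),
    (2010, 12, 3.50),
]


def build_aggregation(rows):
    """Scan the raw rows once and keep [count, tip_sum] per (year, month)."""
    agg = defaultdict(lambda: [0, 0.0])
    for year, month, tip in rows:
        cell = agg[(year, month)]
        cell[0] += 1
        cell[1] += tip
    return dict(agg)


def trips_per_year(agg):
    """Answer 'count of trips by year' from the summary only."""
    out = defaultdict(int)
    for (year, _month), (count, _tip_sum) in agg.items():
        out[year] += count
    return dict(out)


def avg_tip_per_year(agg):
    """Answer 'average tip by year' by rolling up stored counts and sums."""
    counts, sums = defaultdict(int), defaultdict(float)
    for (year, _month), (count, tip_sum) in agg.items():
        counts[year] += count
        sums[year] += tip_sum
    return {year: sums[year] / counts[year] for year in counts}


# Built once, then reused by every dashboard query that follows.
summary = build_aggregation(trips)
print(trips_per_year(summary))
print(avg_tip_per_year(summary))
```

Storing sums and counts rather than averages is what lets the same summary be rolled up from month to year without going back to the raw rows.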

You can see the flame here is an indication of that acceleration that was happening. You know, when you see the flame next to a job here, that means that we were accelerating that query or in other words, we were leveraging one of these data reflections to satisfy that query much faster than we otherwise could have.

If you want to define these reflections manually, you can actually go and do that. So we allow you to go in here, on a specific data set, and you can define new data reflections. So for example, you know, sometimes you have various different kinds of workloads and maybe you want to tune these reflections and say ... so we had two kinds ... actually, let me give you some background here. We have two kinds of reflections. We have aggregation reflections, which are representations of the data, optimizations where the data is aggregated by various dimensions and with different measures. And we have raw reflections, where these are kind of at the row level granularity, where the data may be sorted, partitioned and distributed in different ways.

And when you can ... let's say you had a new use case, people were querying the data and running a lot of queries that involve looking for a specific vendor and looking at all the events that vendor had between two dates. And so if I created a new reflection here and I maybe ... actually I should have. I wanna sort the data by that vendor ID and then partition by the pickup time and the drop off time and now those types of queries will be faster starting from this point on.
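The benefit of keeping a sorted copy for this kind of "one vendor between two dates" query can be sketched in a few lines of Python. This is a simplified illustration, not how Dremio stores raw reflections: the vendor codes, the rows and the function name are hypothetical, and the sketch sorts on (vendor, pickup date) instead of modelling partitions. With the data sorted that way, the matching rows sit in one contiguous run that a binary search can locate without scanning everything.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Hypothetical raw rows: (vendor_id, pickup_date, fare).
rows = [
    ("VTS", date(2013, 5, 2), 12.5),
    ("CMT", date(2013, 1, 9), 7.0),
    ("VTS", date(2013, 1, 15), 9.0),
    ("CMT", date(2013, 3, 1), 20.0),
    ("VTS", date(2013, 2, 20), 5.0),
]

# The sorted copy is built once; keys mirror the sort order for bisect.
sorted_rows = sorted(rows, key=lambda r: (r[0], r[1]))
keys = [(r[0], r[1]) for r in sorted_rows]


def trips_for_vendor(vendor, start, end):
    """Return rows for one vendor with start <= pickup_date <= end."""
    lo = bisect_left(keys, (vendor, start))
    hi = bisect_right(keys, (vendor, end))
    return sorted_rows[lo:hi]


result = trips_for_vendor("VTS", date(2013, 1, 1), date(2013, 3, 31))
for row in result:
    print(row)
```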

So that's all you have to do, really ... you know, spend a minute here, create a new reflection. That gets maintained automatically and incrementally by Dremio and from now on, those types of queries will run faster and again, the user doesn't have to change the application or the dashboard or the report to point to anything different. They keep using the logical model of the data and the virtual data sets and behind the scenes Dremio can accelerate that.

It's also not necessarily a one to one mapping, because a single reflection could help accelerate queries on hundreds of different ... or thousands of different virtual data sets that all have some relationship to this data set. So maybe it's a join between this data set and something else or maybe it's a virtual data set that is a subset of the records of this data set. So all those things can benefit from the same reflection.
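One way to picture how a single reflection can serve many virtual data sets is a coverage check like the toy Python sketch below. The real matching is done inside Dremio's query planner and is certainly more sophisticated than this. The lineage table, data set names and the subset rule here are assumptions chosen only to show the shape of the idea: a query can be accelerated when it traces back to the reflection's data set and asks only for dimensions and measures the reflection already holds.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AggReflection:
    base_table: str
    dimensions: FrozenSet[str]
    measures: FrozenSet[str]


@dataclass(frozen=True)
class Query:
    virtual_dataset: str
    dimensions: FrozenSet[str]
    measures: FrozenSet[str]


# Hypothetical lineage: which physical data set each virtual data set derives from.
LINEAGE = {
    "marketing.trips_2013": "nyc.taxi.trips",
    "finance.tips_by_month": "nyc.taxi.trips",
    "ops.positive_reviews": "elastic5.yelp.review",
}


def can_accelerate(reflection: AggReflection, query: Query) -> bool:
    """Toy rule: same underlying data set, and the reflection covers the query."""
    return (
        LINEAGE.get(query.virtual_dataset) == reflection.base_table
        and query.dimensions <= reflection.dimensions
        and query.measures <= reflection.measures
    )


reflection = AggReflection(
    base_table="nyc.taxi.trips",
    dimensions=frozenset({"year", "month", "vendor_id"}),
    measures=frozenset({"trip_count", "tip_sum"}),
)

q1 = Query("marketing.trips_2013", frozenset({"month"}), frozenset({"trip_count"}))
q2 = Query("finance.tips_by_month", frozenset({"year", "month"}), frozenset({"tip_sum"}))
q3 = Query("ops.positive_reviews", frozenset({"month"}), frozenset({"trip_count"}))

for q in (q1, q2, q3):
    print(q.virtual_dataset, can_accelerate(reflection, q))
```

Here two different virtual data sets are served by the one reflection, while the third, which derives from a different source, is not.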

So with that, I think we will ... that wraps up kind of the presentation part of this webinar and I think I'll turn it over now for kind of Q&A. So if anybody has any questions.

Haven't seen any questions come in so far. I think one of the questions I'm getting here is how do reflections help with the migration to the cloud, and actually, if we went back to this picture here, the situation that you have with Dremio is, at the end of the day, the user wants to get an answer and that may involve data coming from different places, and so the less we have to reach across the network to a remote data source, the better that performance is going to be. And reflections, basically, serve as that kind of integrated caching and indexing layer that allows us to respond to queries much faster than if we had to go to these sources and get the data every time.
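The caching effect described in this answer can be sketched generically in Python. Nothing here is a Dremio API: the class, its methods and the sample table are hypothetical, and a real reflection is persisted to a store such as S3 or Hadoop rather than held in memory. The sketch only shows the pattern of materializing remote data once and answering later queries from the local copy.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]


class CachedSource:
    """Toy stand-in for a reflection used as a cache in front of a slow source."""

    def __init__(self, fetch_remote: Callable[[], List[Row]]):
        self._fetch_remote = fetch_remote
        self._materialized: Optional[List[Row]] = None
        self.remote_calls = 0  # counts trips across the network

    def refresh(self) -> None:
        """Pull the data from the remote source and keep a local copy."""
        self.remote_calls += 1
        self._materialized = list(self._fetch_remote())

    def scan(self) -> List[Row]:
        """Serve queries from the local copy, fetching only on first use."""
        if self._materialized is None:
            self.refresh()
        return self._materialized


def on_prem_table() -> List[Row]:
    # Stands in for a query against an on premise source over a slow link.
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]


source = CachedSource(on_prem_table)
totals = [sum(row["amount"] for row in source.scan()) for _ in range(3)]
print(totals, "remote calls:", source.remote_calls)
```

Three queries are answered with a single trip to the remote source; calling `refresh()` again is the hypothetical equivalent of updating the cached copy.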

There is another question here. What IT resources are needed to run Dremio? So basically, I guess I can answer ... there are two aspects to that question. One is from a people standpoint, one is from an infrastructure standpoint. So from an infrastructure standpoint, you basically need, you know, one server or 100 servers or as many as are needed for the amount of workload, it's a scale out architecture. So you just need some instances on Amazon let's say or some servers or VMs in your data center or if you have a Hadoop environment, you can actually run Dremio natively on the Hadoop environment and we support YARN as a way of spinning up the execution resources in a case where somebody has Hadoop.

In terms of people, you know, if you were to use Dremio, you would be up and running within a few hours, this is nothing like Hadoop where you have kind of a complex installation process with dozens of ... you know, maybe two dozen open source projects that need to be integrated and cared for. It's actually very simple. There are two roles in the cluster, one is the coordinator node, you can have one or a few coordinators. Those are the servers that basically serve the UI and they also do the query planning and optimization, so that's what the BI tool connects to.

And then you have the executors which are responsible for running, executing the queries and those are more elastic and you can have as many of those as you want.
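The split between the two roles can be illustrated with a small Python sketch of scale-out execution in general. It is not Dremio code: threads stand in for executor nodes, and the function names and the round-robin split are assumptions made for the example. One function plays the coordinator, dividing the work into fragments and merging partial results, and another plays the executor, computing over its own fragment.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(rows, num_executors):
    """Coordinator role: split the work into one fragment per executor."""
    return [rows[i::num_executors] for i in range(num_executors)]


def execute_fragment(fragment):
    """Executor role: compute a partial (count, sum) over its fragment."""
    return len(fragment), sum(fragment)


def run_query(rows, num_executors=4):
    """Plan, fan out to the executors, then merge the partial results."""
    fragments = plan(rows, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(execute_fragment, fragments))
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return count, total


tips = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
count, total = run_query(tips, num_executors=3)
print(count, total, total / count)
```

Adding executors changes only `num_executors`; the planning and merging logic stays the same, which is the sense in which the executor tier is the elastic one.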

Somebody needs to manage the cluster and be the person who's responsible for, you know, kind of provisioning the users unless you're using LDAP, maybe creating the reflections or maintaining the reflections, making sure that you're using it within the capacity constraints that you want in terms of how much storage can be consumed.

But by and large, it's certainly not a full-time job and it should be pretty simple for anybody who has a ... somebody who is already managing some big data infrastructure, those types of kind of skillsets are very applicable here. I would see that kind of IT resource that can help here.

Okay, so with that, thank you everybody for joining this webinar and listening in. And we will make this webinar, this recording available to all of you. So thanks again for signing up and joining us this morning and if you're in the US, Happy Thanksgiving. Thank you.