Enabling Self-Service Interactive Analytics on All of Your Data

Lucio Daza:

Thank you for being here with us. My name is Lucio Daza, and today we have a very exciting presentation prepared for you. We're going to talk about enabling self-service interactive analytics on all of your data. Without further ado, I want to welcome Jason. Jason, thank you for being here with us, and the time is yours.

Jason Hughes:

Thanks Lucio. So, what we'll go over and cover is the topic of how to enable self-service interactive analytics on all of your data. First of all, starting from a data consumer's perspective, if they want this capability, what's their experience today? We'll go through that, and as I'm sure a lot of the attendees are aware, it's not exactly seamless right now. There are some issues there, so we'll walk through an example. And then we'll take a step back and really see how we got here. Because, obviously, there are some suboptimal things, and we'll try to understand some of the history and really why it's this way.

Jason Hughes:

After going through that, we'll try to understand the business impact of these issues and capabilities and where we are now. To that point, based on how we've walked through it, we'll propose some requirements for a solution to this problem: how we would enable self-service interactive analytics, and what that would really look like.

Jason Hughes:

Then we'll propose a solution for how we can meet these requirements for organizations, and then we'll go through what the day-to-day consumer's experience is to do the same thing from the first part, with this new proposed solution.

Jason Hughes:

So, to jump in, let's go through what a data consumer's experience is like today, broadly. First of all, obviously, a data consumer, to do their job, generally needs to answer a business question, right? There's some question that the business needs answered, or some information that the business needs in order to make a decision or inform some decisions.

Jason Hughes:

So, the first thing, whenever you need to answer a business question that's obviously a data-driven one, is that you need to find the data, right? And in order to find the data, you generally, A, need to know where it is. A lot of organizations have a lot of different systems. And what we see quite often, and what a few of our customers have actually explicitly told us, is that a data consumer, in order to find the data that they need, will actually just walk up to somebody a few cubicles over, or yell across the room, right? This is because everything is kind of tribal knowledge, which obviously isn't very scalable, and it becomes a matter of who you know that determines what data you end up getting.

Jason Hughes:

And then, B, once you have found a data set, how do you know it's the right one? Even if somebody pointed you to it, how do you know that if you go use it to answer this business question, you're going to get the right answer? If you're operating with wrong data, or incomplete data, or out-of-date data, then you will potentially come to the wrong decision, right?

Jason Hughes:

But in general, what we see as the more scalable path, and the more appropriate or process-driven recommendation, is that people end up going and asking IT, right? You either open a ticket, or maybe you know someone on the inside, but you end up asking IT where the data is, and also for some details about the data. So they give you a data set, and then maybe it's, well, when was the data refreshed? How much history does it have? Those kinds of questions. And these requests obviously take time, because IT has everyone else asking the same kinds of questions. So fulfilling these requests can take time, just to find the data set that you need to then use.

Jason Hughes:

At that point, though, generally this data set that you get doesn't have all of the data that you need to answer the question. Perhaps it doesn't have all the columns that you need, it doesn't have the full history, or perhaps it's not the right granularity: "Oh, this is monthly, but I actually need weekly," right? Or, "I need daily," or, "I need whatever it is."

Jason Hughes:

And often you actually need to integrate another data set. You have a single data set, but especially if this is a new type of business question, or an exploratory question, you generally need to integrate it with another data set. Well, how do you do that? Generally, you go ahead and ask IT again, right? Because they're the ones who have the ability to do it, whereas you don't necessarily: not from a skills perspective, but from a tools and governance perspective.

Jason Hughes:

So perhaps now you have the data that you need, or what you believe you need, and you can actually go start your analysis, right? But when you ask a question of this new data set that you've now combined, generally you're waiting, right? Whether it's 30 seconds, or more commonly a minute, five minutes, or maybe even half an hour.

Jason Hughes:

And then it's always an iterative process: "All right, I wanted to answer this, and it turns out that when I went and answered this question, I actually do need this additional column, or I do need this additional data set integrated," right? So then you go back to either step one or step two. And once you need to go ask IT to speed up these queries or get these additional data sets, it's again more requests to IT.

Jason Hughes:

And finally, this is an iterative process. Any analytics project or process really is iterative, in that you ask a question of the data, and based on the results, you need to tweak it. And more broadly, everyone in the organization is doing this, right?

Jason Hughes:

So given the response time from IT, this whole process can take anywhere from days to weeks. Because everyone's doing this and IT, it's not that they're lazy or anything, it's that there are so many requests, because they need to do this for everyone. What we've actually heard from one of our customers is that to get a data set like this provisioned by IT, it actually takes up to three weeks, just to get the data, right? And then you need to analyze it more, and maybe you need to make more changes.

Jason Hughes:

So that's a given data consumer's experience, and with everyone doing this within what is, optimally, a data-driven organization, that can cause problems. Obviously, this is a problem, and this isn't the way the organization would want it to be. So why is it this way? How did we actually get here?

Jason Hughes:

So to take a step back a little bit further, analytics really started in the data mart era, when analytics wasn't necessarily prioritized. It was more low-level, descriptive reporting, basic things, really just around sales and those kinds of basic KPIs.

Jason Hughes:

So you generally had these various application databases, and they would load into all of these different physical data marts. Maybe it's one per application, or maybe it's one per organization or project, or likely a combination of these, right? So you end up with your data consumers having these different tools at their disposal, but they don't really know where to go for what data. So obviously, you need to involve IT a lot in this situation.

Jason Hughes:

As this became more and more of a problem for running your business, well, we needed to do away with these individual physical data marts, where nobody had any idea where the data was and everyone was recreating things that everyone else was already doing. So the next thing was the EDW era, right? Where you had this single data warehouse that you would, optimally, load all of the data into. Now you would have all of your data in one store that you could actually perform your analytics on, right? And it worked really well for the problems it needed to solve, but there were still, inevitably, some side data marts, from a resourcing and governance perspective, because these data warehouses tended to be fairly cost prohibitive if you wanted to actually get all of the data and all the permissions and everything into them.

Jason Hughes:

So there ended up still being physical data marts. And these were both dependent data marts, like the data mart on the left here, which is fed just from the EDW, and also independent data marts, loaded directly from specific application databases. And this worked fairly well. It definitely had a couple of cons that I laid out, but really the demise of this kind of architecture came with big data, where the velocity and the volume and the variety of the data was something the data warehouse wasn't built for, right? It just wasn't designed that way. And it certainly wasn't cost effective to store the kind of low-value-density data that came in with the advent of big data.

Jason Hughes:

Well, then the next architectural era came in, around the data lake, right? So now the thought was, "Well, can we load all of these different types of data, with their different value densities?" Can we load that all into a data lake? It was great, because you could treat it as a central data repository. It was really scalable. You could decouple your compute from the storage; I know Hadoop kind of scaled them both together, but with the architecture itself you could, in theory, scale them independently, and several organizations did. And also it was a file system, right? So it ended up being schema-on-read, with any format you wanted.

Jason Hughes:

You could even store a lot of the logs, audio, video, a lot of those things, because you weren't restricted in what kind of data you could store. So you could just dump it into a single place and retain it. It was also very cost effective, right? It was built on commodity hardware. So, again, at the scale that you could achieve with this, it was cost effective to store all this data, versus throwing it out.

Jason Hughes:

But at the same time, there were cons, right? This architecture had its pros, and this is certainly simplified; there are probably a lot more databases and things. But in general, in order to get self-service for your data consumers, you needed some sort of programming background. Maybe that's writing Java and MapReduce, or maybe the tools that simplified the language on top. But in order to get real use out of it, you needed to understand what was happening in the background.

Jason Hughes:

So if you just wrote whatever query you wanted, chances are it could take a couple of days just to run that query, if you didn't understand the architecture of what was happening behind the scenes. Also, even if you did know, you often ended up with poor performance if the data wasn't physicalized in a way that was really optimized for that application, right? In order to really get performance out of it, you needed to perform application-specific ETL.

Jason Hughes:

In addition, it was very hard to find data in the data lake. And on top of that, there was a lack of governance. Really, both of these things led to what a lot of people ended up calling a data swamp, right? The data was there, but it wasn't super useful. It was hard to find data, people were making copies everywhere, and even if you did find it, it was difficult to use.

Jason Hughes:

So all of this led to the fact that you had all of this data stored, but it was very difficult to actually get real value out of it, right? The ratio of value to data stored was relatively low. So the question became, now that we have all this data in the data lake, can we add something to it? Well, what we added was the EDW. And this isn't necessarily chronological, right? Of course, the EDW was already around, but it's more about the data flow: now everything is in the data lake, and we need to load more of it into the EDW.

Jason Hughes:

And this provided a lot of benefits; offloading the cons of the data lake was really what it was about. And it was great because, traditionally, it's been more performant than data lake execution engines. These things have been honed for a while: the cost-based optimizer is very good, it knows about the structure of the data storage, and it has several optimizations it can do to really increase the performance for your data consumers. It also had high concurrency, so you could really run the business off of these, in that you could have 100, 1,000, 10,000 data consumers against this single system, and it could still provide them with the analytics and interactivity that they needed.

Jason Hughes:

It also provided high guarantees, and would consistently meet your SLA. So as the data consumer, if you issued a query today, you knew that when you go into a meeting and pull up the dashboard, or whatever it is, you'd generally see that consistent performance.

Jason Hughes:

And another big thing was that it really provided security and governance. So now you can actually control and make sure that, A, from a governance perspective, not everyone is creating copies everywhere, but also, obviously, from a regulatory standpoint, you can now control who sees what, with audit logging and those kinds of things, right?

Jason Hughes:

So those things were great, and it really served its purpose there. But at the same time, there were also cons to this approach. Certainly it's expensive, right? And oftentimes, while it's traditionally more performant than a data lake, because it's expensive you can't necessarily optimize, from a performance perspective, for all of these different types of queries. You had to basically pick and choose, because you're pretty cost constrained.

Jason Hughes:

So what you ended up doing was trading off one against the other, and now some portion of your users wouldn't have the performance that they need. Sure, the executives and the really important, high-priority applications would, but not necessarily everyone. Also, because of the cost, you end up with a subset of the data, right? You can't load all of the granularity, or all of the columns, or all of the data sets into the data warehouse, because of the cost of it, as well as sometimes even the scale of it. Sometimes it also couldn't scale, depending on whether you're a web-scale company, perhaps.

Jason Hughes:

Another thing is that in order to use this data warehouse, and everyone's generally accessing the data warehouse now, you actually needed to load the data. So before you could even use the data, if it wasn't in the data warehouse, or you needed new columns, or a different granularity, or those things we went over before, you also needed to have someone load the data into the data warehouse. And again, from the cost perspective, you generally needed to justify it, which doesn't really help exploratory analytics, right? Also going along with that, not all of the data is in the system. So again, you need to have people load this data in, which makes you dependent on an outside group that's getting these requests from everyone.

Jason Hughes:

Another big part was that these data warehouses were almost all proprietary, right? So you ended up with, "Well, hey, maybe this helps me here, but it also locks you in for the future and doesn't really give you much flexibility to do what you'd like to do," right? You're kind of beholden to them.

Jason Hughes:

So again, with the same theme of "let's keep the pros, but try to offload the cons," we added BI extracts as one piece. Now we actually had extracts from the EDW, and that was great, because for all these applications that weren't meeting their performance requirements, we could now meet them. You do all these performance optimizations in the BI tool, and now the dashboards load quickly, for the most part.

Jason Hughes:

You could also join data from multiple locations. So if not all the data is in the EDW, or you needed some different granularity or whatever, you could now do that in the BI tool, in theory. The key to these, though, is really "as long as it fits." When you get to big data scale, this kind of breaks down, because you end up with the same problems. At larger scale, it's a single server; it's not a scale-out architecture, it's scale-up. You end up with, "All right, now I need to choose a subset of the data," right? They're also not execution engines underneath the covers. They weren't really meant to be that way.

Jason Hughes:

So if you feed in a large dashboard, you can also get performance issues: A, it's not running as quickly as I want, but also it's running inconsistently, right? You can get inconsistency in dashboards, again, because they weren't really meant to be execution engines under the covers.

Jason Hughes:

Another piece is that any of these performance improvements and this joining were only specific to that BI tool. So your clients, your data science, even Excel, or whatever tool users wanted to use, well, they didn't get these benefits. This only worked for that specific BI tool, and that specific instance of that BI tool, and you could have many BI tools and many instances.

Jason Hughes:

Another big problem was that the business logic you built into this actually lives only in the BI tool, because now you end up with, "Okay, I now need to do the same logic in different tools." And now different people have different definitions of KPIs. So you're going into a meeting and somebody is saying, "This was our return," or, "This was our revenue, or whatever, for this quarter," and somebody else is saying, "Well, no, I'm seeing a different number." Now you need to go back through and trace all of that. Or it ends up just being wrong, and you only see one of them, and you're making a wrong decision, a less optimal decision, based on that data.

Jason Hughes:

Another piece is that these extracts are still really IT-managed. So you still lose that self-service piece along with the performance, right? And you can see the theme as you go through this: you pick a couple of the pros and deal with the trade-offs, and then you try to offload the trade-offs. So if these were potential issues for a given application, well, maybe what they would then go with is a physical data mart, right? Bring those back into play, because it was great: now they generally own that organizationally. And now these users could have more self-service, and they really have the flexibility and agility to achieve their goals from a self-service standpoint. They also have the performance. So now they can actually build these applications, and they're much happier in general, right? They can do a lot more themselves.

Jason Hughes:

But obviously, you maybe get the theme by now. There were a lot of issues around, of course, governance and regulatory compliance, right? Now you have data copies everywhere, as you can see, so how do you really keep track of where the data even is? And especially with GDPR and those kinds of privacy laws coming into effect, around deleting someone, or a record about someone, how do you control who has access to this? And how do you know who accessed it?

Jason Hughes:

Also, from a cost perspective, there's running all of these. This is obviously oversimplified; there are more than just two data marts, probably more than one EDW, and probably more than one data lake in the environment. So you end up with the cost of actually running all of these where you don't really necessarily need to.

Jason Hughes:

Another big piece was the data drift. We talked about business logic drift, the incorrect decisions or conflicting numbers, and those kinds of things. The same thing happens with data drift, right? You end up with that in this situation. There's also really wasted effort: if you're looking to answer this business question and you just analyzed some customer trend, well, it turns out marketing has already done that, but you're in, I don't know, operations or business development, and you don't necessarily see that. Now you're replicating that effort and wasting a lot of time, when optimally you could just leverage that work. And again, that goes to the same data and logic drift.

Jason Hughes:

As well as, each group now needs to have their own IT. Where they're managing these physical data marts, you still need someone to manage and operate the cluster, as well as the jobs that run inside it. So now you also have your own IT that you need to staff.

Jason Hughes:

So, as you can see, this is kind of a mess, and this is in fact simplified, probably greatly. Obviously this is a fair generalization, but it's really based on what we've seen in the past, and certainly what I've personally seen. When you end up with an architecture and a process and a consumer experience like this, you end up impacting the business, potentially, fairly significantly. Certainly, as we discussed before, you end up storing a lot of data, but the value that you get out of it is relatively low compared to what the data lake era promised, right?

Jason Hughes:

And if you have all this data and you get value out of it, great. But as you've seen, to really make that data usable, it's traditionally taken a lot of effort just to get an application built, let alone application changes, let alone exploratory analytics, right?

Jason Hughes:

You certainly also lose a lot of productivity, in that when the data consumer asks a question, they're not just waiting one, five, 30 minutes to get an answer from that single query; as we were going over before, if they need additional data to really answer that question, they could be waiting for days, even weeks. So in between, what are they doing? Maybe there are additional questions that they're answering, but they're not really able to focus, and there's the context-switching problem of course, so you end up with a loss of productivity there.

Jason Hughes:

As well as on the IT side, right? Because now IT and data engineering are shipping this data around and replicating this effort for all these different groups, when they could be focusing on much higher-value things for the business.

Jason Hughes:

You certainly end up with late decisions, right? If it can take three weeks for you to get a data set, that can certainly delay the decision, especially when you need to move agilely. And if your competitor comes out with something and you need to answer some questions, and it takes three weeks to do, that's obviously a problem.

Jason Hughes:

Wrong decisions, as we've also gone over: two different numbers, or the data being out of date and you don't even know it because there's no alternative. You end up with these kinds of wrong decisions as well, especially with the complexity of all these different pipelines moving data around, right? If something fails, or there's some issue with it, it's certainly hard to manage all of this given the very high complexity here.

Jason Hughes:

As well as the regulatory compliance issues that we mentioned before, around physical data marts specifically. Overall, if you look at this kind of architecture, data is copied everywhere, right? It was hard to keep track of data within a data lake, let alone this entire ecosystem. So it's incredibly difficult to maintain and track all of that, and it's also a waste of time and resources.

Jason Hughes:

So you're not just wasting the time of your data consumers and data engineers, as we've gone over, but also the actual resources of the systems, which could be better put towards the data consumers' questions. Right now, a lot of it goes to copying, a lot of it is ETL, a lot of it is ELT, a lot of those kinds of transformations, right?

Jason Hughes:

So with that understanding of how we got here, and why it is the way it is for these data consumers, what kind of options do we have? What I'd like to go over now is this kind of goal state of what we can create; call it whatever you like. For now, we'll call it self-service interactive analytics: they can do it themselves, they get the interactivity, and they can run their analytics.

Jason Hughes:

So first, from a data consumer's perspective, because really the requirements drive from the business to the data consumer to data engineering and IT. From the consumer perspective, they have a series of requirements. A, it needs to be self-service. They need to be able to do these things themselves; they need to be self-sufficient at the end of the day. You really want to take that kind of experience you have in your personal life, with Google and all the technologies you use there, and bring it to your business.

Jason Hughes:

You also need to have data discovery. You need to be able to find the data. You need to be able to understand the data from a semantic perspective: do these terms mean the things that I think they do? And you need to be able to trust the data, or at least understand the level of trust that you can put in a data set, right? You need to have that kind of metadata and those kinds of things.

Jason Hughes:

You also need to be able to get any data that you actually need for your analysis. If you need to augment it with another data set, or if you need to get deeper granularity, you need to be able to do those things yourself. And chances are the data isn't in the exact format that you need, right? You probably need to do some curation and augmentation, and you need to be able to do that yourself too. Whether it's reformatting a column, extracting a piece of a column, or whatever it may be, you need to be able to do it yourself.

Jason Hughes:

You can have all of these capabilities, but you really need interactive performance to actually make them useful. All right, I can ask these questions and I have these capabilities, but if I'm waiting 30 minutes for an answer, I'm going for a walk or grabbing several cups of coffee.

Jason Hughes:

You also need to be able to collaborate, right? It's kind of the other side of that data discovery piece, in that once you save a data set, nobody works in a silo; you probably need to share it or send it to some other person. You really need to make that collaboration lightweight and practical as well.

Jason Hughes:

You also need to be able to leverage your existing skills and tools of choice, right? Anytime you make data consumers, or anyone really, learn new skills or new tools, you're raising that barrier to entry that much more, and they're going to end up finding a way around it, probably by downloading things to Excel and those kinds of things. So they really need to be able to leverage their existing skills and use whatever tool they want.

Jason Hughes:

So those are the data consumer requirements. If we can provide these, what end up being the actual IT and data engineering requirements of this platform?

Jason Hughes:

So, A, from a functional perspective, they need to have, of course, security and governance throughout, right? You need to be able to really secure the data, and you need governance policies that you can enforce. You also need audit logging throughout. Across this entire ecosystem, you need to be able to control what's happening and understand who is doing what and why.

Jason Hughes:

You also need to be able to meet these SLAs consistently, right? For the business and for the applications. We have a customer where, in an exec meeting, they really need to be able to ask questions interactively, and you can't be waiting 30 seconds or a couple of minutes to do this, especially in the C-suite.

Jason Hughes:

You also need to be able to fulfill requests; there are probably always going to be requests, right? The self-service aspect is kind of a spectrum, anywhere from fully decentralized, where data consumers can do anything they want, to fully centralized, where IT basically controls everything, and you end up fitting somewhere on that spectrum. So there will certainly still be requests that IT needs to fulfill for the data consumers, and they need to be able to fulfill them quickly, right? Not copying data from system to system to system, as has been done, and then maintaining the monitoring and all those kinds of things.

Jason Hughes:

You also need to be able to enable agile development for these data consumers. They need to be able to develop their data applications and all these things, while also being able to do it in parallel, right? Somebody doing it in marketing shouldn't impact someone doing it in operations, or sales, or wherever. So you really need to provide this multi-tenancy, not just from a technical perspective of how many users can be on it, but also organizational multi-tenancy, or domain multi-tenancy.

Jason Hughes:

So now on to the technical requirements of the actual platform. A, it needs to be linearly scalable. As you increase the number of users and the amount of data, you need to be able to maintain these SLAs and these capabilities, so it really needs to be not just scalable but linearly scalable, right? It can't be that you add two users and now you need to add 12 nodes. You can't be doing those kinds of things.

Jason Hughes:

You also need to decouple the compute from the storage, right? I think, throughout the industry, we've all understood that this is a good thing, that it really brings a lot of benefits, and it's pretty much a given at this point. So it definitely needs to decouple the compute from the storage. But it also needs to decouple one step further: the logical from the physical. Your users need to be able to operate in a logical, semantic layer, especially from a multi-tenancy perspective, where different domains view the same data but slightly differently.

Jason Hughes:

You need to be able to decouple the logical experience that the data consumer has from the physical: A, where the data is stored, and also how you're optimizing the data from a performance perspective. Your users shouldn't have to care, and they probably don't want to care, about where the data is stored, how it's stored, and how it's optimized. That also gives you the ability to be tool, data store, and cloud agnostic, right? Because on top you have your tools accessing that logical layer, but down below the platform you could have any data store that you want, and any cloud that you want to run this on.
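
To make that logical/physical decoupling a bit more concrete, here is a rough SQL sketch. The space, source, and column names are all invented for illustration, and the CREATE VDS statement reflects Dremio's SQL dialect as I understand it, so treat it as a sketch rather than exact syntax: consumers query a business-friendly logical name, while where and how the underlying data is stored can change behind the scenes.

```sql
-- Hypothetical sketch: a logical (virtual) data set defined over a physical one.
-- "Sales"."orders_clean" is what consumers query; the underlying
-- "datalake"."raw"."orders" can move or be optimized without them noticing.
CREATE VDS "Sales"."orders_clean" AS
SELECT
    order_id,
    CAST(order_ts AS DATE)    AS order_date,
    UPPER(region_code)        AS region,
    quantity * unit_price     AS order_amount
FROM "datalake"."raw"."orders"
WHERE order_status <> 'CANCELLED';
```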

Jason Hughes:

And, A, that helps with flexibility initially, but, B, it also really future-proofs you: it gives you the flexibility and the peace of mind that if, in the future, you want to completely re-architect these things, you can. And it has to be cost effective, obviously. If we end up back in the EDW era of things being expensive, you end up with the same problems: you're going to have to choose a subset of your data, and it can be hard to justify the applications, exploratory analytics, those kinds of things.

Jason Hughes:

So, with these high-level requirements in mind, we can look at this kind of architecture, right? You still end up with all these various source systems; the source systems are a given, that's where the data is coming from. And you generally want to load this into a central place, right? You don't want to be hitting all these different sources, and a lot of times these are external sources, so you need somewhere to put this data. It's still probably going to be the data lake, because it is good as a cost-effective, scalable storage system for really any data format.

Jason Hughes:

And you also end up with the EDW, because the pros that we went over are certainly positive things. But at the same time, you really need to be able to offload the cons. That's where, optimally, this platform comes in: one that enables self-service interactive analytics on your data in any of these data stores, right? Sometimes it actually makes sense to connect the application database directly as well, because maybe the application database only loads into the data lake every 12 hours, but for certain applications you need that data every 30 minutes, or you need it in real time.

Jason Hughes:

So you really need to provide this flexibility and decouple your users, who operate at that logical level, from what's actually happening behind the scenes at the physical level, without it impacting them in terms of performance, discoverability, and those kinds of capabilities.

Jason Hughes:

So, what we propose here, well, Lucio and I at Dremio, is that Dremio can actually serve as this self-service interactive analytics platform. As you can see, it decouples the users from the actual storage medium, and we'll certainly go through exactly how Dremio does this.

Jason Hughes:

But just really briefly on Dremio itself, for those of you who may not be fully familiar with it: A, it provides a self-service semantic layer for data engineers as well as analysts and data consumers. So it provides that self-service semantic layer, and it provides a UI that is really user friendly and intuitive, even for non-technical users. And it basically provides the same kind of experience from whatever tool you're using; it really meets users in the tools they already have.

Jason Hughes:

It also provides incredibly fast queries directly on your data lake storage, or wherever that data may be stored. Primarily, the first place it's going to land is the data lake, and optimally, you don't need to move it from the data lake if you can address and mitigate those cons that we went over before. So there's no need to load the data into proprietary data warehouses, or create cubes, extracts, and physical data marts, those things we went over before, if you can get the performance and those other capabilities directly on the data lake itself.
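
As a simple illustration of querying the lake directly, here is a hedged sketch; the source name, folder, file, and columns are all assumptions for the example, not anything from the webinar. The point is that a consumer issues plain SQL against files where they already sit, with no load step into a warehouse first.

```sql
-- Hypothetical sketch: querying Parquet files in a connected data lake source
-- directly with SQL. "datalake", the folder, the file, and the columns are made up.
SELECT pickup_date, passenger_count, fare_amount
FROM "datalake"."nyc-taxi"."trips_2019.parquet"
WHERE pickup_date BETWEEN DATE '2019-06-01' AND DATE '2019-06-30'
LIMIT 100;
```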

Jason Hughes:

Of course, in any environment the pendulum swings from decentralized, with the physical data marts, to centralized, with the EDW, and back again, and it's probably not going to be fully centralized for a while, right? So what you end up needing is to be data store agnostic, and really be able to join databases and data warehouses and application databases and file systems and whatever, right? You need to be able to join all these things to provide flexibility.
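
A federated join is the concrete version of that point. The following is a hypothetical sketch; both source names ("datalake" and "appdb"), the schemas, and the columns are invented. One query spans a lake folder and a connected relational application database, without either side being copied into the other first.

```sql
-- Hypothetical sketch: joining across two connected sources in one query,
-- a data lake folder and an application database. All names are invented.
SELECT c.customer_segment,
       SUM(o.order_amount) AS total_amount
FROM "datalake"."raw"."orders"     AS o
JOIN "appdb"."public"."customers"  AS c
  ON o.customer_id = c.customer_id
GROUP BY c.customer_segment;
```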

Jason Hughes:

And it also really needs to operate on open standards. Again, you want to avoid lock-in as much as possible, and you really need to be able to operate on standards. There are open standards for most things now, especially in the data storage space, in terms of file formats, data stores, and those kinds of things. You want to operate on standards as much as possible.

Jason Hughes:

So with that said, let's dive deeper into exactly how Dremio meets these requirements, and let's walk through each of them. From the self-service perspective, Dremio provides that self-service semantic layer for, A, technical users; they can certainly use it, but they've never really had much of a problem here. But it's also very intuitive for non-technical users. We have quite a few customers at Dremio who are actually able to provide business users with the kind of intuitive experience they generally could only get with something like Business Objects, and those non-technical users have really enjoyed it and found great use and productivity in it.

Jason Hughes:

It also provides a lot of the data discovery, understanding, and trust via the ability to search for data wherever it is, right? Data lake, data warehouse, application database, wherever it is, if it's connected to Dremio, you can search through it. It also provides the ability to browse lineage and view how a given data set was actually created, and therefore some information about its provenance, as well as data set wikis and tags. So you can actually see some general documentation about it directly in the tool, without leaving it; it's just a single click on the next part of the page.
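
Most of that discovery happens in the UI, but as a hedged aside, connected data sets are also visible through standard SQL metadata, so a consumer can explore what exists from whatever SQL tool they already use. The search term below is made up, and the catalog view names follow the INFORMATION_SCHEMA convention that Dremio exposes, which can vary a bit by version.

```sql
-- Hypothetical sketch: discovering data sets via standard SQL metadata
-- from any connected client. The LIKE filter value is an invented example.
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA."TABLES"
WHERE LOWER(TABLE_NAME) LIKE '%order%'
ORDER BY TABLE_SCHEMA, TABLE_NAME;
```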

Jason Hughes:

And unfortunately, we don't have the time in this session to actually go through a demo to really show you these kinds of things. But Lucio does a great job with the live demos every Tuesday, so we certainly have that, or you can follow up with us to actually see it. I think there are also some videos hosted on that.

Jason Hughes:

But moving on: it also needs to provide any data for the analysis. Well, with Dremio, you can look at data at any depth: whatever history, whatever granularity, multiple sources, multiple data sets. It really allows you to look at any of this data, again, in a self-service manner, doing it yourself.

Jason Hughes:

Also, looking at all that data, there's the curation aspect. In the Dremio UI, you certainly have visual curation, but you can also do it with pure SQL if you like, from any tool that you like; again, there are ODBC and JDBC drivers as well as the REST endpoint. So you can use your SQL from wherever. Basically, whatever tool you use today can likely work with Dremio.
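
To make that concrete, here is an illustrative sketch of the kind of last-mile curation a consumer could issue from the UI or any ODBC/JDBC client. The dataset path, columns, and date format are all invented assumptions; the point is that the shaping is expressed logically in SQL rather than by making a copy.

```sql
-- Hypothetical sketch of self-service curation in plain SQL:
-- reformat a date column, extract a piece of a code, and fix a type.
SELECT
    order_id,
    TO_DATE(order_date_raw, 'YYYY-MM-DD')  AS order_date,      -- reformat a column
    SUBSTR(product_code, 1, 3)             AS product_family,  -- extract a piece of a column
    CAST(order_amount AS DOUBLE)           AS order_amount     -- correct the type
FROM "datalake"."staging"."orders_export"
WHERE region_code = 'EMEA';
```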

Jason Hughes:

You can do this kind of last-mile ETL, or just curation, without making any copies of the data. You're not making copies of this data like traditional data prep tools; it's all logical, in SQL. It also provides the ability to do all of this augmentation in a self-service manner, as well as to actually upload data. So if you have an Excel spreadsheet that you want to augment your data with, you can do that, fully MPP and in parallel, by just uploading it to Dremio and joining that data set to the one that you're working on. It can be a two-step kind of process.
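
For example, and again this is a hedged sketch where the home-space path, the VDS names, and the columns are all invented, an uploaded spreadsheet typically lands under your own user space and can then be joined in and saved as a new virtual data set, with no physical copy of the lake data.

```sql
-- Hypothetical sketch: join an uploaded spreadsheet (sitting in a home space)
-- to an existing virtual data set and save the result as a new virtual data set.
CREATE VDS "Sales"."orders_with_targets" AS
SELECT o.region,
       o.order_date,
       o.order_amount,
       t.quarterly_target               -- column from the uploaded sheet
FROM "Sales"."orders_clean"     AS o
JOIN "@jhughes"."sales_targets" AS t    -- uploaded file in a user's home space
  ON o.region = t.region;
```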

Jason Hughes:

All of these capabilities are great, but if it takes you 30 minutes, or an hour, or three weeks to do these kinds of things, they're not really that useful, right? So you also need interactive performance along with all of them. And that's where Dremio really comes in: we have a scale-out architecture, so it's fully MPP. One of our customers is actually running us on hundreds of nodes, around 700. So it certainly has that scale-out architecture.

Jason Hughes:

It has an Apache Arrow-based execution engine, which is highly optimized. It takes great advantage of modern CPU architectures, with things like vectorization. Underneath, Apache Arrow is open source, and so is Gandiva, which you can think of as a technology built around Arrow. It does some very low-level optimizations; you really want to optimize the inner loops as much as you can, because that's generally where things spend the most time.

Jason Hughes:

Gandiva lets you compile expressions down to machine code, so it can actually take full advantage of the wide registers and those kinds of things that really highly optimize a lot of operations.

Jason Hughes:

It also provides the Columnar Cloud Cache, which basically means that instead of going to a cloud data lake, which is high throughput but also high latency, we're able to autonomously and transparently cache some of that data locally, right? So the next query comes in, and it turns out we already have the data; we'll make sure that the data is fresh and hasn't been changed, and then we'll use that cached data instead. And all of that is fully transparent to the end user.

Jason Hughes:

We also have the capability of what we call data reflections, which is a whole session on its own. But you can think about it this way: all of those things help a given query, but at the end of the day, if you need to scan 10 terabytes and do aggregations across 20 different fields, you're dealing with physics at that point, right? You'd have to go do that work.

Jason Hughes:

So reflections are really what enable you to avoid doing all of it at runtime. Your users still want interactive performance, or want their dashboard to load in sub-second, or whatever it may be. Well, let's actually go ahead and do those operations ahead of time, and store those results, again, behind the scenes. Then, when users ask for that again, without changing their query, it's the exact same query, Dremio recognizes, using relational algebra, that it can substitute the reflection instead. So you also have that ability to really optimize: if you need to get a query down to sub-second, you have that capability.
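
Here is a rough sketch of what that looks like in practice. The dataset, columns, and reflection name are hypothetical, and the reflection DDL follows Dremio's dialect as I recall it, which differs across versions, so verify against the documentation; the key point is that the consumer's query does not change.

```sql
-- Hypothetical sketch: an aggregation reflection defined behind the scenes.
ALTER DATASET "Sales"."orders_clean"
  CREATE AGGREGATE REFLECTION daily_sales_by_region
  USING DIMENSIONS (region, order_date)
  MEASURES (order_amount);

-- The consumer's query stays exactly the same; the planner can recognize that
-- the reflection covers it and substitute the precomputed results.
SELECT region, order_date, SUM(order_amount) AS total_sales
FROM "Sales"."orders_clean"
GROUP BY region, order_date;
```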

Jason Hughes:

In addition to that, around collaboration, Dremio is fully server-based. You create these data sets and it's all on the server; you're not copying the data out and sending it around, ending up with drift and copies of data. And you also have the wikis and tags there.

Jason Hughes:

In terms of leveraging existing skills and tools of choice, Dremio fully operates on standards. One of the core principles we really operate on is: let's meet the users where they are, right? Let's not make the users change their behavior. The industry has spoken, and SQL is the de facto standard, so ANSI SQL is what we run on, along with ODBC, JDBC, and REST API endpoints, so you can consume from any tool, as well as Arrow Flight, which is an upcoming standard.

Jason Hughes:

So that's how Dremio meets those requirements. Let's now go through how Dremio meets the actual IT requirements, starting with the functional requirements we went through. First, security and governance throughout the platform and the environment, right? Dremio provides security at the source level, at the PDS level (the physical data set, which you can think of as a table), and at the VDS level (the virtual data set, which is kind of like a view). So you have both of those levels, as well as the space, which is a unit of organization, basically how you organize and secure things. You can nest those to any arbitrary depth as this tree builds out, and there is also row- and column-level security, so you can enforce security at any and all of those levels. It also provides audit logging in terms of who's actually interacting with data and who's doing what, as well as full end-to-end encryption, all the way from reading the data to returning it to the client.
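
One common pattern here, sketched below with invented names and deliberately without Dremio's specific privilege-granting commands (those vary by edition and version), is to express row- and column-level restrictions as a virtual data set that exposes only what a given group should see, and then secure that VDS or its space rather than the underlying physical data set.

```sql
-- Hypothetical sketch: row- and column-level restriction expressed as a VDS.
-- Access would then be granted on this VDS or its space, not on the raw data.
CREATE VDS "Marketing"."orders_emea_restricted" AS
SELECT
    order_id,
    order_date,
    region,
    order_amount              -- sensitive customer columns are not exposed
FROM "Sales"."orders_clean"
WHERE region = 'EMEA';        -- row-level restriction to a single region
```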

Jason Hughes:

It also needs to meet SLAs consistently. How it can do that is, A, through, again, the scale-out architecture: add 10% more users, and you add potentially 10% more nodes. Potentially you won't even need to; maybe you only need 3%, something like that.

Jason Hughes:

Also, workload management is certainly a key piece of how you can manage and consistently meet SLAs, and ensure that different groups see consistent SLAs, as well as reflections, right? As we discussed before, instead of doing all this work at runtime, if you can do most of it ahead of time, it really helps you meet those SLAs consistently as well.

Jason Hughes:

In terms of enabling IT to fulfill requests with high productivity: A, decoupling the logical and the physical there is huge, because now if they ask for a new data set, you can go build it in the logical layer without having to worry about data copies or moving it from system to system. And then for the "okay, now it's too slow," or "it's slower than I need it to be" cases, you can go optimize the physical in the background, whether that's moving it to a higher-performance data store, maybe from a data lake to a local Hadoop cluster or even a relational database or something like that, or actually optimizing with reflections. Again, this is all behind the scenes, so it doesn't impact the users at all. It segregates those two layers.

Jason Hughes:

You also have the visual curation if you choose, or SQL if you really want it; again, it's all standard. You're not dealing with some visual tool where you're asking, "Where's this menu button?" It's all standards and those kinds of things.

Jason Hughes:

As well as reflection creation: actually creating these reflections is much more of an agile process. If you have a slow query, you can look at it and specify a reflection to be created in a minute, perhaps. Sometimes, of course, it takes longer, but if you understand the data, it can take as little as a minute to create that reflection, increase the performance, and fulfill that request.

Jason Hughes:

It also enables agile development. Basically, again, decoupling the logical and physical really helps that agile development, as well as promoting from dev to test to prod. So you have that kind of promotion process, with different levels of governance and different controls at each layer, as well as spaces, so you get organization and then control through that.

Jason Hughes:

Certainly, multi-tenancy is, A, from a technical perspective: "Hey, there's a whole bunch of users on here, I need to make sure the queries don't impact each other," from a workload management perspective, right? But there's also the domain or organizational multi-tenancy, and that's really where decoupling the logical and the physical helps as well. Now you can have group-specific or application-specific semantic layers that all leverage the same physical performance optimizations underneath the covers: these reflections, Gandiva, and all the other pieces that we've gone through.

Jason Hughes:

From a technical perspective, again, on scalability, it's a full MPP scale-out architecture. So if you add 10% more users, you'll add up to 10% more nodes, but you're not adding 50% more; it's a scale-out architecture. In terms of decoupling compute and storage, you can, and we recommend, actually running Dremio on non-storage nodes. You can certainly run it co-located on a Hadoop cluster if you'd like, and you benefit from things like data locality, but you can also scale Dremio fully independently of the storage, or run it on something like Kubernetes, where again it runs fully independently.

Jason Hughes:

Again, on decoupling the logical and the physical and the benefits there: we do that via, A, virtual data sets, and via reflections being matched algebraically rather than tied to a specific data set. Even if you build a data set over on the right and you end up using a completely different one over on the left, we can actually use the reflection from the one on the right; if it matches algebraically, we can.

Jason Hughes:

There's also the connector architecture that we have. You don't really worry about the source; it's the same language, and all that changes is that Dremio manages the translation down to the connector. You get data source abstraction through that same layer.

Jason Hughes:

Again, on being agnostic to these things: it operates on standards, it has the connector architecture, and it can run anywhere, right? You can use any tool, because they generally speak those standards. You can use any data store, because of the connector architecture. And you can run it anywhere, on any cloud, especially with Kubernetes; Dremio can also run on AWS and Azure, and it can be run in a hybrid cloud architecture as well.

Jason Hughes:

And it also future-proofs you, right? You don't need to worry about different tools that come along tomorrow, because of ODBC, JDBC, and REST. Different data stores? They just need to plug into the connector architecture. Different clouds that you want to run on in the future? Great, it doesn't matter where you run.

Jason Hughes:

As for the cost-effective nature: again, you decouple the compute and storage. We have a very efficient engine; we highly optimize it and focus on its efficiency. And you can also scale it out, right? It's an elastic engine, so depending on your demand, you can scale it out; it's not a static thing. And because Dremio isn't actually storing the data, it is very simple and easy to scale out. There are also reflections, which can greatly reduce the workload, because you're not doing all of the work for 100 queries separately; you're doing it once and then serving it. So that helps make it cost effective as well.

Jason Hughes:

So now, from a data consumer's perspective, in this kind of future state: you need to answer a business question, so you need to go find the data. Where is it? Well, I can go search for it, right? Using your keywords, or you can also browse for the data if you want. And how do you know it's the right data set? Well, there's a wiki that has descriptions about it, and you can see its lineage for some information about where it came from. There are the tags on it, and you can also see how many people actually use this data set, right? If only five people have used it, it's probably less trustworthy than the one that's been used 10,000 times or a million times.

Jason Hughes:

But now, say you need more data for your question: more columns, more history, more granularity. All you need to do is change your query. You change your SQL, and you've now got it.

Jason Hughes:

Now, if you actually need to integrate another data set, you can click Join in the UI and Dremio can recommend joins for you, or you can just change your query, right? It's that self-service nature: you can do these things yourself. Then you can go perform your real analysis, and your query completes quickly due to all of the things we went over, right? The scale-out architecture, the execution engine, Gandiva, C3, the Columnar Cloud Cache, as well as the reflections.

Jason Hughes:

Now, if you need more data again, you go back to step one, and all of these things are self-service. This whole thing is still iterative; that hasn't changed, but now each step is much, much faster. That really means that instead of days to weeks, you can now achieve this in minutes, to maybe hours, depending on the analysis.

Jason Hughes:

And certainly, looking at the broader picture, at Dremio we recognize that this is a large piece of self-service interactive analytics, but there are certainly other pieces that fit into the picture overall. That's anywhere from other analysis and operations, like the long-haul ETL that takes 24 hours to do all these transformations, which is certainly in the picture as well and where you're probably using something like Spark, to your streaming and real-time analytics, as well as OLTP workloads.

Jason Hughes:

We recognize that all of that is around in the ecosystem, and we certainly work with a lot of customers where Dremio coexists with these kinds of things, as well as with other kinds of data types, like unstructured data; Dremio is great, but not necessarily built for unstructured data and those kinds of things. Again, though, we have many customers where these things coexist.

Jason Hughes:

And of course, looking to the future, there is no single tool for absolutely everything. Personally, I've seen Dremio satisfy and address more of these requirements than I've seen any other given tool do, but the platform should really use open standards and open formats, so all tools and processes can coexist and you can mix and match them to achieve the business requirements, right?

Jason Hughes:

And that's great, we've been talking about this, but the proof is in the pudding: we've actually done this with a large number of organizations, including some of the largest data organizations out there, right? All of these customers are actually using this approach with Dremio to provide their data consumers self-service interactive analytics on all of their data.

Jason Hughes:

So with that, I wanted to hand it back to Lucio. Over here are certainly some resources for you so that you can go ahead and try Dremio out. Lucio, do you want to talk a bit about that?

Lucio Daza:

Thank you. Yeah, absolutely. And now, I want to encourage everyone who caught that last line to go and get some pudding at the end of this webinar. Thank you, Jason, for saying that the proof was in the pudding.

Jason Hughes:

It sounds pretty good.

Lucio Daza:

Definitely. Thank you so much, Jason. This has been an amazing amount of wonderful information. And before we jump into the Q&A, we still have a couple of minutes, so I want to encourage everyone to go to our deploy page. This is a change that we made a couple of months ago: we no longer have a download page, but we have a deploy page where you can see all the different deployment options that you have for Dremio, depending on what cloud flavor you're working with, or if you want to launch on-prem, or if you want to figure out how you can use Kubernetes to launch a containerized version of Dremio. All the information is going to be there.

Lucio Daza:

Also, please go to our tutorials and resources at dremio.com/tutorialsandresources. We have a ton of information in there, a lot of tutorials that you can read through to learn how to use Dremio with any BI tool that you might be working with. And of course, go and join Dremio University, an online learning platform that we have created for the community. You can register for free, and you can take any of the seven courses that we have there, for free as well. It gives you the opportunity to launch an instance of Dremio, your own private virtual lab with Dremio Enterprise Edition in it, so you can take it out for a spin and follow along with the exercises that we have in there.

Lucio Daza:

And as Jason mentioned a few minutes ago, I host a weekly demo on our site. So please go ahead and register for that if you would like to see, in real life, how easily we tackle all these use cases that Jason was talking to you about today. In addition to that, if you have any questions, please always go to the Dremio community. We are actively looking at it all the time, trying to answer questions. We have a lot of smart people across the community, always posting new challenges in there, helping each other, and answering questions to help the rest of the community.

Lucio Daza:

So, we still have a couple of minutes; I think we have time for one question before everyone goes. Jason, we were talking about performance throughout the presentation, and if you can summarize in the next couple of minutes, the question is, "How exactly did you achieve your performance SLAs using Dremio?" Is there anything you can share with us in that aspect?

Jason Hughes:

Yeah, certainly. So I know we're running up to the top of the hour, and I could probably talk about this for a while. So if anybody wants to follow up on that, I'd be happy to have a conversation about that.

Lucio Daza:

Absolutely.

Jason Hughes:

But in general, it has to do with the architecture of Dremio itself, which is full MPP and scale-out. We also highly optimized Dremio specifically to read from modern data lakes, especially cloud data lakes. The cloud data lakes have some specific traits or attributes, especially around requests: they're high throughput, but they're also high latency. So especially when you're reading columnar formats on there with big data, there's a lot of waiting that you end up doing.

Jason Hughes:

So we actually ended up developing a capability where it's not just general I/O read-ahead, it's column-aware read-ahead. It's not just going to read the next bytes; it's asking, "Is this the next column? Let's try to read the next column and skip ahead to that instead." So there's certainly that, in terms of what we call predictive pipelining: going ahead and requesting that kind of data ahead of time, anticipating what the query and the users will need. Then there's also the Cloud Cache that I mentioned: now we have this data locally, we've read it once, and we don't need to go ask the data lake for it again and wait another round trip. I mean, we've seen anywhere from, of course, low latency up to five to 10 seconds for a request. If you're trying to complete your query in one second, that's obviously not possible.

Jason Hughes:

As well as the execution engine itself based on Apache Arrow, and Gandiva really taking advantage of the machine level and new CPU architectures. And then, certainly, data reflections. That's really where, if it takes you a certain amount of time to scan 10 terabytes of data, well, there's no real way around that at runtime, right? It's just physics. And that's where data reflections can really help you.

Jason Hughes:

And just to provide one more piece of color here: we actually had a customer where we went in, we were talking to them, and they wanted to test out Dremio. We tested it in a POC, and they had a query running on another vendor's so-called modern big data engine that took 28 minutes. We ran the exact same query on Dremio out of the box, just deployed a bunch of nodes, and it took 40 seconds. If 40 seconds is good enough, they're saying, "That's a significant improvement," and we could be done there.

Jason Hughes:

But actually, what they really wanted was to see if we could get to sub-second. So that's when we brought in data reflections. We created those data reflections to actually get that query to be sub-second, down from the 28 minutes. So Dremio really provides a lot of out-of-the-box improvements, as well as these other tools in your tool belt to achieve your performance goals.

Lucio Daza:

Excellent. Thank you so much, Jason. And I know we're past the hour, so I want to thank everyone for being here with us today. I hope you learned a little something. As always, please go ahead and join us for the live weekly demo. And, Jason, thank you so much. I hope everyone has a wonderful week. Bye.

Jason Hughes:

Thanks everybody. Thanks Lucio.