Top 5 Reasons Why You Are Still Asking Your Big Data Small Questions

We surveyed hundreds of data consumers, data architects, and data executives to get a clear picture of where cloud and data lake modernization is, where it is headed, and why it still represents a challenge for many. In this webinar, we will examine: 1) the current usage trends of cloud data lakes; 2) the main challenges organizations face when moving to the cloud; and 3) how these challenges can be addressed successfully with the right technology in place.

Transcript

Lucio Daza Hello everyone. Thank you so much for being here with us today. If you are here for the top five reasons why you are still asking your big data small questions webinar, you are in the right place. We are going to allow a couple more minutes for the rest of the audience to dial in and we will get started shortly. In the meantime, enjoy the music, enjoy the questions and we will be right back. (singing). Once again, thank you so much for being here with us today. If you're here for the top five reasons why you are still asking the small questions to your big data, you are in the right place. I am going to allow one more minute or so for the rest of the audience to dial in and we will get started shortly. Thank you.
Lucio Daza All right everyone, thank you so much for being here with us today. My name is Lucio Daza. I direct technical marketing here at Dremio. And today we have a very exciting presentation prepared for you. Before we begin with the top five reasons why you are still asking your big data small questions, we are going to run you by a couple of housekeeping items. So the first one is how you are going to ask questions. This is your time to understand. This is your time to learn. We are here for you today and we want you to ask any questions that you may have. For that, go ahead and use that Q&A button that you will see in your Zoom interface. Hit the little Q&A button. It will open this canvas right there. Just go ahead and put any question that you want in there and we will definitely follow up through the presentation.
Lucio Daza If you have any technical difficulties, if I'm speaking too low, if I'm speaking too loud or if I'm going too fast or too slow, just go ahead and raise your hand. We will go ahead and address the issue and also you can chat with the rest of the audience using the chat button that you will see there. Or if you want to ask any questions to the rest of the audience, you can as well. I repeat. If you have any questions just go ahead and use the Q&A button. Otherwise we won't be able to get that annoying alert that says, hey, you have a question, go ahead and answer it. All right. If you put it in the chat window, I'll try to get to it but I cannot guarantee that we will see it. If for any reason we cannot or we don't get to the question that you are asking, we will definitely follow up with you via the email that you used to register for this webinar.
Lucio Daza So today, we are going to talk about the top five reasons why you are still asking your big data all these small questions. Why are you being shy with your big data? And for that, I have with me my dear friend Scott Gay. Scott, thank you for being here with us.
Scott Gay Hi Lucio. Pleasure to be here.
Lucio Daza Super, super happy to have Scott here. He is a senior solutions architect here at Dremio and I couldn't think of anyone better suited for this situation. We're going to talk about real use cases that we have seen in the industry. And Scott obviously spends a lot of time with our customers, helping them succeed with our products. So we're going to pick his brain today. We're going to take advantage of all the experience that he has. So let's go ahead and get started. So what are we doing today? In the next 45 minutes or so, we scheduled an hour for this webinar. But I think we're going to have enough to talk about for the next 45 minutes.
Lucio Daza We're going to tackle those five reasons that we were able to identify. And then for each one of these five reasons we're going to have a fireside chat with Scott. And at the end of the presentation we are going to have a Q&A, as I mentioned a few minutes ago. So let's go ahead and start talking. Scott, this is a conversation I was looking forward to having with you. Because this entire webinar is going to be based on the data lake. We have a lot of information, a lot of data that we are producing.
Lucio Daza Any data-driven company is witnessing right now the fact that data is just growing, and this is something that I was able to identify many years ago. It is much cheaper right now to preserve data than to throw it away. So we have some numbers here to share with you. And then I'm going to explain also how it is that we went ahead and identified those five reasons why big data is not being completely used. Data is growing at a very, very rapid rate. And especially we have seen this going on on Amazon. Obviously ADLS and Hadoop as well, but cloud is something that everyone is using right now for many, many reasons. And we also have here a statistic that says that by 2025, and this is something that blows my mind, more than 50% of the data that we are generating will live in cloud data lake storage.
Lucio Daza Now I want to disagree a little bit with this number that I have here because I think at the rate that we're going right now, and I know this is just five years away from now, but I do recall a lot of predictions saying by 2020, something's going to happen. And that happened way before that. So I would dare to say that more than 50% of data is going to live in the cloud before then, if that is not the situation right now. And data lives in the cloud for many reasons. We are taking advantage of the cloud because it is cost effective, it's elastic and Scott and I are going to discuss at the end of the presentation our view on data lakes now versus how a data lake was perceived years ago when the actual data lake term was coined.
Lucio Daza So it's easy to set up, it's cost effective, it's elastic. You have virtually limitless resources on the cloud. But there is one problem. Consuming data from the cloud is very, very difficult and it's very slow. So one of the things that we see a lot is that, and actually I can testify to this issue as a former data consumer, it is very difficult to just go ahead and consume data from your data lake. Go ahead and try to plug your BI tool or your data science tool directly into your data lake and you will find yourself in a little bit of a pickle. So what is going on? The first thing that happens is that you find yourself in a situation like this where if you need to extract data from your data lake and you need to go ahead and make a query, you need to create a model.
Lucio Daza If you're a data-driven organization where you're trying to make decisions that are going to improve your operations, you can think also you're trying to improve your customer satisfaction, sales, you name it. You cannot really afford to spend too much time processing all this data, trying to extract it from your data sources in your data lake and then trying to bring it to your BI users. And we have seen this happening a lot and taking a lot of time. So what is happening? You end up with a brittle, complex and very, very slow process where you have to go ahead and extract the data, then you have to go ahead and move that data to a proprietary and expensive data mart. And on top of that, depending on the BI tool or the data science tool that you're working with, you end up having to create some sort of extract to be able to analyze the data.
Lucio Daza So what is happening? And this is something that I'm going to engage with you in a second, Scott. And now we want to engage with the audience. You are seeing in front of you on your screen a question and we want you to participate on this. The question is, do you have a data lake in production right now? So while we ... and you have two options, yes or no. Just go ahead and put a ... submit your question in or put your answer in there and go ahead and submit the question. And we are going to compare that information in a few seconds. We are going to leave around a minute or so for the audience to answer. So Scott, I want to bring you in here and I want you to tell me, have you seen, and if so what percentage of people, users, customers or colleagues that you have been dealing with, out there in the field have a data lake in production these days?
Scott Gay Yeah, I think that's a great question Lucio. I think there's been a lot of attention on this the last, I mean, call it five years. And we can even go a little further back with HDFS and Hadoop exploding, maybe in their prime six years ago or so. And so coming from there and into the cloud data lake, I see it very often. I'd say the majority of folks that we work with or engage with already have some kind of a data lake, whether it's for exploratory purposes or for really operational use cases. I'd say the level of detail at which folks are using it varies quite a bit. And some of that depends on the tool set they're using or the various things that were available when the whole system was developed.
Lucio Daza Absolutely. No, that is great. So now let's go ahead and see if we can close the poll and see what answers we get. And then I have to confess Scott, this is the first time that I'm doing a poll here with Zoom. So I'm not 100% sure that I'm going to get all the numbers in front of me as I got the question. And if I don't, then that was fun. Thank you everyone for participating. Okay, there you go. Now we close the poll and now we have the answers. I don't know if the audience is being able to see these as well. I hope they can. And I'm very, very excited to see this because now I see that roughly 55% of the attendees, they have a data lake in production and 45% do not
Lucio Daza So let's go ahead and see how these compare to the results that we have. And just to give you a little bit of background on these because I don't want any confusion where these numbers are coming from. And I should have started with these at the beginning of the webinar. We had the opportunity to launch a survey to our user base and especially also in a couple of trade shows that we attended last year. I say a couple and there were a ton of them. And we collected roughly 1500 responses where we were trying to analyze in a very concise survey what were the challenges that people are facing with the cloud data lakes. So this is a first, one of the very first questions that I wanted to ask the audience. And now we have, the numbers are very similar
Lucio Daza So we have here roughly 40% of participants at that point who did have a data lake in production. And a little bit more than 40%, close to 50%, did not have a data lake in production. This also depends on the persona. There is a ton of potential for this data to be utilized. And there is also a good chunk of them who were not sure. And understandably, I mean if you are not the kind of person that deals with the data or the architecture, and the only thing that you are doing is just consuming your data, you really don't know what is happening in the background. Now, what I found very interesting is this number right here: out of those who did not have a data lake in production, roughly 30 to 35% are planning on launching one within the next year or so.
Lucio Daza And I think this supports the statistic that we had at the beginning where we were saying that more than 50% of data is going to be in the cloud by 2025, which is just five years from now. And then again I wish I could set that in stone. That is going to happen before 2025. Second question, this is where I want the audience also to jump in and engage with me. And Scott, this is a question for you and I as well. What do you think are the benefits of a data lake? And this is a key question because this is actually going to trigger the conversation that we are about to have about the five reasons. So there are going to be certain options on your screen. And in a few seconds you're going to see another poll. So go ahead and put your answers in there. And Scott, let me go ahead and ask you, what have you seen customers using data lakes for out there? What do you think is the main benefit?
Scott Gay Yeah, loaded one Lucio. I think we'll get into a lot of the varying benefits and the various levels of detail as we address each of these challenges that you have for us today. But I think one of the main benefits is really having somewhere where you can have visibility across an enterprise, for example. And having a single place for data is one of the major benefits. There are some complications that go with that, that need to be solved, which are the things that we'll touch on today. But I'd say that's probably the main benefit and why it's so attractive, right? It's flexible. It's very scalable and it's really cost effective for the actual storage of data.
Lucio Daza Yeah. And I think that is the main beauty of working with a data lake, especially on the cloud. How flexible it is too. And there are a lot of people who just go ahead and tell you, "Hey, don't forget the cloud is somebody else's computer." And I think that's the beauty of it. It's somebody else's problem. All you have to do is use the service, go ahead and load your data out there. Now security, while this is still your responsibility, cloud vendors have all this awesome technology that is going to help you keep your data secure. You don't have to worry about the resources; you don't have to spend time and money trying to configure your bare-metal stuff in the back of your data server room and all these things.
Lucio Daza Now there is a huge load that is taken away from the shoulders of the user when you go ahead and work with a data lake, especially on the cloud. So I think we closed the poll already. I want to thank everyone for participating in that. So let's go ahead and see if the results pop up in a second. And we have here a big data source for analytics, 38% of the participants. And we also see advanced analytics. There is another option in there, extension of a data warehouse, which is very similar; that's just a bit surprising. And then the two winners would be, I guess, big data source for analytics and data staging for data warehouse. And it is very interesting to see those answers in there.
Lucio Daza And now we are going to go ahead and compare this with what we got from our survey. And there you go. Roughly 60%, or a little more than 60%, of the audience that participated in this survey let us know that advanced analytics is a benefit of the data lake. And then we have data exploration and big data source for analytics as well. And I think those three right here are very ... and Scott, feel free to disagree with me here. I think these three are very, very similar in the fact that, and I mentioned this at the beginning, if you are a data-driven company, you want to run analytics on your data. You want to do advanced analytics. You want to use the data that you have in your data lake as a source for analytics as well.
Lucio Daza And for that you need to explore that data. And I think that's why we are seeing this similarity right here on these three numbers. So, and I have my personal take on this. I think if you are a data-driven company, if you are trying to improve customer success, if you are trying to improve the quality of your service, if you are trying to improve the sales of your product, if you're trying to improve anything right now, your decisions will have to be based on data. Otherwise, you are just going to be a person with an opinion. Nothing better than having data and analyzing it. So I think, and I agree with the audience on these results right here, one of the main benefits of the data lake is advanced analytics: making data-driven decisions on the data that you already have.
Lucio Daza But then Scott, this is where things get very interesting and the rest of the audience, why are you, or not you, don't take this personally. Why is the audience still not using their big data the way they should? So let's go ahead and talk about this for a little bit Scott. And I think this is where the first reason comes in. Complex data pipelines. How many times have you seen a situation where you want to go ahead and analyze data? You collected all this data, IoT, manufacturing, production, sales, you name it, and then you find yourself in a situation like this one. And so in this case, this is a use case that I'd like to share because 47% of respondents commented that they think the complexity of the pipelines when dealing with data is something that is keeping them from asking big questions.
Lucio Daza And I want to clarify this. And fun fact Scott, for those of you who probably already noticed my accent, I am not from around here. And as I was playing with the words for the title of this webinar, I wanted to emphasize the reason why people are not asking questions against their data lakes, their big data. They are collecting a lot of big data and they go ahead and ask a tiny question like, "Hey, just give me whatever you can." And so it was very interesting to come up with that title because what I wanted to emphasize is that there are a lot of people not analyzing their entire data sets. There are a lot of industries not taking advantage of the great amount of value that they have.
Lucio Daza I mean, there is a gold mine in their data lakes and they are not using it. So the first one is this. It's complexity of the data pipelines. And Scott, feel free to chime in here. We have seen many situations where if you want to ... you're a client. Right here you're the client. You're the person who's consuming data. Then you go ahead and make what I like to call a pipeline request. You go ahead and say, "Hey, I need data from ADLS. I need data from S3 and also stuff that I have stored in Hadoop, or similar data sources." But then you find yourself in this situation. Scott, have you seen this happening a lot? Do you have people trying to create transformations to change data types and do validation, cleansing, you name it?
Scott Gay Yeah, absolutely. Right? And I think some of it is necessary, right? There's a lot of business complexity and things like that that need to happen. But really there's also a huge bucket of work that happens there simply because there's not the right way to do these kinds of activities directly in your data lake. And so you bring it out into a warehouse, and the picture you showed before is a great depiction of how it happens: in order to get the right performance for business consumption, you need to do these things and get data somewhere else, because those couple of boxes that you have there with the validation, cleansing and transformation, you can't do them directly on the lake today.
Lucio Daza Exactly. And that is a big issue. And then again, that's what I wanted to emphasize here. Like I mean, wouldn't it be great if you actually just could get rid of that entire process? And I'm not saying you don't need to do ETL. You need to load your data on your data lake somehow. You need to capture your data and you need to definitely clean it. And I would bet some money on ... I don't gamble, but that's just a saying. I would actually say that. And we've heard this a lot: of the time that we as analysts, we as data consumers, spend on analysis, 80% goes to cleaning data and the remaining 20% goes to analyzing this data.
Lucio Daza Now, wouldn't it be great to actually just get rid of all that and invert that number? You spend 20% of your time doing preparation and then 80% gaining insights from your data at the speed of thought. And that way you don't have to turn yourself into a data architect or a data engineer trying to access and clean data and whatnot. So that is one of the insights that I want to share with the audience today. Working with a data lake engine would allow you to get rid of that complexity of the pipeline, which normally is very brittle. It's very complex and very slow. So let's go ahead and jump to reason number two. Lack of data access and discovery. How many times have you been in a situation where you're trying to work with a dataset and you don't know where this data set is located?
Lucio Daza And this is something that I learned a while ago, especially working for companies where data lakes were not a thing. And the risk behind not having an actual way to access your data, to enhance collaboration with your teams on the work that you're doing on your data, is a big problem. I'm going to tell you a story. I used to work at Louisiana State University. And I used to do IT work and all that, and there was an inventory that we needed to do. And normally if I moved on from my inventory and gave it to somebody, and I didn't tell that person where the data set was, they needed to go ahead and create a new one.
Lucio Daza Then now you have two copies of the same data set that I was working with. Two things that the IT department had to worry about securing because it contained information and serial numbers and a bunch of critical stuff that we couldn't share or we didn't want to land in the wrong hands. And then because we had those two copies of data, if somebody was going to grab the project and continue on that, they were going to come across two different copies if they found that data set. And then they had to figure out which one was the right one, which one was going to give them the actual results.
Lucio Daza So then you find yourself in situations where, hey dude, where is my data? How can I do my work if I don't know which data set I am working with? Scott, have you seen this happening? And I know I keep asking you if you have seen this happening and that is the reason why I have you here. You spend a lot more time in the field than I do. And I'm hoping you have a cool story about this. If you don't, that's fine. But have you seen a situation where lack of data access is a major thing or is a roadblock when trying to do analysis, especially on data lakes or big data?
Scott Gay Yeah, absolutely Lucio. Yeah, of course. Right. I think this is something probably everybody in the audience can relate to. And my perspective on this is, if we actually were able to collect and have good metrics around this stat, maybe a bit more detail, right? I don't know if anybody has done this before, but every enterprise just has BI extracts and data and Excel sheets and things like this all over the place. And the reason for it is to help people continue to be productive, right? To drive some kind of insights. It's, well, let's slim the data down. Let's aggregate it to a very high level to reduce the number of records in an Excel document or something. And then those things get shuffled around. They get changed offline or sent via email.
Scott Gay And it's really because folks aren't able to drive real insights when they're thinking about a question with the current tools, right? Whether that's Power BI or Tableau or whatever it is. Some of the architectures that folks have today just don't facilitate that kind of interaction from those BI tools directly to their data. And so you create extracts and do all these kinds of things. And I think that's really the cause of some of this issue. And of course you also have the other side of it, which is just the nature of data marts and how they grow as you add applications to your enterprise and different pieces like that.
Scott Gay I think that's kind of a natural evolution that occurs. But again, tying it back to the data lake, even if those marts are required for various applications or processes, they can all be collected back together in the data lake. And then if we can make that data available to consumers, then we can drive some insight, right? Data can live where it needs to for its particular purpose, but you really have a centralized location to drive some value and insight and really help limit this. I think this is a big inhibitor really to innovation overall within enterprises. I think it is a big bottleneck.
Lucio Daza Exactly. And I want to draw attention to this 46%. And if you pay attention to the numbers that we're showing you here, 46% and I believe 47% was the percentage that we had in the previous question. That means that for these two, reason one, complexity of the pipelines, and also lack of data access, close to half of the respondents answered yes to each one of them. And in this case, not having self-service access would be a big issue. And it goes back to the complexity of like, okay, I want my data. I want to analyze data and I want to analyze it now. I don't want to wait for it. Neither does IT want me to keep sending JIRA tickets for data requests and a bunch of stuff that, as you mentioned Scott, is going to bottleneck this whole thing.
Lucio Daza So 46% of respondents let us know that not having self-service access is an issue. Lack of knowledge of where the data is, is an issue. And also this all in a nutshell compromises collaboration. Because at the end of the day, how do I go ahead and share a data set that I'm working with if I don't have a way to catalog this information and share it with my team or do something that is going to allow us to do agile analytics? Where I can go ahead and say, "Hey, let's you and I work on the same data set and apply our tests to these analyses or whatnot. And then let's go ahead and merge those changes."
Lucio Daza I want to go ahead and make a quick pause because we see a lot of amazing questions that are coming in. One that I want to address, and I see that it was addressed already, I just don't know if everyone can see it, is whether we're doing a demo today. Unfortunately, we are not going to have time for a demo. If you want to join us, next Tuesday at 2:00 PM Eastern time, I am going to deliver a live demo. Next Tuesday, the Tuesday after, and pretty much every Tuesday after that I will deliver a live demo where you can see Dremio in action. All right, so let's go ahead and move on to step number three. Reason number three, security and governance. And I think Scott, I think this is a juicy one because it goes back to if you are in a situation where you have a ton of users creating their own data sets, you lose governance of your data.
Lucio Daza If you have a lot of users creating a lot of copies of the same data set, you are exposed to a risk of a data breach because a lot of people are going to have a lot of data sets all over the place. This is a very, very important data point that I was able to find a few months ago. By 2021, which is just a year from now, the cost of data breaches in the United States alone is going to reach $6 trillion. This is not pocket change. This is a lot of information that is being lost. And on top of that, the complexity of each one of these attacks is increasing by 50%. Bad people are getting smarter, but we need to be even better. So how does this relate to the situation that we're talking about here in terms of data lakes? This is another reason why many, many folks out there are still on the fence about moving data to the cloud and it's completely understandable.
Lucio Daza So we have this broken down now for you. And so we have some challenges here. We have management challenges at the bottom, which have to do with auditing. They have to do with defining the rights of users and also protecting data from users using authentication and administration. And on top of that, obviously, we are aware that some of that stuff can be solved using role-based access control. In addition to that, we have challenges on the data itself, which is what I just mentioned. You have the risk of losing data. You have the risk of not knowing where that data is coming from, what is happening to the data, who is doing what to the data, and compliance. Compliance is huge. If you all remember, I believe it was in 2016 when we started getting hit with the whole GDPR thing and then it just started.
Lucio Daza And now we see it all over the place, we have this accept cookies thing, and compliance is huge. And obviously data integrity and encryption. So you have these challenges here. You have to make sure that you're compliant and you have to make sure that data doesn't land in the wrong hands. And we also have the management challenges, which have to do with the employees, or not employees, but all the folks that are going to work with the data that you have in your data lake. So first, this is something that I want to dig a little bit into. Because I think, I mean this could be a topic on its own. The first thing that you need to do is identify what the requirements are, right? And this is something that a lot of people should be familiar with.
Lucio Daza We have, as I mentioned, the GDPR for Europe. We have HIPAA for medical records. You have CCPA for California residents and you have all these protocols. And all these requirements meet certain criteria and it's very interesting. Each one of these things can be taken care of using authentication, access control, encryption, auditing. Auditing is huge. In this day and age, there is no reason for you not to know who is doing what to your data assets. And on top of that, we also have network security where you can go ahead and limit access and whatnot. So here, now we have, what are your defense options? There are many options out there. And Scott, I think this is one of the things that we were talking about at the beginning about the cloud data lakes. And we talked about how amazing it is that there is cost control on data lakes.
Lucio Daza There is also the flexibility to scale however you want. And on top of that, you have a limitless amount of options to defend and protect your data. Quick pause right here. We have an amazing question as well. And this is something else that I should have addressed at the beginning. Is this presentation being recorded? Yes. The presentation will be recorded. We are going to publish this in about a couple of days. What is today? Thursday. By Monday or Tuesday, we will have this on our library. The recording will be there along with the slide decks for you to take a look. And if you arrived late or you needed to leave early, you can go back in there, put your information and you will have access to this
Lucio Daza So going back to the survey. Over 30% of participants in the survey told us that encryption, compliance, security in general is something that is of a lot of concern when moving to the cloud. Scott, what is your take on this? Have you had situations where, say for example, you're working on a POC, you're working on a deployment or you're just educating people in general, and security comes up? How often does the topic show up as part of the conversation that you're having with the folks that you deal with?
Scott Gay Always. Yeah. It's top of mind for everybody, which is where it should be, I think. But the ways to solve it are interesting, right? Without being able to drive access directly on the data lake and leverage the data there where it lives, it gets much more complex, because now you have to go through that pipelining process that we talked about already, where you're moving data around and creating many more repositories where data is stored and accessed and needs to be controlled. So when you have a workflow like that, either you add a lot of overhead by doing security very robustly and properly, you could say, or those controls just get mitigated by creating different data repositories for every team, right?
Scott Gay And then the enterprise loses visibility across what everyone's doing, or sharing data assets and things like that. So I think there are those two paradigms and neither of them is phenomenal, right? On one hand you have a lot of overhead, and on the other you have a lot of disparate data sources where you're losing some of that visibility, but you do get the data limited to each team or to who should have access. What we've seen people have success with is when you can drive the access directly on the data lake and control governance at that layer. You're able to lower the overhead and still get that visibility across the organization, whether it's sharing assets or being able to audit and control things like that.
Scott Gay So not only from an end user governance perspective, but also from a backend controls and audit perspective as well, right? Having that single place where data's being consumed from, I think, is highly valuable. There are tools like Dremio and other services in the cloud, but really leveraging the power of the lake where the data lives is hugely beneficial to limit both aspects of that.
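As a rough illustration of the single control point Scott describes, here is a minimal Python sketch in which every read goes through one function that applies role-based column masking and writes an audit record. The role names, the `UNMASKED_COLUMNS` policy, and the `read_rows` helper are hypothetical, made up for this example; they are not Dremio's API or any cloud service's API.

```python
from datetime import datetime, timezone

# Hypothetical policy: which columns each role may see unmasked.
UNMASKED_COLUMNS = {
    "analyst": {"region", "amount"},
    "compliance": {"region", "amount", "email"},
}

# In-memory audit trail; a real system would persist this somewhere durable.
audit_log = []


def read_rows(user, role, rows):
    """Return rows with columns masked according to role, and record the read."""
    allowed = UNMASKED_COLUMNS.get(role, set())
    masked = [
        {col: (val if col in allowed else "****") for col, val in row.items()}
        for row in rows
    ]
    # Every read is logged in one place, so "who did what" has a single answer.
    audit_log.append({
        "user": user,
        "role": role,
        "rows": len(rows),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return masked


sales = [{"region": "EU", "amount": 120, "email": "a@example.com"}]
print(read_rows("maria", "analyst", sales))
print(read_rows("omar", "compliance", sales))
print(len(audit_log), "reads recorded")
```

Because both masking and auditing live in the one access path, there is no second copy of the data to secure separately, which is the trade-off discussed above.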
Lucio Daza And I think one of the keys as well is, and this is in general, if you want security in your data lake, just keep it simple. I mean we can obviously make sure that whatever tool you're using to work on your workloads, on your BI workloads or any workload that you have in your data lake, allows you to do encryption, audit and compliance as well. Dynamic masking. As Scott mentioned, you want to make sure that the right person has the right access to do their job. But I think the key is to just keep it simple. Because one thing that we have been able to observe is if you apply a very, very complex security structure in your cloud architecture, in your data lake architecture, what is going to happen is you're going to have a lot of people bypassing those mechanisms so they can go ahead and do their jobs
Lucio Daza And we have seen this happening many times. People just go ahead and make a copy of a data set on a thumb drive and then that thumb drive is not controlled by anyone. And then the thumb drive is lost. And so that is because either A, the person is not familiar with all the security processes that you have in your infrastructure or they are just way too complex, right? So if, and nobody likes complexity, and I think that is the premise of this presentation. Things should be simple in that way. At the same time, keep it robust, but don't make it so impossible that people wouldn't be able to ... or users, they will just find themselves in a situation where they want to go ahead and skip security controls. So let's go ahead and move onto, on to what? Nothing
Lucio Daza I cannot speak today. Let's go ahead and move to the next point right here. Now we're going to talk about cost and scalability. And I think this is one of my favorite ones. Why are we not using, let me just emphasize, why are we not using our big data? And I think this should have been reason number one. And I want to clarify that to the audience: the order is not the prioritization. It doesn't reflect the priority of each one of these cases. But Scott, what do you think of cost and scalability? I personally think one of the reasons is ... And I remember when I first started getting my hands a little bit dirty and kicking the wheels around AWS and GCP and of course Azure, I do remember thinking, I don't know that I want to put my credit card in there because things can get out of control quickly.
Lucio Daza And it would be very, very expensive. I didn't know how much my workload was going to cost, I didn't know what was going to happen. Now here is something that I want to share with you, and this is very, very important. In 2019, 70% of cloud investments went to waste because there was no right-sizing, because sizing of cloud computing resources was not done properly. Let me emphasize this again. 70%. That means that only 30% of what the industry invested in cloud computing and cloud data lakes was actually used for what it was needed for. Now, why is this happening? Very simple. Let's go ahead and do a comparison between ... and we're getting to a point that I'm really excited about.
Lucio Daza Let's do a comparison here between what the data lake was before on-prem and what it is now on the cloud. So as you can see here, there is a lot of modernization happening. We are very, very excited about that. What was happening at the beginning is that if you were working on an on-premises data lake, the solution to improve was to just throw more resources at it. The paradigm was more is more, more is better. You want it to go faster, go ahead and add more memory. You want it to store more, just go ahead and add more disk. So for a solution to be performant, to be something that is actually what you wanted, you needed to add more resources to it.
Lucio Daza But then what happened in a situation where you had say for example, sales, right? You have a seasonal product. This product sells a lot during the winter and it doesn't really sell that well during the summer. Let's talk about skis or anything related to winter sports. So you have a situation where you have to prepare your data lake and your computing infrastructure for all this high demand. So you go ahead and throw all these resources in there. And then at the end what is happening is you go ahead and you over provision the necessary infrastructure for these high peak events. And then during summertime, when the product is not selling that well, where you don't have all those workloads, then you find yourself in a situation where you have all these engines running idle, wasting resources, and you're not really taking advantage of that
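To put a rough number on the seasonal example above, here is a small Python sketch that compares a cluster sized for the winter peak against actual month-by-month demand. The monthly node counts are invented purely for illustration and do not come from the survey.

```python
# Hypothetical demand for a winter-sports retailer, in compute nodes per month
# (January through December). These figures are made up for the example.
monthly_demand = [40, 36, 20, 10, 6, 4, 4, 6, 10, 20, 34, 40]


def fixed_capacity_waste(demand):
    """Share of provisioned node-months left idle when capacity is fixed at the peak."""
    peak = max(demand)
    provisioned = peak * len(demand)  # the peak-sized cluster runs all year
    used = sum(demand)                # what the workloads actually needed
    return (provisioned - used) / provisioned


waste = fixed_capacity_waste(monthly_demand)
print(f"Idle share with a peak-sized cluster: {waste:.0%}")
```

With these assumed numbers, a bit more than half of the provisioned capacity sits idle over the year, which is the kind of gap that elastic sizing is meant to close.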
Lucio Daza So there is a situation here with sizing. Scott, have you seen, in your experience, a situation like this happening before, where tailored sizing of resources has been an issue depending on the workloads? Having too much or too little of resources on the data lake?
Scott Gay I like this topic and I think it kind of hints at what the overall name of the game in cloud is. Right. And I think that's efficiency. And it's about, yeah, we have awesome services now in AWS, Azure, GCP, whatever, for any workload virtually. I can just essentially click the up arrow, right, and just pump more resources at the problem. And yeah, it'll go faster or you'll churn more data or what have you. Really the name of the game is how to most efficiently consume those resources to get the most bang for your buck. Right
Lucio Daza And that is an excellent point that you're bringing there, efficiency. Because I think one of the things that we're focusing on this year and moving forward is not just performance. Because performance is there, right? I mean, as you mentioned, you just click the up arrow, just go ahead and add a stronger, more capable CPU to whatever instance you're running. But how about efficiency? I mean, that is performance. We got it. Performance, check that box. That's fine. But now we want to make sure that, as Minesh here is telling us on the panel, we don't get slapped with a surprise bill. And actually I want to give a shout out to Minesh Patel. That is actually a very, very good comment.
Lucio Daza And if you don't mind, I'm going to share it here. He says, "Another key challenge related to cost in the cloud is visibility into cloud costs in a real time manner. You can get a slap on the hand with a bill ... and then you go into sticker shock." And actually this is true. I'm going to share a personal experience that happened to me in a different life. Scott, I don't know if this ever happened to you. I hope it never does. And also if you can hear my cat in the background, I want to apologize. He's very obnoxious and I have the gain on my microphone all the way down so he doesn't bleed into the audio. But the experience is that I was working on an automatic deployment of one of the products that I was working with. And I didn't know too much about all the little knobs that I needed to switch and flip when I was doing my configuration. And there was one setting and let this serve as a public ...
Lucio Daza So what happened is, and I want the audience to please go to your AWS account and your ADLS account and make sure that this setting is not there. So what happened is I went ahead and did my thing with my EC2 instance and I shut it down at the end of the day. So I was like, okay, I'm done with my test, I'm done with my demo and I'm going to shut it down. Went home. I do remember I took my family down. I live in Florida, so I took my family for a weekend trip out to the beach. And then when I came back on Monday and, as Minesh was saying a few minutes ago, at the end of the month, I got a bill for over $2,000 on an EC2 instance that I thought I had shut down a month ago. So needless to say, that was a very awakening moment.
Lucio Daza It was very scary. So the reason was that there was a protection setting on my EC2 instance that said, if the cluster goes below a certain amount of nodes, just go ahead and relaunch five more, so there is no downtime. So yeah, people, don't let that happen to you. Don't be as stupid as I was. Check everything before you shut down your EC2 instances. So modernization. Reason number four, cost and scalability. That is one of the reasons why people are being shy, or industries are being shy, of using cloud data lakes. But now we're focusing on providing solutions and using solutions that are going to allow us to have better performance and, as Scott said, get more bang for our buck.
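The behavior in that story can be reproduced with a toy model. The Python sketch below is not a real cloud API; `NodeGroup`, `stop_nodes`, and `reconcile` are hypothetical names, and the rule "relaunch up to the minimum" is an assumption about how such a protection setting typically behaves.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    """Toy model of a managed node group with a minimum size (not a real cloud API)."""
    min_nodes: int
    running: int

    def stop_nodes(self, count: int) -> None:
        # Manually stopping nodes only lowers the current count.
        self.running = max(0, self.running - count)

    def reconcile(self) -> int:
        # The controller relaunches nodes whenever the group is below its minimum.
        launched = max(0, self.min_nodes - self.running)
        self.running += launched
        return launched


# Shutting nodes down while the minimum is still 5: they simply come back.
group = NodeGroup(min_nodes=5, running=5)
group.stop_nodes(5)
print("relaunched:", group.reconcile(), "running:", group.running)

# Lowering the minimum first is what actually keeps the group at zero.
group.min_nodes = 0
group.stop_nodes(5)
print("relaunched:", group.reconcile(), "running:", group.running)
```

The point of the sketch is the order of operations: as long as the minimum stays above zero, stopping instances by hand is undone at the next reconcile, and the bill keeps running.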
Lucio Daza We're getting close to the end here folks. Bear with us. Now, reason number five, performance. Scott, I think this is right here, the key. Performance. Now we had a ... I should have said here performance and efficiency, but I think performance is key. I mean, you want to do analysis on your data and that has to happen, I mean, almost at the speed of thought, right? You want to go ahead and if you're capturing sales data and you need to go ahead and create a dashboard for your boss or your entire team for you to create or to provide data driven decisions, that has to happen fast. You cannot afford the 6, 3 weeks that that takes for people to bring data to you. And this is emphasizing the situation that I was explaining at the beginning. Scott, do you want us ... you know this as well. Can you walk us a little bit through what is going on here and how we can solve this problem?
Scott Gay Yeah. I think this is the right time and performance kind of. To me it encompasses the whole picture, right? Where it's not only just query performance. We touched on it a little bit with the cost aspect, which is we can get the performance, right? We can add more nodes, we can add more compute and get the performance we need for whatever workload. Really it's about the overall performance of a solution. And I like to think of it and bucket it in with even organizational performance, which is how do we enable the consumers of this data to get access and make it fast enough where they can really drive these insight driven decision making. I think you lose a lot of that intuitive curiosity about a certain dataset or things like these when you need to maybe go through a request or wait a month of lead time for data to be curated and provisioned and then access and then, oh, I missed a column, so now I need to wait another month
Lucio Daza And that is the worst. When you are trying to create a report and suddenly you go like, "Oh shoot, I meant to slice this by region and I don't have region in my dataset."
Scott Gay Exactly. So I think this picture kind of encompasses that journey that a lot of people, you kind of end up with. And it was out of need, right? It was out of needing to get that performance. And maybe not having a great option directly against the data lake storage. And I think that's kind of where we've ended up the past, I don't know, four years or so. People have been building and a lot of data being pumped to the lake like we showed before. And now it's a matter of well, what's the right solution to really, all the pieces we talked about, secure, gain access, expose that data and then all without needing to do a lot of complex and brutal development just to expose it
Lucio Daza Exactly. Yeah. No, that's a very good point. And then, yeah, at the end it boils down to performance efficiency. And that's why we want to introduce you, and this has been brought to you by Dremio. And the situation right here, Dremio the data lake engine will allow you to get rid of all those complexities. And once you're deployed, you can start analyzing your data right away. And we have a lot of stuff coming up and it's super amazing to see how, and you will see it on the demo as well. Hopefully we can see most of you in the demo that we have prepared for you next Tuesday. And you will see how fast you can just go ahead and connect to your data, start analyzing your data. And even if you need to do transformations through the pipeline on any field, any data cleanup, you can go ahead and do it there as well. And you'll see how easy this is
Lucio Daza So, bottom line, we talked about the five reasons why the industry is not asking big questions to our big data. They are still asking small questions or being shy. Reason number one, we talked about complexity. Reason number two, we talked about data access. Reason number three, we were talking about security. Then we were also talking about performance and efficiency. And here we are. Now you know the reasons. Now you know how to solve this. And there is a question that we got in the. This is a quick pause here. I'm going to talk to my friend Ohenio from Spain.
Lucio Daza All right, going back, my apologies. One question that came in is, is this the definitive list of challenges? And what you're seeing here is no. We tried to condense all these answers into groups that made more sense. But as you can see, data cleansing, data discovery, security, access, scalability, integration, lineage, knowing what is happening to your data, knowing who is doing what, cleaning data, which as I mentioned at the beginning is a big issue, ETL, all these things. They are challenges still out there when using data lakes. So that brings us here, Scott, to the Q&A. Isn't this exciting? Thank you so much everyone for being here with us. Now we're going to go ahead and read a couple of questions, and there was a quick question here: do you mind giving some context with respect to data modeling?
Lucio Daza I am so glad that you asked that. If I'm pronouncing this correct, I believe the name is. So this person is asking if we can share some, what I would call best practices on how to model data on the data lake. If you go to our site, go to our library right now, go to dremio.com/library. You're going to find there a very, very cool white paper that we published on best practices on the semantic layer using Dremio. You are going to find a ton of good information in there on how to do this. And as we come to the top of the hour, I want to invite everyone to visit our tutorials and resources. You will find a ton of information in there on how to use the data lake engine with the technologies that you have.
Lucio Daza Remember, one of the beauties of Dremio is that you don't have to go ahead and teach your users about another BI tool that you want, that you will have to use. They can continue using the tools that they work with. Just connect to Dremio and start doing analysis on their data in a self service manner. On top of that, Dremio university, if you have not joined it, please go ahead and take a look at Dremio University. It's a free online learning platform where you ... And guess what? You can launch your own virtual lab with Dremio enterprise edition in it. Take the wheels, do the exercises, try it out, let us know what you think. There is also a completion certificate that you can go ahead and print. And in addition to that, hey Dremio community is free as well. You can ask any questions that you want in there. Go ahead and post your questions. But that is it for me. And Scott, thank you so much for being here. It has been an absolute pleasure. Anything that you want to add for the kids in the audience?
Scott Gay Not right now. Feel free to send us a note or post in the community and then we're happy to engage over there for anything we didn't get to and we'll reply via email. Thanks for having me Lucio
Lucio Daza Absolutely. Thank you for being here. For all of you out there, thank you so much for participating today in the questions, in the polls. Thank you so much for being here with us today and we hope to see you again. Thank you and bye-bye.