Top 5 Data Industry Predictions for 2020: What Should You Expect?

Transcript

Jason Nadeau:

Welcome. Thank you for joining us today. We're going to have a reasonably brief webinar. I think we can get you out of here in less than 30 minutes. We're going to talk about top five data industry predictions for 2020, and what you should expect.

I'm the speaker. My name is Jason Nadeau, and I'm the VP of Strategy here at Dremio. Before we dive straight into the predictions themselves, I'm going to run through a little bit about the problem Dremio is solving in the marketplace, just to provide some additional context. I think you'll see the connection to some of the predictions that we're making as well, given the challenges that we generally see in the marketplace.

So, as we go, of course we want you to absolutely ask questions along the way, and we'll do our best to get through them all in the course of the webinar. If we don't manage to get through them all, we'll follow up individually afterwards, for sure. The way to ask questions is to use the Zoom control panel, and the very best way is to click the Q and A button that you see over there on the lower right. There is such a thing as chat, but the chat doesn't record the questions very well, and it makes it more difficult for us to answer them. So please, if you have a question, that's great, use the Q and A feature, and then we can much more easily run through them. So with that, let's get into it.

As I mentioned, let's just start with a bit of context setting about what's happening in the data and analytics world. The big thing that we see, and it's not just us, is that this whole analytics stack is a legacy stack that's been around for a long time, decades really. It's built on a bunch of different workarounds, three big ones in particular. The first is that people have been centralizing all of their data in an expensive and proprietary data warehouse-based architecture. You go through a lot of effort to move data into that data warehouse, or data warehouses. It doesn't matter whether those data warehouses are on prem or in the cloud, right? The architecture is still the same.

You've got to go through all this brittle and complex ETL or ELT to massage that data and get it in there. Once it's in, it's proprietary, and you're constrained in terms of how you use it. What's frankly more interesting from an analytics point of view, and more problematic, is that by and large the data's still not performant enough to actually work with in a truly interactive way. So, then there's this whole other set of workarounds built on top of data warehouses: cubes, BI extracts, aggregation tables. Basically, a set of acceleration workarounds to try to make the data fast enough to actually be consumable by the different consumers that want to access it. So, data scientists with their different tool sets, and business intelligence users as well.

So, these different workarounds existed to make the data fast, but as you see looking at this gray pyramid slash triangle going up in the background here, as you move up this analytics stack, you're working with less and less of the data. The cubes and extracts that are being created are very narrow slices of data. While they do provide the performance that those end users are demanding, so often, as I'm sure you all experience, you want to do something else with the data. You need additional data, and that means you need to go back through a whole other project, with a bunch of time passing, to redo the cubes, join in other data, and so on and so forth.

So, lots of restricted capability, and lots of lost time. It's a real pain, and it's painful for everybody in the stack, right? If you're a data consumer, like an analyst or a data scientist, you're spending all this time waiting to get your data, and it's difficult to find. The amount of analysis you can do is severely constrained because you don't have as much data. And if you're one of the people helping to build that infrastructure, it is a challenge every day, right?

People can just get buried building these cubes and extracts, and it's hard to spend time modernizing the data architecture and really making it built for the cloud: fast and open. Of course, there are all these copies floating around now too, which is difficult to govern. Copies of data, all these cubes and extracts, everybody's got their own, they proliferate like crazy, and it really becomes the Wild West.

So, that's sort of the world. What if we could eliminate the workarounds with a whole new architecture that's just built differently? That's what Dremio is doing. We call what we're building a data lake engine. It is really a whole new architecture on top of an open data lake storage environment, purpose-built for truly interactive analytics. So, here you can see, "Hey, it's a much simpler picture. We've got our data lake storage."

For most people that's going to be in the cloud, with AWS S3 or Microsoft's ADLS. It can still be on prem too, with Hadoop for example, but we see a lot of modernization towards cloud data lake storage. Our data lake engine sits on top of that and makes that data available to all these different users and consumers. So what is Dremio, really? It's a data lake query engine, highly accelerated to make things actually work at truly interactive speeds, with a rich semantic layer. That semantic layer helps people to find, share, interact with, and govern their data.

So, last slide on what Dremio is doing here, but really we're helping people like you to unleash your data lake, and only Dremio is delivering instant, interactive response, right? Truly sub-second in many, or even most, cases for the types of data that people are looking at. Highly efficient compute, because compute isn't free in the environment, particularly if it's in the public cloud. You want it to be efficient, you want it to cost as little as possible, because cloud costs are getting up there. And we deliver governed self-service.

So, we make it easy for all of these end users and consumers to find their data in a governed way, so that the right people see the right types of data. We have access control, we have lineage, we have all sorts of things. Then you can see the results, right? We're eliminating cubes, eliminating extracts, eliminating frankly the whole category of data warehouses, which is kind of an interesting thing. We don't need that anymore in this new world.

The outcomes that our customers are seeing are pretty dramatic. A hundred times faster time to insight, right? That's from the time you want to start doing something until the time you can actually do that analysis. Since you're not waiting, that's your 100X, and you're doing more analysis. Then you see the 10X more efficient compute. All of our compute is very, very fast.

That makes it also very, very efficient. So, once those instances in the cloud, for example, are spun up, they don't run for very long. They don't need to, because the work is getting done so much faster. And altogether it's an infinitely simpler and more open environment. So, perhaps a little bit of a stretch there, but it really is a dramatic difference, and you can see it in the picture of the architecture itself, just how much simpler it is.

So, that's Dremio. That's the broader environment that we're working in, and we're big believers in data lakes, and in how data lakes can really transform the environment for analytics. So, given all that, let's start making some predictions. Prediction number one: cloud data warehouses turn out to be a big data detour.

Prediction #1

You can start to see why this would be once you think about that architectural shift we just showed. So, there are many people who have on prem, proprietary enterprise data warehouses, right? These have been around for many, many years: Teradata, Netezza, and many others. Of course, those appliance-based hardware solutions tightly coupled compute and storage together, and that made them pretty darn expensive to scale, right?

So that was a real challenge in general with these on prem data warehouses. So, what has happened? Well, we started to see some additional market entrants with better data warehouse implementations that run in a cloud native way. That's a great thing: those new architectures allow for independent scaling of storage and compute within the structure of a data warehouse. But they're still data warehouses, and they're still proprietary. The data that enterprises have is still locked into those data warehouses, and they're still quite expensive. They're definitely better than the original generation of on prem data warehouses, but they're still data warehouses.

So, what we're seeing is that savvy enterprises are realizing they can avoid the entire cloud data warehouse detour and just go straight to a modern, open cloud data lake environment. It's a big architectural shift, and it is really creating a ton of value for enterprises, because now they're getting true separation of compute and storage. In a cloud data lake world, the data is absolutely kept separate. It's in the data lake storage environment: in S3 if you're in AWS, in ADLS if you're on Microsoft, and that's your data.
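To illustrate what that separation means in practice, here's a minimal sketch in Python: the data sits as open-format Parquet files in S3, and any engine, here just PyArrow on a laptop, can read it directly. The bucket, path, and column names are hypothetical placeholders.

```python
import pyarrow.dataset as ds
import pyarrow.fs as fs

# The storage layer: plain Parquet files in an S3 bucket, owned by you.
s3 = fs.S3FileSystem(region="us-east-1")  # credentials come from the environment
dataset = ds.dataset("my-company-lake/events/2020/", filesystem=s3, format="parquet")

# The compute layer: any engine can scan the same files; nothing is
# locked inside a proprietary warehouse. Read only the columns we need.
table = dataset.to_table(columns=["event_type", "event_ts"])
print(table.num_rows)
```

Swap PyArrow for Spark, Dremio, or any other engine and the files don't move; that's the separation this architecture is after.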

Then, as an enterprise, you get to bring all sorts of different best-of-breed processing engines, tools, and technologies to take advantage of that data, to explore it, to analyze it, and you're not locked in. That kind of freedom is really, really powerful, and we're seeing savvy enterprises move to this quickly. So, on this slide, not only do we predict that cloud data warehouses turn out to be a big data detour, and that enterprises ultimately just skip over them or use them as a temporary placeholder. We also predict that 75% of the Global 2000 will have a cloud data lake in production or in pilot in 2020. So, that's prediction number one.

Prediction number two: enterprises move from raw performance to price performance. This is really interesting. What's the driver? The driver is the cost of the cloud; the cloud is not cheap. We also know that people are moving to it very, very quickly. Just about every enterprise has got a cloud migration program in place, and 84% of users are realizing that, wow, the cost of moving to the cloud can get pretty large. So, they're looking to implement some sort of cost controls.

Prediction #2

Now, at the same time, the cloud is providing a whole new set of elastic capabilities. By elastic, I mean things can spin up, but they can also spin down. That's a really, really powerful capability that on-premise environments just don't have. In the on prem world, you buy and size your infrastructure up front, and you typically have to do so to support peak loads. Then you've got an underutilization problem, or an overprovisioning problem is the way to think about it. The infrastructure is relatively fixed, and in that world it can be very difficult to reclaim any of that free capacity, if you will, that free utilization.

So, the focus has been, "Hey, let's focus on performance, better performance. Let's compare solution A versus solution B and pick whichever one's faster, especially for things that are performance intensive." Once we go to the cloud, the whole equation changes. The elastic nature of the cloud allows us to say, "Hey, not only can we throw compute at something and make it fast, but that compute costs more." So, my costs start to scale with my performance. Then the question becomes, "Well, how do costs scale with performance?" That is the price-performance curve. So, in this new world, and we're seeing this all the time, the better solution is the one that delivers performance at a lower cost.

So, if you look at this curve here, follow the solution B line to the tip of that solution B arrow. In other words, at that point on the curve, at that level of performance, it can deliver the same performance at a much lower cost than solution A, right? It's a lower curve. It's delivering performance at a lower cost everywhere along the curve. That is super exciting.

So, if you're an enterprise, that's what you want to be looking at: not just absolute raw performance, but what the price-performance curves look like, and how different solutions scale their performance along with cost. It's really interesting; we think that in many cases performance can almost be arbitrary. You can get as much performance as you want. The question is at what cost. So, that's why we predict that enterprises in 2020 are going to move from looking at raw performance to price performance, particularly driven by their adoption of cloud-based solutions.
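To make that concrete, here's a toy calculation in Python. All the throughput and pricing numbers are made up; the point is that in an elastic cloud you can hit almost any performance target by adding nodes, so the real comparison is what hitting that target costs per hour.

```python
import math

def cost_per_hour(target_qph: float, node_qph: float, node_price: float) -> float:
    """Nodes needed to sustain target_qph queries/hour, times the hourly node price."""
    return math.ceil(target_qph / node_qph) * node_price

target = 10_000  # queries per hour we need (a made-up workload)
a = cost_per_hour(target, node_qph=500, node_price=4.00)  # hypothetical solution A
b = cost_per_hour(target, node_qph=800, node_price=3.00)  # hypothetical solution B
print(f"Same performance target: A costs ${a:.2f}/hr, B costs ${b:.2f}/hr")
# -> Same performance target: A costs $80.00/hr, B costs $39.00/hr
```

Both solutions hit the performance target; B just sits on a lower price-performance curve.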

Prediction #3

So, prediction number three: IoT data finally becomes queryable. IoT data is exploding, and that's part of the big challenge: there's just so much data. Gartner predicts that by 2025 we'll have 25 billion connected IoT devices. I think coming out of 2019 the number is maybe closer to 19 or 20 billion. It's still growing fast, but that is an unbelievably large number. All of these devices are already generating about 500 zettabytes per year, and that's growing exponentially, according to Cisco. So, the first question is, how big is a zettabyte? How many people have even heard of zettabytes?

So, a zettabyte is a lot of data. Think about petabytes; people probably know what petabytes are. A thousand petabytes is an exabyte, and a thousand exabytes is a zettabyte. So, 500 zettabytes is a lot of data. Now, here's a question for people. Of course, you have Google, you can go find this out, but you might wonder what comes after a zettabyte. It turns out the answer is a yottabyte. So, a thousand zettabytes is a yottabyte.

It sounds a little bit like Yoda, so we wonder what comes after a yottabyte. Is it a Vader byte? We don't know. It's not defined. It will probably be a little while before we have enough data that we need a new term, but zettabytes are a lot of data, and that's really difficult to query. That data is landing in data lake storage environments, generally speaking; that's the best home for this type of semi-structured data. So, it's not going into data warehouses. The data is there, but the volume and the variety of it are really challenging to process and query.
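As a quick sanity check on those units, here's the ladder from the talk as plain arithmetic, using decimal units with a factor of 1,000 per step:

```python
PB_PER_EB = 1_000                   # 1 exabyte   = 1,000 petabytes
EB_PER_ZB = 1_000                   # 1 zettabyte = 1,000 exabytes
PB_PER_ZB = PB_PER_EB * EB_PER_ZB   # 1,000,000 petabytes per zettabyte

iot_zb_per_year = 500               # the ~500 ZB/year figure cited above
print(f"{iot_zb_per_year} ZB/year = {iot_zb_per_year * PB_PER_ZB:,} PB/year")
# -> 500 ZB/year = 500,000,000 PB/year
```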

So, we're excited to see that our partners over at Software AG have built a purpose-built solution for IoT data and analyzing IoT data. They call it the Cumulocity IoT DataHub. With this solution, we're super excited to see what organizations of all different sizes that have IoT data are going to be able to do with that data. Because now, for the first time, it really is queryable; it can be explored interactively. You can join that IoT data with other operational data that you have and really get some powerful insights and ideas, which is why you see the light bulb over there on the right-hand side.

So, data scientists building fancier models with better predictive power; business intelligence folks wanting to augment their reports and their dashboards, and to do ad hoc exploration as well; machine learning; all sorts of really amazing stuff is now finally going to become possible on human timescales, right? Almost in real time, even. Really interactive, truly queryable. So we're excited about that one.

Prediction #4

Fourth prediction: the rise of data microservices for bulk analytics. So, what's a microservice? What we've seen in general is this really interesting trend in IT architecture where folks are moving from monolithic stacks to distributed stacks. Microservices really take that all the way, right? They disaggregate, if you will, the different components of an application into a bunch of very small components called microservices, and these things are all interacting and working together to deliver the functionality that we need. But the trouble is that the transports that allow the communication to occur between these different microservices are fundamentally slow and serialized. It is difficult to move data between microservices, and between microservices and the operational data stores that those microservice compute farms are pulling data from. In practical terms, a microservice can pull in on the order of thousands of records at a time. That's not a lot of data.

Not when you think about the types of analytics that data scientists in particular want to get their hands on. The end result is that architectures still tend to stay fairly monolithic and tightly coupled. That's a problem, because when architectures are tightly coupled they're slow to evolve. The benefit of microservices is that the individual piece parts can update and evolve, so the whole environment can innovate much more quickly, and that just delivers a lot more value for everybody.

The slow communication is a real problem, but here's the good news: Arrow Flight, another Apache project. It turns out Dremio's behind this, not too surprisingly. We're not the only ones, but we co-created it. Arrow Flight is really cracking open the communications path between these microservices and the rest of the environment. Arrow Flight is going to allow for billions of records in a massively parallel new architecture, so that ultimately we can move to very loosely coupled architectures that are fast to evolve.

So, a term we're coining here is this move from operational microservices, right? Small amounts of data, very operationally focused, the kind of transactional stuff that typically exists in your database or perhaps data warehouse. To data microservices, with this ability to pull in really bulk data operations: large amounts of data, worked with in near real time. That's going to unlock a ton of innovation, these loosely coupled architectures that are fast to evolve. So, we're really excited for the potential of Arrow Flight to unlock this for all of us. We think this is going to be a big thing in 2020.
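To give a feel for what a data microservice consumer might look like, here's a minimal sketch using the pyarrow.flight module in Python. The endpoint, port, and query are hypothetical placeholders, and a real deployment would add authentication; the point is that results stream back as columnar Arrow record batches instead of row-by-row serialized payloads.

```python
import pyarrow.flight as flight

# Connect to a hypothetical Flight-enabled data service.
client = flight.FlightClient("grpc://data-service:32010")

# Describe what we want; the server plans it and returns endpoints.
descriptor = flight.FlightDescriptor.for_command(
    b"SELECT * FROM sensors.readings WHERE day = '2020-01-15'"
)
info = client.get_flight_info(descriptor)

# Stream the result as Arrow record batches: millions of rows arrive
# columnar, with no per-row serialization or deserialization.
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table.num_rows)
```

In a larger deployment, each endpoint in `info.endpoints` can be read in parallel, which is where the massively parallel part comes in.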

Prediction #5

Our last prediction: Apache Arrow becomes the fastest project to reach 10 million monthly downloads. If you've been following data science, you'll definitely have heard of Apache Arrow, another Apache project that Dremio co-created. Really, this is about columnar, in-memory data processing and sharing. Analytics data is all about columnar formats. Arrow made this something that could be used by many, many different projects, and it's already powering dozens of open source and commercial technologies, including obviously Dremio, but also Spark, and you see the list here: Dask, TensorFlow, and so on, across many different programming languages.
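For a small taste of what that columnar, in-memory sharing looks like, here's a minimal PyArrow example; the table contents are made up. The same Arrow table can be handed to pandas without converting row by row.

```python
import pyarrow as pa

# Build an Arrow table: each column is a contiguous, typed buffer,
# which is exactly the shape vectorized analytics engines want.
table = pa.table({
    "device_id": pa.array([101, 102, 103]),
    "temp_c": pa.array([21.5, 22.1, 19.8]),
})

# Hand the same in-memory representation to pandas with minimal copying.
df = table.to_pandas()
print(df["temp_c"].mean())
```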

So, the adoption and the downloads are really spectacular. You can see we're already over 6 million, or I should say Apache Arrow is already over 6 million downloads a month. We're predicting that around the middle of 2020 it's going to cross the 10 million mark. That'll be about four years from the time it was introduced, which is just super fast. Think about some of the other big, popular data science projects out there. Pandas is currently at 17 million downloads a month, but it took about five years to get to that point, by our estimation. Some other projects you've probably heard of: TensorFlow, 5 million downloads a month.

So, it hasn't even crossed the 10 million line. Jupyter, PySpark: still very popular, lots of downloads. It's really, really great to see Apache Arrow take off so much, because this ability to share data in memory, fully columnar, is a much better representation. It just adds so much more performance to all these different applications and technologies, including Dremio, so that everything works together and we avoid all the serialization and deserialization that would typically have had to occur, the extra copies, the overhead, and so on. That makes everything a hundred times to even a thousand times faster.

So, those are our predictions. Then we've got a little bit of a bonus here for folks too. We've been doing some primary research, and I just wanted to share a couple of preview survey questions that give you a sense of what's happening out there in the world of data lakes, and what the challenges are.

So, we surveyed over 1,100 people across a bunch of different industries. You can see here that finance, healthcare, and IT really pop out on that list there. Most of the organizations that we surveyed were over a thousand employees, but it was a pretty good mix. So, when we asked folks what they consider the benefits of a data lake to be, a couple of things pop out. You can see over there on the right, data lakes being used as a complement to data warehouses, and that's no surprise, right? People have data warehouses, and they're looking to modernize and offload things from those data warehouses, so data lakes become a useful way to do that. What's more interesting is the data lakes themselves being used for advanced analytics and data exploration.

That's really what excites us, and really what we see customers doing, as I mentioned earlier. You don't need a data warehouse to do that; you can do it directly on a data lake. Then, in terms of challenges, these are also not a surprise given that original architecture and what people are trying to do as they begin to move over to it. Data discovery: there's a lot of data out there in that data lake, so helping people find that data becomes really important. Data cleansing: again, a wide variety of data is coming in, and it does need to be cleaned up. You can't avoid that, whether it's a data warehouse or a data lake, so that will be an ongoing challenge; it's kind of part and parcel with data. But data access is another one, right?

If you're a data consumer and you know there's data in the data lake that you want, getting actual access to it is a real challenge and takes a lot of time. IT needs to get involved, data engineering teams need to build that stuff, and that slows people down, right? That's a bunch of the challenges we talked about earlier. So, here's a bit of a double click on that. If you depend on IT to access your data, how long do you have to wait? Looking at this, the quick math is that for 70% of respondents it takes a day or more to get access to their data.

For 43% it takes a week or more. We routinely see people telling us it takes multiple weeks to get access to the data that they need for their analysis. In today's hyper-competitive economy, that's just way too long; people need more. And frankly, if you're on the other side, if you're in IT and you're having to do all that work, getting buried by all those requests to build data pipelines, and potentially more cubes and various other things, that's not a lot of fun either. Really, those folks want to be spending their time on strategic activities, modernizing the data infrastructure.

So, that brings us to a close. I just wanted to say that was a quick survey, a quick preview I should say, of a pretty rich survey that we put together, and we'll be sharing the details in an upcoming webinar. So, definitely stay tuned for that one. I can't wait to share the full set of the research with everybody. With that, a couple of closing words, and then we'll take some Q and A. If you want to try Dremio, go do it today, right? You can deploy Dremio in the cloud of your choice, or you can run it on prem. You can go grab the software and start playing with it today; the community edition is free. We also have Dremio University, which is where you can go and really learn all about how Dremio works and how to use it.

It's really quite a rich place to go learn as well. Once you've actually gotten up and running, there's a whole community there for you to learn from, share best practices with, ask questions, and so on. It's a great, vibrant community. So, these resources are there for you.

Okay. So, with that, that's the end of the prepared remarks. Let's go to some Q and A. The first question that we see, again using the Q and A tab: "Does Dremio do anything to help people migrate from an enterprise data warehouse to a cloud data lake?" Absolutely, we do, and not just from a professional services point of view, although we do have services as an organization. That semantic layer that Dremio has essentially allows for virtualization: physical datasets and virtual datasets, where the virtual datasets are what end consumers actually connect to, and the physical datasets could be in the enterprise data warehouse, and then later in the data lake.

So, that semantic layer allows organizations and enterprises to modernize under the covers, move their data from on prem to the cloud, and change where the physical datasets are, while keeping the virtual datasets, and the connections between those virtual datasets and the end consumers, BI users, data scientists, and so on, intact. The move becomes much more transparent and easy to do, and enterprises can do it at their own pace. So, absolutely we can help. Frankly, it's one of the main reasons why people buy and use Dremio: they want to accelerate their analytics, but they also want to modernize their data infrastructure.
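As a rough sketch of how that looks from a consumer's point of view, imagine a script querying a virtual dataset over ODBC. The DSN and the dataset name here are hypothetical; the point is that the query targets a stable virtual name, so the physical data behind it can move from the warehouse to the lake without breaking the query.

```python
import pyodbc

# Connect through an ODBC DSN configured for the Dremio driver (hypothetical DSN).
conn = pyodbc.connect("DSN=Dremio")
cur = conn.cursor()

# Consumers only ever address the virtual dataset by name...
cur.execute("SELECT region, SUM(amount) FROM sales.curated_orders GROUP BY region")

# ...so the physical dataset behind sales.curated_orders can be repointed
# from a warehouse table to Parquet files in S3/ADLS underneath.
for row in cur.fetchall():
    print(row)
```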

Okay, next question. Let's see. Oh, for anybody who wants to know if they can get a copy of the video of the webinar: we will be posting the webinar, so absolutely you'll be able to go and grab that as well. Another question here: "Is it possible to get a reference on pricing?" So, the community edition is absolutely free; you can go and use that today. But once you've played with that and gotten a sense for what the capabilities are, the best thing to do is to reach out, get a demo, see what the enterprise features look like, and then talk to our sales team. We don't just give out pricing directly, so definitely reach out to our sales team, and they'll walk you through it.

Here's another question: "Is Arrow Flight helping to separate compute and storage?" Yes, and I think I touched on that a little earlier. Monolithic architectures obviously keep things together; that's just kind of what you have to do. Once you can actually start having this disaggregated approach, and Arrow Flight is absolutely helping with that, we can start to say, "Okay, well, let's put storage here, let's put compute there. We can still move data quickly." That creates this real separation, and we're big believers in true separation of compute and storage. It's why we built our data lake engine to sit on top of data lake storage. So, it's bring-your-own-storage with Dremio, right? There's no lock-in with us like there is with a proprietary data warehouse. We're big fans of separating compute and storage; Arrow helps with that, and Arrow Flight absolutely helps with that as well. It's one of the big reasons why we're behind those Apache projects.

So, then another question, and I think this is the last one we have, and then we're done. The question is, "How does Dremio give access to data outside of a data lake?" If I understand the question: your data lake is not the only source of data that you're going to have in your environment, for sure. Dremio connects to, and is optimized for accelerating queries on, data lakes, but it has connectors into a number of other data sources as well. That's really important, so that you can join in maybe some other operational databases that you've got, whether they're traditional relational databases or others, MongoDB, for example, and other NoSQL databases. Dremio can absolutely reach in, talk to those databases, and do joins.

We've got a bunch of advanced relational pushdown capabilities, for example, to make sure that that's fast. The other thing, of course, is that having the ability to talk to multiple different data sources is useful for doing that data modernization. We can help connect your end users, your data scientists, and whatnot to data that's currently in, for example, a data warehouse, whether an on prem data warehouse or a cloud data warehouse, and then help facilitate that migration over time as the data gets moved into the data lake.
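As a hypothetical example of that kind of cross-source query, here's what a join between a data lake table and a MongoDB source might look like; all the source and table names are made up, and the date filter is the sort of predicate that's a candidate for pushdown to the source.

```python
# Submitted through the same kind of ODBC connection as the earlier sketch:
# cur.execute(query); rows = cur.fetchall()
query = """
SELECT o.order_id, o.amount, c.segment
FROM   lake.orders AS o            -- Parquet files in S3/ADLS
JOIN   mongo.crm.customers AS c    -- NoSQL operational store
  ON   o.customer_id = c.customer_id
WHERE  o.order_date >= '2020-01-01'
"""
```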

So, I do want to be clear: there's a difference between data virtualization and what Dremio is doing. We're not data virtualization. There are other virtualization solutions, federation solutions, if that's really what you want to look at. Really, the difference is that virtualization is about a sort of end state, if you will, of having lots of distributed stuff. We don't think that's the right architecture. We think the right architecture really puts the vast majority of the data into the data lake.

There are lots of different distributed sources within it, of course, but that's where most of the data should be, because it's just a fantastic repository. It's inexpensive, it's highly durable, it's very open, it's very elastic, and it can handle these huge volumes. Really, people should be modernizing towards a cloud data lake-centric approach. But of course you're still going to have other sources around it, and Dremio will help connect to those as needed.