Avoiding the Architecture Undertow: Building Lightning-Fast Queries with Blazing Fast Object Storage

Session Abstract

Organizations are increasingly leveraging analytics to turn data into insights for competitive advantage. However, the architectural considerations for platforms that support large data lake deployments of analytics applications change significantly as these efforts mature beyond small-scale to large-scale environments. One highly successful trend is the adoption of object storage in analytics, allowing data teams to analyze data anywhere and everywhere. In this session we will explore how to build out an enterprise-scale data lake for lightning-fast queries with blazing fast object storage.

Video Transcript

Brian:    With that, I’d like to welcome our next speaker, Thomas Henson. He is the senior business development manager, AI analytics at Dell. Thomas, over to you.

Thomas Henson:    Thanks, Brian. Excited to be here and wow! It was great to have entrance music. It felt like I was back in person, we have the intro music, so I love that. I’m going to share my screen. I’ve got a [00:00:30] presentation here that I wanted to share with everyone. All right, so perfect. I think everybody can see that. Brian, just let me know if you’re not able to see that.

But as Brian said, my name is Thomas Henson. I work for Dell Technologies. I’m what we call a business development manager, specific to our global business for AI and analytics. What I do is I spend about half my time talking with customers around their architectures and how they’re building out from [00:01:00] their analytics and AI platforms. I spend the other time with our partners like Dremio, talking about roadmap integrating and building out solutions that can help our customers. It’s great to be able to partner with Dremio in different solutions like that.

The session today is Avoiding the Architecture Undertow. I thought I’d have a little fun and a little play on most folks right now are probably in the middle, or maybe they’ve already done their vacations this summer. A lot of folks go to the beach [00:01:30] and so I thought there’s a lot of similarities in what we see from an undertow perspective at the beach, to building our own infrastructure. Let’s jump right in.

If you’re not familiar, like I said, everybody is probably familiar with an undertow, it’s a current of water below the surface. It’s below the surface, moving in a different direction from that surface current. Now, if you’ve ever been to the beach and the first time that you were there, you were probably like, “Well, this is pretty weird because a wave just almost knocked me down and now [00:02:00] it’s trying to pull me under, but everything seems to be moving in a different direction.” That current, it can be hard because you’re looking at it, it’s not on the surface, it’s happening below. Everything looks like it’s going one way, but it’s actually going the other way.

Now, when we talk about artificial intelligence and analytics, what are some of those that we have? As your organization has more success, expanded use cases. A couple of other things that they’re looking for, you’re expanding your data teams. These are all great momentum and great push [00:02:30] that are going one way, but on the other side, you’ve got your architecture and it’s like, “Well, how do we sustain this? How do we support expanding data teams? How do we build those new analytic models? How do we search our data?” When I was thinking about this session, I think it’s a perfect example from what we see with customers. Their architecture undertow, those are the challenges that actually come when your team has success.

If we were in the audience with you right now, I’d probably ask everybody, if we were live, [00:03:00] how many people, the business is doing good, they’re excited about what we’re doing from an analytics perspective. But man, it just creates more and more challenges. It’s a good problem to have, but you want to make sure that you’re starting with that base so that you can go through and scale as your team succeeds.

Now, what are some of these things that the teams are being asked for? Democratization of data, fast time to results for data, governance and [00:03:30] compliance. These are all the things that we want to accomplish. When we think about it from data democratization, we want more and more folks, and business units, and lines of business within our organization to take that data first approach. To understand and to give them access that they need. We still have to have security and governance, but to also be able to do it in a timely fashion.

Classic example, if you think about it, time to results. I worked on a project in the past and [00:04:00] one of my first big data projects many years ago was we were analyzing log files that were coming in and our whole thing was we were trying to find and prevent cybersecurity insider threats before they happened. Now, the challenge was we could get all the data in, but by the time that we analyzed it, if the data was already out the door, if it was months later, maybe it was an employee that was leaving, had already left, then we were just reporting on what had happened. So time to value. [00:04:30] I’m sure everybody understands time to market and they’ve gotten applications or digital coupons on their phone. If you don’t get it just in time, you’re probably not going to use it. If I’m getting that pizza coupon at midnight, for somebody like me, that’s not going to work. I’m normally in bed by nine o’clock. You want to make sure that those teams not only have access to the data that they need, but they can react to it quickly.

And it gets harder. The challenge of the data that continues to come in, it’s a challenge. I was talking with one government [00:05:00] customer and it was just amazing how much data they were coming in. They had 10 to 40 terabytes of data that would come in on a daily or even from a weekly perspective at times. That data just continues to go, so how do you make sure that you have a system in place that can hold onto this and the architecture built? But that you can also hand out and rapidly give access to the teams that need to have it. And also do it in an area that fits our budget. Budgets are bigger than they’ve ever been before, but they’re [00:05:30] flat. They are not increasing by the amount of data that’s increasing. I’ll show some stats on that. It’s pretty interesting when we talk to our customers.

Now, I stole these slides here from a presentation that Brett Roberts from Dremio and I did with BrightTALK about a month and a half ago. I thought it was good to really show what we’re talking about. He coined this, the Pyramid of Doom. You have your data scientists and you have your business intelligence users and what they’re trying to do is, hey, [00:06:00] they’re trying to make their dashboards support their users, find insights in a timely fashion. If you have the bottom here, this is where my role is, on the unstructured data team, where we exist to build that architecture. You have your data lake, but if you continually have that data coming in, how quickly can you get it to the business users? How quickly can you get it to the data teams? You have some kind of process because a majority of that data is going to come in as unstructured. So you’re going to add some kind of structure to it, or maybe even filter out certain data.

Then [00:06:30] if you have these different formats, you have one system for one data mart that has a specific format, you have another business or another line of users that are maybe using a different data mart, or maybe they’re shared, but you have all these disparate systems over here and you’re constantly in this time. Every step along the way you’re increasing time. Going back to time to value, I think there’s some stats out there right now from IDC that are saying by 2025, I think 40% of all data analytics will be [00:07:00] streaming. So it’s only getting faster from that perspective.

That’s why it’s important for us to partner with Dremio to build out that no copy architecture, to consolidate the stack to say, “Hey, as your data comes in, this gives you the ability to add that structure, be able to query it, to support those dashboards, support those teams, support your business users.” Turning weeks or months into hours or days and giving more access. As this continues to scale and you add at the top more of those users, [00:07:30] you want to make sure that you have that ability to.

Now, I want to take a step back and talk a little bit around really unstructured data and some of the things that we’re seeing there. As we look at it, if you’re not familiar, I know most folks are, with unstructured data, there’s so much potential in it and there’s so much data that’s created out there with it. It could be your home directory. This presentation in PowerPoint, anything that you have, your Excel sheets or anything like that in your home directories, that’s unstructured data. Think about your data analytics, [00:08:00] raw files, telemetry data. I worked a good bit in log files and log aggregation because they’re semi-structured, that’s unstructured data.

Could exist in your file shares, could be telemetry data that’s also coming from the Internet of Things. We see the boom here and it’s been talked about for a while, but there’s just so many opportunities to capture data at the edge and IoT is one of the areas that’s pushing it. You also have the archive data. This session is being recorded, other sessions at the conference this week are being [00:08:30] recorded as well, too. All those video files, those are unstructured data as well, too.

If we look out in the audience today, and I know at the conference we have folks from every different industry. Data analytics is not, hey, it’s only in financial services. It’s in energy, manufacturing, chip design, life sciences, it’s impacting everywhere. Even some folks on my team that specifically do media entertainment, it’s so interesting to see what’s going on from an AI and analytics perspective. The whole reason that we hold onto the data and [00:09:00] we capture these insights from unstructured data, is we want to drive that innovation. We want to get it to market faster and it’s really your differentiator. If you’re an organization, what’s the best way to find more information about your product, is in your own data. That’s what can tell you how to accelerate, how to go fast, how to make things better, how to bring more efficiencies within that data.

But for the users that I work with a ton, this is a challenge and they’re working into a new era. I’ll say I’ve been a part of Dell Technologies [00:09:30] since 2015. It’s interesting to see how much object has really grown. We have a stat from Gartner, around 80% of the world’s data is unstructured. If you look at all the data, it’s 80% of that, and about every 18 months to two years, data continues to double and unstructured data is a big portion of that.
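To put the doubling rate mentioned above in concrete terms, here is a small sketch of the arithmetic. The six-year horizon and the function name `growth_factor` are illustrative assumptions, not figures from the talk.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Return how many times larger a data set becomes after `years`
    if it doubles once every `doubling_period_years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling period must be positive")
    return 2 ** (years / doubling_period_years)


# Hypothetical six-year horizon, using the two ends of the quoted range.
fast = growth_factor(6, 1.5)  # doubling every 18 months
slow = growth_factor(6, 2.0)  # doubling every two years
print(f"18-month doubling: {fast:.0f}x, two-year doubling: {slow:.0f}x")
```

Under those assumptions, a storage platform would need to hold somewhere between 8 and 16 times today's data after six years, which is the kind of gap a flat budget has to absorb.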

But now we’re seeing a rapid growth and adoption in object. What we’re seeing is by 2023, IDC is saying that growth [00:10:00] will be 300 X over file versus object. It’s huge. A couple of reasons for that. A recent survey with IT leaders and business decision makers, 90% of all of them think that all new enterprise applications are going to be cloud native and that’s just by 2022, which happens to be next year. Time flies. The other thing is the support for cloud. Being able to do hybrid cloud, being able to support full cloud. We’ve had customers that have gone and went [00:10:30] all in on the cloud and everything moved to the cloud and they’ve come back and repatriated. Well, when they came back to be able to pull some of those workloads back, they had a growing need for object. I think as we start to see, object protocol is huge, whether you’re building out your own private data center or whether you’re supporting hybrid applications in the cloud.

Now, you may be asking, if you’re not familiar with object, is why the need for object? Why do we see it in cloud? Why do we see it in our private cloud adoption [00:11:00] and why is it specifically for AI and analytics? Well, the first one is scalability. Being a part of the AI and analytics community for so long, I may be biased to this, but I think that if you really look at what’s driving those numbers from IDC for data and unstructured data, it’s AI analytics. So you have to be able to scale it. It’s funny, I think maybe it was back in 2015, maybe 2016, I was at Hadoop Summit and I remember them talking about, hey, man, relatively large size of data. It was [00:11:30] anywhere from one to 10 terabytes that we were trying to analyze. We don’t measure in just that anymore. We’re constantly seeing the uptick when we talk about petabytes and measuring in hundreds of terabytes.

In fact, if you look at some of the stats, the predictions are that we’ll have 175 zettabytes by 2025. Every time one of those predictions comes out and you go through the cycle, you’ll see that those end up being underestimated. That’s been the trend and that’s what I’m going to say. Not that I’m [00:12:00] going to disagree with IDC or Gartner, but I definitely think that you’ll see an uptick in that.

Then the simplicity. Not only, hey, okay, I’m supporting 10 petabytes here, you have to have that simplicity to be able to manage it. You have to be able to support it. If you look at object and file and other ways of capturing data, you’re constantly in the shuffle where, hey, you have a bucket, yet you have a system that fills up and then you have to spread it out or separate in block systems. You may have to build out different ones and you run into [00:12:30] issues from that perspective. Then the third component that’s really huge, especially when we start to look at it, is metadata. Support for metadata. Each object that comes in, you have metadata that’s associated with that, that allows for you to search it and tag it and do different things that are really not available in other protocols. Then the API support. Whether we’re talking traditional applications or modern applications, being able to support those dashboards. Not only can you do your analytics in place on it, but you can also support those dashboards and applications with robust APIs within it too.

[00:13:00] Quick little background on what object is. If you think of it, first it’s files that’ll come in and you’ll have objects, each with custom metadata on it that you’re able to search, and robust protocols. Being able to support REST and [inaudible 00:13:14] over HTTP and being able to support that out with our applications. Really that metadata support gives you the ability to do a lot of customizations that go back to those protocols and APIs.
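The object model described above (opaque data plus searchable custom metadata in a flat key space) can be sketched with a toy in-memory bucket. This is only an illustration: the `Bucket` and `StoredObject` classes, the keys, and the tag names are hypothetical, and this is not the ECS or S3 API. A real object store exposes these operations over REST/HTTP rather than as local method calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: opaque bytes plus user-defined metadata tags."""
    data: bytes
    metadata: dict[str, str] = field(default_factory=dict)


class Bucket:
    """Toy in-memory bucket: a flat key space, no directory hierarchy."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        """Store (or overwrite) an object together with its metadata."""
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key: str) -> StoredObject:
        """Fetch an object by key; raises KeyError if it is missing."""
        return self._objects[key]

    def search(self, **tags: str) -> list[str]:
        """Return sorted keys whose metadata matches every given tag."""
        return sorted(
            key for key, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in tags.items())
        )


# Hypothetical sample objects: two log files and one session recording.
bucket = Bucket()
bucket.put("logs/2021-07-01.json", b"{}", source="firewall", site="us")
bucket.put("logs/2021-07-02.json", b"{}", source="firewall", site="au")
bucket.put("video/session-1.mp4", b"", source="conference", site="us")

print(bucket.search(source="firewall"))
print(bucket.search(site="us", source="firewall"))
```

The point of the sketch is that the search runs against the tags, not the contents, so objects can be found and grouped without opening the data itself.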

Now it’s well suited for those static files and cloud storage. [00:13:30] One of the biggest strengths is the scalability. The scalability and the ability to search through that metadata and distribute it. Not only building it in your data center here, but also support globally on a global scale. Some of the limitations is the infinite bucket. We say the ability to continue to scale out and be able to walk through it. Now it’s not necessarily a database, so don’t get me wrong from that perspective. But if we look at it in the model, hey, as that unstructured data [00:14:00] comes into place, gives you the ability there and partnering with a solution like Dremio, you can have those lightning fast queries over your object protocol.

Now, I did want to touch a little bit on the public cloud. It’s not always when we see hybrid and a lot of folks are talking about hybrid and hybrid models, and what’s going on there. Well, here’s a couple of different questions that we got from a survey: 83% of business users have moved away or moved some part of app away from the public cloud. Now, [00:14:30] 34% of those that have repatriated, have done it for security issues. Talking with some of our customers, there’s a right and wrong way to do certain things. Some applications have a place from that perspective, but some have a place in their own environment. It may come from regulatory compliance. I have some customers that just can’t do that, or it just doesn’t make as much sense. Then there’s also users, the sense of loss of control and there’s concerns around security.

What we do from a Dell Technologies perspective is [00:15:00] we help customers decide, hey, which applications, how do we want to set this up? We can help build out those solutions to support hybrid applications, to support [inaudible 00:15:09] applications in the cloud and on-prem, and make that seamless transition as customers are starting to either build out their analytics application day one, or if they have something that’s in production and they need to be able to scale it as well, too.

Now, Elastic Cloud Storage or ECS, as we’ll say it, this is our solution [00:15:30] that we built from an object storage perspective. It’s cost effective at scale. We’ve got numbers to show in a couple different deep dive white papers that we can share, that show almost a 60% lower TCO when compared to building out private cloud versus in the cloud. Then also lightning fast S3 for modern applications. Standardizing on S3 so that whether your users, whether your data teams know, hey, the queries that they’re running, it’s transparent to them that they’re using the [00:16:00] S3 protocol product. It just shows the interest that we’ve seen when we start to talk about S3 as a standard protocol. All that comes with these enterprise solutions, but the confidence that, hey, we have the enterprise cybersecurity in different components in there to be able to protect that data, to be able to do the governance part, to abstract a lot of that away for our teams.

Now, when we talk about ECS, it’s not like I said, it’s [00:16:30] not something that you would just put in your data center, it’s a limitless scale, and that’s even on a global level. Whether we’re supporting a factory here in the US and you also have maybe a remote office or a factory as well in Australia, you have that ability to keep pace. Tiers for secondary storage, global access, lower TCO, and limitless scale. This is why we’re so excited about this solution that we’ve put together. Now, [00:17:00] I didn’t want to show too many numbers on this, but I definitely wanted to show you and we have a calculator and we had a third party independent report that came through and did this. So when we say 60%, this is the data that really backs up what we’re talking about.

Customers choosing ECS over public cloud versus active in archive, roughly about a $1.15 per gig, per month with ECS versus almost $2.00, a little over $2.00 just depending on if it’s an active or an archive use case right there. [00:17:30] A lot of that comes from just that storage ability. It’s the ability to drive down lower costs and give your teams the flexibility to still consume object from that perspective.

One of the things that we started looking at in the last few years is as object becomes more of a standard protocol, what are the things that we can do from an ECS perspective to make sure we’re supporting, I won’t even say modern workloads, it’s the futuristic workloads. They’re modern today, [00:18:00] but as we were looking back, the first was all flash. We’ve seen an uptick in the need for all flash. When we really think back to where object was just a few years back, typically it was looked at as an archive or longterm retention. But now as more and more applications and more and more analytics is done on object, there’s a huge need there to support that. Multiple different ways, I think we’ve talked about it too as [00:18:30] we see streaming analytics, 40% of that data, a lot of that’s going to come from [inaudible 00:18:34].

Not only are you building a system that builds a data lake specifically for your analytic engines, but a lot of times what you’ll have is you may have some archive data over there that maybe analyzed every six months, maybe every year, you pull it up and do some batch analytics on it. Well, this can all be within that system and keep your data in place. Give your teams the ability to search it and build queries out.

[00:19:00] Now, I talked about how much object was a game changer. With all the different things that are working on cloud native support containers, like everybody, we have to pay attention to what’s going on in the DevOps community, also AI and analytics and high speed recovery. If you’re not on the infrastructure team within your data team, when you’re talking and working with that, these are all the reasons that they’re looking at all flash. When you go back, you can ask them how [00:19:30] much flash and how much that’s improved. This is where we’ve seen it from an object perspective, which is why we have released the EXF900.

When we released this, we’re working on our performance testing right now with Dremio to update some of our papers. I’ll talk about those here as we get to the end. But some of the early numbers that we’ve seen on this is in some areas, a 21% uptick in performance improvement. [00:20:00] It’s huge. Stream performance there with the NVMe SSDs, within each one of those platforms, but also at scale. In one rack, you can have three petabytes of data. Remember what I was saying like, “Hey, one to 10 terabytes of data.” Customers are looking for their architecture and data teams need that petabyte scale and it gives them the ability to work for those modern workloads, like I said. Still using this open standard S3 protocol for all your AI and analytics [00:20:30] and even IoT to be able to support that growing data need that you have. We’re really excited about this and we can’t wait to publish some of the results from our joint solution with Dremio.

Now, I did want to touch and go a little bit down memory lane. From a Dell Technologies perspective, we’ve been in object, we’re so close to 20 years, I’d like to say 20 years, but it’s only 19. Next year at SubSurface I’ll be able to say 20. But our ECS platform has emerged [00:21:00] from Atmos and even I have customers who remember us from Centera as well, too. So we’ve been doing object storage for quite some time. But it’s interesting to see how that evolution to primary has evolved over the years. As we look back to the Magic Quadrant, I think it’s been three or four years that this specific quadrant has been out, Dell Technologies has always been a leader in the top spot there. That’s because of our built-in data protection.

[00:21:30] I was joking earlier with some folks at the conference this week that, I mean, think about how many different cybersecurity events that you see. It has to be ingrained in those solutions. We meet compliance and standards, future-proof and then like I said, this is something that we’ve been doing for 19 years and we continue to improve and evolve there. We’re thankful that we keep being recognized by Gartner for that leader spot within the file and object space.

[00:22:00] Now I will say, I talked a good bit about flash. When we did the original testing and the white paper that I’ll share here in just a second, the EX500 was a platform that we standardized on. We’re going back and doing the all flash, so we have the hybrid solution right now with Dremio, but also we’ll have that all flash as well, too. At the end of the day, what I tell a lot of customers is it goes back to the power of choice. Look at the application, look at the workload, see what you’re going to work with, but you have the option to choose whichever platform or however you want [00:22:30] to [inaudible 00:22:30]. You’ve got more performant needs, you can have all flash. Maybe you just have a policy for all flash. You have that existing there. Or if it’s more of a hybrid archive, some of the other workloads, you can go with one of the archive or hybrid platforms as well, too. Like I said, really excited about the partnership and the things that we’ve been working on together.

I’m about to open it up for questions, but I did want to say everybody will have access to the slides here. Here’s my email address, feel free to send me anything if you have any questions. Then we’ve got [00:23:00] a couple of different areas here. In the Dremio, Dell EMC partner portal, you can see all of our content that we’ve done together. Some of the webinars we’ve done in the past, the latest one that Brett and I did with BrightTALK. Then also we have our solutions overview brief and the deep dive white papers showing all of our test results and some of the standard benchmarks as well, too.

Thanks everybody again, for the time and happy to hang out and answer any questions.

Brian:    Yeah. Thanks, Thomas. Let’s go ahead and open it up for Q&A. If [00:23:30] you have questions, use the button in the top right corner to share your audio and video. You’ll be automatically put in a queue. If for some reason you have trouble doing that, you can just post your questions in the chat.

Thomas Henson:    Do we have any music while the questions come in? Is it like Jeopardy? [00:24:00] Do, do, do.

It’s not refreshing on my side. I can’t see anything in the chat right now. It’s just giving me a spinning circle, so if there’s something Brian, can you-

Brian:    Nothing yet?

Thomas Henson:    Okay. I just want to make sure I’m not missing anything.

Well, [00:25:00] I guess if nobody has any questions, we can give everybody a few minutes back in to swing by the expo center. Anybody can always reach out to me with any questions.

Brian:    Yeah, and feel free to join the Slack channel. Look for Thomas Henson if you have any questions for him.

Thomas Henson:    Well, thanks again, Brian. I really appreciate the help here and thanks everybody for joining and having me at SubSurface this year.

Brian:    Thank you.