March 2, 2023

8:50 am - 9:20 am PST

Lakes and Lakehouses: The Evolution of Analytics in the Cloud with AWS

More organizations than ever before are adopting data lakes to drive business outcomes and innovate faster. As data lakes grow in size and scope, data lake architectures have evolved to balance agility, innovation and governance. Amazon Web Services (AWS) is a pioneer in cloud data lakes, to the point where Amazon S3 is now the de facto storage for data lakes. In this session, Rajesh Sampath, General Manager for Amazon S3 API Experience, and Rahim Bhojani, Dremio’s SVP and Head of Engineering, discuss the evolution of data lakes, capabilities required to build a modern data lake architecture, emerging trends, and how organizations turn data into strategic assets.

Session ID: KY202

Topics Covered

Lakehouse Architecture



Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Rahim Bhojani:

Hi. Welcome, everybody. Welcome to our session, Lakes and Lakehouses: The Evolution of Analytics in the Cloud with AWS. My name is Rahim Bhojani, senior vice president at Dremio, responsible for engineering, product, and support. Prior to that, I spent eight years at Tableau and Microsoft. I’m very excited to host this fireside chat today. We will discuss how data lakes have been evolving, the capabilities required to build a modern data lake architecture, and how organizations turn data into strategic assets. But first, let me introduce to you Rajesh Sampath, GM of Amazon S3 API Experience. Rajesh, welcome to Subsurface Live. Could you please share a bit about yourself and your career journey?


Rajesh Sampath:

Absolutely, Rahim. Very excited to be talking to you and our audience at Subsurface Live. I’m Rajesh Sampath. I am the GM for S3’s API Experience Organization. I’ve been with Amazon for more than eight years now. And before that I was with Microsoft for eight years. And prior to that, I worked for a number of consulting companies based out of India. All my eight years I’ve been with S3. I’ve worked on different parts of S3, building some core distributed systems, contributing to S3’s key launches, such as strong consistency and encryption by default and some of these security capabilities as well. So as part of the API experience for S3, I am very interested in making our data lakes and analytics customers continue to succeed on top of S3. And as part of that, I’m really excited to be talking to you and our audience here at Subsurface Live.

Evolution of Data Lake Architectures

Rahim Bhojani:

Thank you very much for participating today. Let’s start with talking about how, in your view, data lake architectures have been evolving.

Rajesh Sampath:

Yeah, I mean, it’s really exciting to see the whole evolution of data lake architectures. And I would even go back to correlating the evolution of data lake architectures with the importance of data that we’ve seen over recent years. I would assert that over these years data has grown tremendously in importance in terms of driving business decisions and insights, to the point of becoming an asset for businesses and customers. And I can clearly see that as a correlation with, or as a result of, the evolution of these data lake architectures. For example, in the good old days when we were working on on-prem data warehouses, I’m sure both of us must have worked at some point in our careers on these setups.

I certainly had the opportunity to work on one of those really large on-prem data warehouses. Those were really good at helping customers get targeted analytics or reporting or inputs for decision making. But at the end of the day, they were actually a data silo. Customers were not able to get the full value out of their data. And then on-prem Hadoop data lakes helped customers get more insights and value out of their data because now they could process massive amounts of data on commodity hardware. And you are a Hadoop expert yourself as well, Rahim. So you understand what I’m talking about, right?

Rahim Bhojani:

Yeah. And that was a natural evolution because data warehouses, while great for getting started and solving immediate problems, you’re a hundred percent right, they created silos. And then the Hadoop infrastructure grew out of that. But at the end of the day, that was still on-prem, and it would still face scale challenges and, in cases where time to insight matters, performance issues. And to unlock that you need the evolution of a cloud data lake. In the new modern open data architecture and nomenclature, cloud data storage, something like S3 with the resilience and performance characteristics that are there, is absolutely mandatory. I don’t think you could evolve data lakes to where they are today without that.

Rajesh Sampath:

Absolutely, totally true, actually, and I’m connecting back to the whole journey itself, Rahim. If you look at these on-prem Hadoop workloads, customers were not able to scale compute and storage independent of each other because they’re tightly coupled. They also had to do a lot of undifferentiated heavy lifting in terms of dealing with hardware failures, keeping software up to date, and updating multiple applications concurrently or simultaneously. All of this resulted in customers spending a lot of their time on this undifferentiated heavy lifting when they could have been spending their time on things that would’ve really mattered to their business, on their core domain-specific decisions.

Rahim Bhojani:

That’s exactly right. And you have some customers that prefer that because of the type of data that they have. But then they have to absorb that cost. Like the cloud object storage gives you the ability to get the experts to handle that part of that for you, and then you can focus on the business that you need to go serve, because our customers, at the end of the day, are serving some end business user, and it really behooves us to support that rather than spending time on managing infrastructure.

Rajesh Sampath:

Absolutely, and you’re right. I mean, S3 is at the core of these architectures, right? S3 is the first, largest, and most performant object storage service and the storage service of choice to build a data lake. In fact, we are celebrating S3’s 17th birthday on March 14th, Pi Day. We’ll also be having an educational streaming event on Twitch. And, you know, today S3 holds more than 280 trillion objects and serves over a hundred million requests per second. Hundreds of thousands of data lakes are built on S3. As you called out, Rahim, a lot of household brands such as Netflix, Airbnb, PG 8, or Expedia and GE are using Amazon S3 to securely build and scale their data lakes and to discover business insights every minute.

S3 Growth

Rahim Bhojani:

Oh, that’s amazing to see. So, given that we have all these big companies establishing S3 as a standard for shared data assets, so to speak, what are some of the other things you may have noticed that have accelerated this shift and this growth?

Rajesh Sampath:

Yeah, absolutely, Rahim. In my opinion, there are a number of factors that have gone into this exponential growth of data lakes on the cloud driven by S3. But if I think of the top two or three factors, I can boil them down to these. Number one, first and foremost in my mind, customers have come to rely on S3 because of our relentless focus on fundamentals. S3’s unique distributed architecture gives customers industry-leading durability, availability, performance, and security at virtually unlimited scale. And this goes back to the conversation we had earlier about how Hadoop customers can scale storage independent of compute in the cloud. And all of this happens at very low cost to customers.

This means customers no longer have to worry about scaling, deploying storage, provisioning, or anything like that. We do all of that for you in the background. The second reason is that with S3, customers can process and analyze data in place with minimum data movement. You have really big customers run multiple concurrent analytics workloads on top of the same data without having to copy the same data over and over again to a different compute node. That is really powerful for customers running large-scale data lakes on top of S3. And the third thing is S3 continues to lead with innovations improving security, data protection, and agility for data lake and analytics workloads. For example, S3 Object Lambda helps customers perform data redaction operations at the source, very close to the data storage, so that multiple applications can use the data redaction capabilities without having to build point applications. We see customers use S3 Object Lambda, for example, for removing PII or for doing some special processing of the data along the GET path. Or customers using data lakes, when they want to give access to their data to multiple applications with different permission levels and whatnot, can do so using S3 Access Points. Access Points is another unique capability that allows data lake customers to provide targeted permissions to their applications without giving permissions at the bucket level.
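To make the redaction idea above concrete, here is a minimal Python sketch of the kind of transformation a function sitting on the GET path might apply before data reaches an application. It only shows the transform itself; the regular expressions, placeholder strings, and the name `redact_pii` are illustrative assumptions, not anything described in the session, and real PII detection needs far more care than two patterns.

```python
import re

# Hypothetical patterns chosen for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Mask email addresses and US SSN-shaped numbers in a text payload."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)


if __name__ == "__main__":
    original = "name,email,ssn\nAda,ada@example.com,123-45-6789\n"
    # Every application reading through the same transform sees the masked copy.
    print(redact_pii(original))
```

Because the transform lives in one place next to the storage, each consuming application gets the redacted view without carrying its own copy of the masking logic.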

Rahim Bhojani:

Well, that’s awesome. You’re building on sort of the notion of the acceleration of these architectures. This is personally for me, a very exciting topic. So allow me to geek out for a minute. As we discussed the data volumes as they’ve been increasing and the complexity of data as it has been changing, it naturally created a bottleneck with data warehouses because they couldn’t scale fast enough, they couldn’t evolve schema fast enough. It was too much of an effort that way. In came Hadoop, and that was somewhat successful. But again, it still had the same scalability and silo effect because of the on-prem nature of it.

Now, as we discussed, as you mentioned, this evolved into the decoupling of storage and compute, and the resilience and the performance at massive scale of a platform like S3 allows Dremio to do what we’ve been doing, right? We’ve been a read-only, really fast engine prior to last year, but now the evolution of table formats, which is the most exciting part of the last year, has allowed us to bring DML, time travel, partition evolution, and schema evolution capabilities to our customers. There are two main formats today, Delta Lake and Iceberg, that we see from our customers. And speaking about Iceberg, it’s a very open community, and it was brought to the forefront by a few people from Netflix, and they created the definitions in the open community, and it’s been evolving really quickly. What are your thoughts around some of these other trends in AWS that you’re seeing from your customers?
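As a rough illustration of why snapshot-based table formats make DML and time travel possible on immutable object storage, the toy Python model below records every change as a new immutable snapshot and lets a reader pick an older one. This is a deliberately simplified sketch: `ToyTable` and its methods are hypothetical names, and it does not reflect the actual metadata layout of Iceberg or Delta Lake.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Append-only list of immutable snapshots (a toy model, not a real table format)."""

    # Snapshot 0 is the empty table.
    snapshots: list = field(default_factory=lambda: [()])

    def commit(self, rows) -> int:
        """Record a new table state and return its snapshot id."""
        self.snapshots.append(tuple(rows))
        return len(self.snapshots) - 1

    def insert(self, row) -> int:
        """DML insert: the new snapshot is the previous rows plus one row."""
        return self.commit(self.snapshots[-1] + (row,))

    def delete_where(self, predicate) -> int:
        """DML delete: the new snapshot omits rows matching the predicate."""
        return self.commit(r for r in self.snapshots[-1] if not predicate(r))

    def read(self, snapshot_id=None):
        """Read the latest state, or an older snapshot for time travel."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


if __name__ == "__main__":
    table = ToyTable()
    table.insert({"id": 1})
    before_delete = table.insert({"id": 2})
    table.delete_where(lambda r: r["id"] == 1)
    print("latest:", table.read())
    print("as of earlier snapshot:", table.read(before_delete))
```

Old snapshots are never modified, which is what lets a query engine read a consistent earlier state while writers keep committing new ones.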

AWS Trends

Rajesh Sampath:

Oh, yeah. I mean, your call-out is very valid, Rahim; the Iceberg evolution is really exciting to see. And we see a lot of customers adopt these table formats. Iceberg, Delta Lake, all of these are really groundbreaking for customers. And we see that customers continue to build these open formats on top of the strong foundation that S3 provides with the operational fundamentals and the innovation that I spoke about. In terms of trends, Rahim, it’s really interesting, actually, how customers are demanding more from S3 and AWS in general, right? First and foremost, in this economic climate, our customers are looking for ways to optimize their costs. They’re looking for more transparency and more cost optimization opportunities.

Not so surprising, right? But what we’ve done so far in S3 is continue to invest in cost optimization for our customers. And, you know, S3 Intelligent-Tiering and S3 Storage Lens are really great examples of how S3 has continued to innovate and pass on those cost savings to our customers. For example, let’s talk about Intelligent-Tiering for a moment. Intelligent-Tiering is the only cloud storage class that delivers automatic storage cost savings when data access patterns change, without performance impact or operational overhead to our customers. Now, imagine being able to select Intelligent-Tiering as the storage class of your objects when you upload to S3, and S3 in the background automatically moving them to the appropriate storage class to optimize for storage costs, and then passing on the cost benefits to our customers. This itself has saved our customers over 715 million dollars since the launch of Intelligent-Tiering, compared to S3 Standard. And those savings continue to accumulate for our customers.
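Selecting the storage class at upload time, as described above, comes down to one request parameter. The sketch below only builds the keyword arguments for an S3 PutObject call so that it runs without credentials; the bucket name, object key, and the helper name `build_put_object_request` are made up for illustration, and the commented-out call assumes boto3 is installed and configured.

```python
def build_put_object_request(bucket: str, key: str, body: bytes) -> dict:
    """Build keyword arguments for an S3 PutObject call using Intelligent-Tiering."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Ask S3 to manage tiering automatically for this object.
        "StorageClass": "INTELLIGENT_TIERING",
    }


# Hypothetical bucket and key, used only as placeholders.
request = build_put_object_request(
    "example-data-lake", "events/2023/03/02/part-0000.parquet", b"example bytes"
)

# With boto3 installed and AWS credentials configured, the upload would be:
#   import boto3
#   boto3.client("s3").put_object(**request)
print(request["StorageClass"])
```

After that single choice at upload, no further client-side logic is needed for the tiering itself, which matches the "no operational overhead" point made in the session.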

Rahim Bhojani:

Yeah, that’s amazing.

Rajesh Sampath:

And then Storage Lens is another innovation that we delivered, which gives customers aggregate, account-level visibility into object storage usage, activity trends, and even actionable recommendations on cost optimization and data protection best practices. Storage Lens provides a free version. Actually, I do look at Storage Lens on my developer account just to make sure I’m doing the right things there. And it is auto-enabled for every account. That’s what I wanted to call out. And then there’s the optional paid version that you can upgrade to, to get 35 additional metrics, including identification of cold or abandoned buckets, lifecycle rule counts, or status codes for activities. On average, with the additional metrics on the paid Storage Lens product, customers get six times more cost savings than the cost they pay for Storage Lens, which means they recoup the cost, and they also use these insights to take corresponding actions on cost optimizations or on data protection best practices.

So overall this is the first trend that we are seeing, in terms of cost. The second trend that we are seeing, which is interesting, and I’ll be interested in your take on that as well, Rahim, is this move towards a decentralized set of interconnected data lakes, forming this data mesh architecture, right? We spoke about data silos created by on-prem data warehouses. With the power of S3-backed, modern cloud data lake architectures, you see customers move to this data mesh where individual lines of business can work on their individual data lakes and innovate faster. This gives them the agility and the ability to move fast with domain-specific data. And when they need to interact with each other, they can exchange data across these data lakes, forming this true data mesh that really unlocks the potential of data for their customers.

We see this in large enterprises where departments such as marketing or finance or even legal and sales can have their own data lakes running domain-specific data analytics jobs, and then interconnect with each other through data exchanges, or what we call data products, so that they can share the value across these organization boundaries. Taking that concept further, Rahim, it’s really mind-blowing to see how that concept has also evolved in terms of scale. Customers are seeing the value of this data mesh architecture, and they are using it to share data across company or business boundaries. For example, with AWS Data Exchange, we announced a preview at re:Invent last November which allows customers to share their data sets or data products across multiple organization boundaries so that everybody can benefit from data products coming from other organizations, truly enabling data collaboration as a strategic asset.

Now with Data Exchange, the preview launch we announced at re:Invent last year, the advantage is that data producers can use S3 as their data source and enable multiple subscribers to access the data without having to copy the data for every subscriber. This truly unlocks the potential of data products, data assets, monetizable assets for customers. That is really groundbreaking, and AWS Data Exchange has seen a lot of customers use this for multiple data products. And we are really excited about the adoption we got from the early preview. This is the second trend that I wanted to wrap up with, and I think you’ll go back into the modern data lake architecture that you folks are focusing on.

Data Mesh

Rahim Bhojani:

Yeah. And I wanted to give a plug to Storage Lens, because I told my QA team I got a surprise bill of $11,000 per day, and I need to go tell them about that technology because I don’t want surprises like that. But going back to the data mesh concept, really, if I look at it from a business user’s end, most of the big enterprises now have some kind of central chief data officer, and their job is to set policy, and their policy is usually around governance and scale and what’s blessed and what’s not blessed. Without this notion of having data at the center, which S3 allows you to do, you can’t be successful that way. Because one of the downsides of self-service is the proliferation of data assets, and then if you are the chief auditing officer or you are the CFO, you want trails there. You want to be able to say, okay, they’ve been making the decisions on the freshest, most correct source.

And this is a constant push and pull between the business units and the infrastructure teams or sort of the chief data officer, the people that are responsible for the quality and auditability of these assets. So a hundred percent agree with what you were saying with the trend. Even from a business process perspective, this architecture is also coming alive now. Some of these big banks and big retailers, they also talk about meshing this concept. And with Dremio, we take this a step further because in our product we have the notion of a semantic layer built in. So not only do you have storage, but data is at the center. With Dremio, you can create views into that data, which give it a business meaning, and then you can connect all these things together, be it from an end user analyst perspective or from the chief data officer perspective.

Rajesh Sampath:

The way you connected those two concepts is really amazing. On one end you have the agility play, and on the other end you have the need for governance. And when those two things come together, it’s really powerful. It’s really something, the dream that you would’ve had probably 10 years ago in terms of these architectures.

Joint Customer Partnerships

Rahim Bhojani:

That’s exactly right. Let’s talk about some of our joint customers and what we’re doing to help them. As an example, this morning I was speaking to a European customer that’s a trading platform, and they’re in the middle of a cloud migration. Their current storage platform is an S3-compliant device, and that’s step one for them to move to S3 proper. And what struck me is that your deep partnership with these types of vendors allows these migrations to happen. Otherwise, in the old days, it would be a really heavy lift and shift. Do you have any sort of insight into how important these types of partnerships are? Because in the end, it benefits a customer to move to the modern architecture that we’ve been speaking about.

Rajesh Sampath:

Absolutely, Rahim. I think in general, when we look at these data lake workloads, our goal is to help customers succeed. You and I, in our previous conversations, have spoken about time to value for the data and time to get insights from the data. Helping these customers reduce the time to value of the data is super important. And we work with partners both on the analytics side and on the on-prem storage side to help customers quickly migrate their on-prem data lake workloads to the cloud so that they can get the value of the data super quick. And then some of these innovations that we build along the way really help customers move real fast.

If customers want, there are other flavors of S3 for them to build for the cloud in the future. We have AWS Outposts and S3 on Outposts for customers to use in those kinds of environments as well. And then we have the data ingestion capabilities through the Snow Family of devices when customers want to ingest huge amounts of data into S3. But overall, our approach has been that we work with AWS services to help customers with ingestion and with building the architecture and analytics, and partners are a key aspect of that overall strategy as well. We work with multiple partners to help customers find value in their data.

Rahim Bhojani:

This is really true also for native AWS applications, because I spoke to a startup that we are partnering with, and they do ML models for causality of data, meaning they try to reason out why something changed and how it’s interrelated, and they built their entire stack on the AWS platform, with Arrow sort of in the middle of it to share and get the analytics portion of it. And their CTO’s comment was that because they are built on a cloud platform like AWS, they don’t actually have to worry about scale, because their whole thing starts with ingestion, right? Any model’s only going to be as good as the data that you feed it. And so they can take advantage of the cloud scale, but then also work in open source and be transparent about the things that they’re doing and how they’re doing it. In closing, what other innovations are you guys looking at from an S3 perspective that would benefit our common customers?

Future Innovations

Rajesh Sampath:

Yeah, absolutely. I called out some of the unique innovations we have brought into the space, starting with Object Lambda and also talking about Access Points. These help customers push more and more functionality into the storage layer and pass on the benefit of that functionality to multiple applications that are accessing the data. Another thing that I wanted to mention in this case was S3 Select, which helps customers push SQL predicates or filtering clauses into the S3 storage layer, so that multiple customer applications can take advantage of that. That’s another thing that we’ve seen multiple customers use for accelerating the performance of their queries in some cases as well. We’ve seen customers get up to 400% throughput improvements on their query performance. Overall, I think in terms of innovations and overall in the whole setup itself, we are looking at this as storage being the gravity, the center of all these data lakes, and pushing for governance capabilities, pushing for access points and block public access or default behaviors and what default behaviors where you disable and pushing for a lot of innovations on helping customers improve the performance of their data lake workloads.

These are the ways in which we are looking at innovating, and we will continue to raise the bar on these kind of things. We work with multiple partners such as Dremio and first party AWS services to pass on these benefits to our customers so that when they make their choice, they have all these capabilities available for them irrespective of the technology or the tools they use to get value of the data.
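The predicate pushdown that S3 Select provides, mentioned above, can be sketched as a single request that carries the SQL filter to the storage layer, so only matching rows travel back to the client. The Python below only constructs the request arguments and runs without AWS access; the bucket, key, column names, and the helper name `build_select_request` are hypothetical, and the commented-out call assumes boto3 and a CSV object with a header row.

```python
def build_select_request(bucket: str, key: str, sql: str) -> dict:
    """Build keyword arguments for an S3 SelectObjectContent call on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        # The filter is evaluated next to the data rather than in the client.
        "Expression": sql,
        # Treat the first CSV line as column names so the SQL can reference them.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical object and columns, used only as placeholders.
request = build_select_request(
    "example-data-lake",
    "orders/2023/orders.csv",
    "SELECT s.order_id, s.amount FROM S3Object s WHERE s.region = 'EMEA'",
)

# With boto3 installed and AWS credentials configured, the query would be:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
print(request["Expression"])
```

The design point is that the same filter is available to any application issuing the request, instead of each one downloading the whole object and filtering locally.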

Rahim Bhojani:

I’m really excited to hear about these innovations, particularly for us. We have a saying: if queries are slow, it costs customers money. Things like S3 Select would be a really big benefit to our customers. So with that, let me close by thanking you, Rajesh, for your time today and the dialogue. I personally, and I’m sure our listeners, very much appreciate it. On demand, there’s a session on best practices for building and operating a lakehouse on Amazon S3. Please be sure to watch it. There’s also an AWS virtual booth where you can go and learn about all these things that we discussed. And thank you very much for attending and listening to this session of Subsurface Live. Have a good day, everybody.