March 2, 2023

11:45 am - 12:15 pm PST

Best Practices for Building and Operating a Lakehouse on Amazon S3

Flexibility is key when building and scaling a data lake, and choosing the right storage architecture provides you with the agility to quickly experiment and scale. In this session, explore best practices for building and operating a data lakehouse on Amazon S3, allowing you to use AWS, open-source tools like Iceberg, analytics tools like Dremio, and even machine learning to gain insights from your data. Learn how to build, secure, govern, and query your data lake while optimizing for high performance and low cost.

Topics Covered

Lakehouse Architecture


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Jorge A. Lopez:

First of all, thanks a lot for joining this session. We’re going to talk a little bit about some of the best practices for building and operating a lakehouse on Amazon S3. So I’m sure you’ve heard a ton about the value of data, but sometimes it’s very difficult to make that information tangible. So did you know that by making data just 10% more accessible, a Fortune 1000 company will see a $65 million increase in net income? Well, the same goes for reducing costs. We can expect up to a 48% reduction in total cost of operations. And we always hear, especially during these times, about the need for organizations to lower costs. But I think it’s equally, and sometimes even more, important to think about revenue, new areas to generate revenue. And that’s what data provides, right?

In the end, return on investment (ROI) has two sides to that equation, right? So let’s hear about how a data lake can provide those outcomes for the business. Now, I’m talking about the data lake because what we have seen is that legacy architectures are not conducive to the desired business outcomes, right? One of the major challenges that customers are facing is basically that they have the data very siloed. They have legacy data warehouses, databases, Hadoop environments, and all of these are spread out within silos. So one of the things that many of our customers are doing is moving away from these traditional data architectures and consolidating all their data into a modern data architecture with a data lake on the cloud.

Now, when we think about a data lake, I know it can have many different connotations, but the way we see it at AWS is as an architectural approach: not a silver bullet, not a single platform or tool, but an architectural approach that allows you to store massive amounts of data in a central repository, so it’s readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization. Now, this type of architecture, because of the many benefits, like the separation of compute and storage, the scalability on demand, and the ability to pay as you go, quickly procure infrastructure, and then shut it down when you are done, allows you to run a lot of experiments, experiments that drive innovation. And you can run those experiments at very low cost, and that will drive the business outcomes that you require. So for the rest of this session we are going to go through some of the fundamental aspects of building the data lake, feeding it with data, implementing security and data governance, and finally optimizing for performance and cost. So, let’s get started. And Huey, over to you.

Transfer Data to Cloud

Huey Han:

Sounds good. Thanks, Jorge. So in the first section we’re going to talk, on a high level, about the different components that go into your data lake. As you build your data lake strategy, what are some of the considerations you want to have in mind on a high level? So, the first thing you want to decide when you build a data lake is where to store your data, sort of your storage layer. And there are multiple dimensions here that you want to consider: namely scalability, in terms of TPS and how much storage you can put; security, access control, and data governance, which are very important for a lot of customers; and ecosystem integrations, because you don’t want to just store your data, you want to put your data to work. Those are just a few of the top dimensions you want to consider. And from that perspective, S3 is a pretty good solution on all those grounds, and we’ll go into that in a bit.

Today, we host more than 200 trillion objects and average over a hundred million requests per second. For security, we offer different encryption and access control options. For ecosystem integrations, which we’ll talk about in a bit, we integrate with the different AWS services and also third-party software like Dremio. So all that said, the storage layer is the first decision you want to consider. The second decision comes once you’ve decided on your storage layer: a lot of customers need to migrate their data from an on-premises environment to the cloud and decide how to access that data. And for that there are multiple options that you want to keep in mind as well. So for migrations, when you just start off, you might want to move a large amount of data all at once.

And for that, we have the Snow Family, which is a collection of physical devices. Once a large amount of data has been migrated using the Snow Family, you might also want to use another service that continually syncs new data from on-premises to the cloud. And for that you might want to look into DataSync. The third thing we want to mention is that besides DataSync, we also have AWS Transfer Family, which is specifically tailored for transferring files in and out of S3 or EFS over different protocols like SFTP and FTP. The last thing we want to mention is that some customers prefer to have hybrid access, where they keep part of their data on-premises and part of their data in the cloud. And for that, Storage Gateway is something you want to keep in mind. It gives you on-premises access to data in the cloud.

Purpose-Built Analytics & ML Services

So now that your data is on AWS, you’ve decided your storage layer, and you’ve migrated your data, you want to put it to work. And for that, AWS has multiple services that help you do different things. This is not a session on these services, so we’ll just name a couple on a high level. EMR: you can use EMR to run highly scalable ETL pipeline jobs for data transformations, and EMR integrates with a lot of open source frameworks. Redshift: you can use it as a data warehouse for high-performance SQL analytics. SageMaker is the AI/ML tool that covers the entire machine learning lifecycle, whether you’re doing data exploration, model training, or model deployment. And besides AWS services, we also integrate with different third-party software like Dremio, with which you can do high-performing analytics on AWS and on S3. And actually, we did a lot of work with Dremio to integrate Dremio with different AWS services, for example Lake Formation, which we’ll get to in a bit, to help you enforce data governance.

Continuously Feed Your Data Lake

Finally, the last part of your high-level data lake strategy: now that your data is stored in S3, you’ve migrated your data, and you’ve decided what AWS services and third-party software to use to put data to work, the last thing you want to think about is how you can continuously feed your data lake to ingest new and fresh data. And for that, there are different services we want to mention. For streaming, which a lot of times is for log use cases, you can use AWS managed solutions like Kinesis or MSK. Or if you want to run your own, you can run Kafka or Flink itself on our compute services like EC2 or EKS. Besides streaming, you might also want to run ETL jobs to continuously enrich your datasets. Besides EMR, or Elastic MapReduce, which we mentioned on the last slide, we also offer AWS Glue, which is a serverless data integration service that makes it easier for you to create, run, and monitor ETL workflows.

Implementing Security, Resiliency, and Data Governance

Okay, so in the last section we talked, on a high level, about the different components that you want to think about when you’re designing and building your data lake on Amazon S3. Now we’re going to dive into the specifics. One particular topic that’s becoming very popular among customers is data governance. And for that, on this slide, we show the different security features that S3 offers to help you enforce data governance. So just to go over them very quickly: IAM, bucket policies, and access points offer IAM-based, very expressive primitives for you to craft permissions and tailor access to your specific needs. And these are really good for infrastructure-shaped access. Besides access control using IAM policies, there’s also a range of encryption options.

Whether you want to do client-side encryption or server-side encryption, whether you want to bring your own key or have AWS manage keys for you, there’s a specific feature, whether from S3 or from other AWS services like KMS, tailored for that need. One thing here we want to particularly call out is that in 2020 we launched S3 Bucket Keys, which can reduce your KMS request cost by up to 99%. This is done by generating a bucket-level key so that S3 doesn’t call KMS every time there’s a new object; instead, it caches this bucket-level key and reuses it.
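
As an illustration of the Bucket Key option described here, the following minimal Python sketch builds a default-encryption configuration with the bucket-level key turned on. The dictionary follows the shape of the `ServerSideEncryptionConfiguration` argument that boto3's `put_bucket_encryption` accepts; the function name and the KMS key ARN are hypothetical placeholders, and the actual API call is left out so the sketch runs without AWS credentials.

```python
import json


def bucket_key_encryption_config(kms_key_arn: str) -> dict:
    """Build a default-encryption config that uses SSE-KMS with a bucket-level key."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                # With this flag set, S3 reuses a bucket-level key instead of
                # calling KMS for every new object.
                "BucketKeyEnabled": True,
            }
        ]
    }


# Hypothetical key ARN, for illustration only.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

if __name__ == "__main__":
    print(json.dumps(bucket_key_encryption_config(EXAMPLE_KEY_ARN), indent=2))
```

In practice this dictionary would be passed, together with a bucket name, to the S3 client; that step is omitted here.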

Another thing we want to mention is S3 Object Lambda, which is a compute layer in front of S3 that can modify and process the data on the fly. In the case of data governance and security, we’ve seen customers use Object Lambda to redact PII on the server side before returning the object to you. This really simplifies the workload for you and makes it a little bit safer, because you don’t have to maintain and manage client-side modifications. Last but not least, we also offer Object Lock, which allows you to block object version deletion. That is very important for a lot of customers that care about compliance and regulatory requirements and want to use Object Lock to block deletions, whether malicious or accidental.
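
To make the redaction use case concrete, here is a small Python sketch of the kind of transformation such a function might apply to an object's text before it is returned. It covers only the redaction step; the Lambda handler wiring for Object Lambda is deliberately omitted. The two patterns (email addresses and US Social Security numbers) and the replacement tokens are assumptions chosen for illustration, not a complete PII detector.

```python
import re

# Illustrative patterns only; real PII detection usually needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and SSN-shaped numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)


if __name__ == "__main__":
    print(redact_pii("Contact jane@example.com, SSN 123-45-6789"))
```

Because the redaction happens before the object reaches the caller, every consumer sees the same redacted view without carrying its own copy of this logic.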

Amazon S3 Access Points

Let’s go to the next slide. One access control feature we want to particularly highlight here is Amazon S3 Access Points. With data lake workloads becoming increasingly important, which is the topic of this conversation, one common pattern is that customers store a lot of data in one bucket and share that bucket with many applications, teams, and individuals. Managing access to this shared bucket using a single bucket policy can become pretty complex and onerous. S3 Access Points help you simplify and scale this. You can create hundreds of access points in front of this shared bucket, each representing one dataset, one use case, one team, or even one individual. And with each access point, you can create a unique policy tailored for that dataset, team, use case, or individual.

And this allows you to scale your permissions for a large number of shared datasets in a single bucket. And back in the summer last year, we had some really exciting news for access points. Now you can create up to 10,000 access points per region per account, which further enables you to scale your access control on S3. In addition, Access Points now support SageMaker, Redshift, and CloudFront with tighter integrations, which allow you to use access points directly within these applications. And that comes back to the ecosystem integration part that we discussed earlier. Now I’ll pass it back to Jorge to talk about Lake Formation.
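
The one-policy-per-access-point pattern described above can be sketched as follows. This minimal Python example generates a read-only policy document for a single team's access point, scoped to one prefix. The account ID, region, access point name, role name, and prefix are all hypothetical placeholders, and the policy is only printed, not attached to anything.

```python
import json


def access_point_policy(region: str, account_id: str, access_point: str,
                        role_name: str, prefix: str) -> dict:
    """Build a read-only policy for one access point, limited to one prefix."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
                "Action": ["s3:GetObject"],
                # Objects reached through an access point are addressed as
                # <access point ARN>/object/<key>.
                "Resource": f"{ap_arn}/object/{prefix}/*",
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical team, role, and prefix names.
    policy = access_point_policy("us-east-1", "111122223333",
                                 "finance-ap", "FinanceAnalyst", "finance")
    print(json.dumps(policy, indent=2))
```

Each team or dataset would get its own access point and its own small policy like this one, rather than a growing list of statements in a single bucket policy.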

AWS Lake Formation

Jorge A. Lopez:

Thanks, Huey. So AWS Lake Formation is another key service that makes it a lot easier to build and manage data lakes on AWS. I won’t go into all the specifics of the service, but some of the key capabilities to keep in mind include quickly importing data from all your data sources and then describing and managing it in a centralized data catalog. This is super important, of course: organizations need to understand what data they have and where it resides in order to be able to take advantage of it. It also allows you to scale permissions more easily with fine-grained security, so it provides capabilities such as row-, column-, and even cell-level permissions. And it lets you centrally manage access to available datasets from the different engines.

One of the things it also allows is sharing data, even across different accounts. And that enables more sophisticated architectural patterns like the data mesh, where you think in terms of data consumers and producers, and you basically organize the data by domain. And I should say that Dremio also supports Lake Formation; it is integrated and works with Lake Formation.
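
To illustrate what row-, column-, and cell-level permissions mean in practice, here is a purely conceptual Python sketch. It is not the Lake Formation API: the function, the sample records, and the filter rule are all hypothetical, and the point is only to show how a row predicate combined with a column allowlist narrows what a given consumer can see.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def apply_data_filter(rows: List[Row], allowed_columns: List[str],
                      row_predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only rows that pass the predicate, and only the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


if __name__ == "__main__":
    # Hypothetical records and rule: EU analysts see EU rows, without the email column.
    records = [
        {"id": 1, "region": "EU", "email": "a@example.com", "spend": 120},
        {"id": 2, "region": "US", "email": "b@example.com", "spend": 300},
    ]
    visible = apply_data_filter(records, ["id", "region", "spend"],
                                lambda r: r["region"] == "EU")
    print(visible)
```

Combining a row filter with a column filter is what yields cell-level control: a consumer sees a specific value only if both its row and its column are permitted.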

Leverage Open Table Formats

Another key thing that we are seeing more and more is the desire for customers to have database-like capabilities, like ACID transactions, time travel, and performance optimizations with data compaction, in their data lakes. One of the advantages of building your data lake on S3 is that it supports frameworks that allow you to do just that. One example is Apache Iceberg. I know this conference is very bullish on Iceberg, and a lot of customers are using and adopting these frameworks. So these kinds of table formats are well supported on Amazon S3. And the nice thing about these frameworks is that they also provide broad engine compatibility, right? Like Spark, Trino, Presto, and so on.

Optimizing Performance and Cost

Okay, so now let’s talk a little bit about how you optimize performance and cost. First of all, when building your data lake, it’s important to keep a few best practices in mind. Performance tuning is an iterative process, so it’s a good idea to continuously check, monitor, and adjust, right? In general, you should think of S3 as a highly distributed system. And in order to get the best throughput out of this system, you need to scale horizontally by partitioning the data. One way to do this is by creating prefixes. Amazon S3 performance is defined per prefix, not per bucket. Your applications can achieve around 3,500 PUT, COPY, or DELETE operations and 5,500 GET requests per second, and this is per prefix. So say you have 10 prefixes and you partition your data along those prefixes: you could potentially achieve up to 55,000 reads per second in this example, right?
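
The per-prefix arithmetic from this example can be written out as a short Python sketch. The two rate constants are the figures quoted in the talk; the hash-based `shard=NN/` key layout is an assumption added for illustration, since the talk does not prescribe how to choose prefixes (data lakes often partition by date or domain instead).

```python
import hashlib

# Per-prefix request rates as quoted in the talk.
GET_PER_PREFIX = 5_500
WRITE_PER_PREFIX = 3_500  # PUT, COPY, or DELETE


def max_request_rates(num_prefixes: int) -> dict:
    """Aggregate request rates when load is spread evenly across prefixes."""
    return {
        "reads_per_second": num_prefixes * GET_PER_PREFIX,
        "writes_per_second": num_prefixes * WRITE_PER_PREFIX,
    }


def sharded_key(key: str, num_prefixes: int) -> str:
    """Assign a key to one of num_prefixes prefixes using a stable hash (hypothetical layout)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_prefixes
    return f"shard={shard:02d}/{key}"


if __name__ == "__main__":
    print(max_request_rates(10))
    print(sharded_key("events/2023/03/02/part-0001.parquet", 10))
```

With 10 prefixes this gives 55,000 reads and 35,000 writes per second, provided the requests really are spread evenly across the prefixes.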

So that’s about throughput. Now, you also need to optimize the file size. If your files are too small, typically less than 128 megabytes, execution engines might spend additional overhead opening lots of small S3 files. On the other hand, if the files are too large, the query waits until a single reader has completed reading an entire file. One way to fix this is by using the compaction capabilities that come with, for instance, AWS Glue, which is another key service, or Lake Formation, but also by using these processes from frameworks like Iceberg, for example. Finally, compression and columnar data formats are very important. As we mentioned before, you can store the data in any format you want, right? But storing it in the right format will help make queries significantly faster. We talked about horizontal scaling, which helps a lot with the throughput.
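
As a rough illustration of what a compaction step decides, the following Python sketch groups files smaller than the 128 MB threshold mentioned above into batches that could each be rewritten as one larger file. This is a simple greedy grouping written for illustration; it is not how Glue, Lake Formation, or Iceberg implement compaction, and the file names and sizes are made up.

```python
from typing import Dict, List

TARGET_BYTES = 128 * 1024 * 1024  # the small-file threshold mentioned in the talk


def plan_compaction(file_sizes: Dict[str, int],
                    target_bytes: int = TARGET_BYTES) -> List[List[str]]:
    """Greedily group small files into batches of at least target_bytes each."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_bytes:
            continue  # already large enough, leave it alone
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if len(current) > 1:
        batches.append(current)  # leftover small files are still worth merging
    return batches


if __name__ == "__main__":
    mb = 1024 * 1024
    files = {"a": 64 * mb, "b": 64 * mb, "c": 200 * mb, "d": 10 * mb, "e": 20 * mb}
    print(plan_compaction(files))
```

A real compaction job would also cap the upper file size, since the talk notes that overly large files hurt read parallelism.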

Now, if you compress and use the right formats, you can reduce query latency and also lower your storage costs. For example, Parquet is a very popular option here. By using Parquet, a given query can run up to 30 times faster and result in up to 99% lower cost than a similar query with data in a text file. And of course, always check for more advanced capabilities, which you can find in the latest SDKs, in order to obtain the fastest performance.

S3 Intelligent-Tiering

Now, another very simple trick I would like you to remember is to use Amazon S3 Intelligent-Tiering. S3 Intelligent-Tiering automatically stores objects in three access tiers: a Frequent Access tier, an Infrequent Access tier with 40% lower cost, and an Archive Instant Access tier with 68% lower cost than the Infrequent Access tier. The way it helps you save cost is by automatically moving the data to the most cost-effective access tier based on access frequency. And this doesn’t have any performance impact or operational overhead. You always get millisecond access.

So for instance, objects not accessed for 30 consecutive days are moved from the Frequent Access tier to the Infrequent Access tier. If the data is accessed, then it gets put back into the Frequent Access tier. So to be honest, I cannot think of one reason why customers wouldn’t use this as the default storage class for almost any workload, and especially for data lakes and analytics.
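
The tiering behavior described above can be modeled with a few lines of Python. The 30-day threshold and the 40% and 68% savings figures come from the talk; the 90-day threshold for the Archive Instant Access tier is not stated in the talk and is taken here as an assumption based on AWS's published description of the storage class. The costs are relative to the Frequent Access tier, not actual prices.

```python
def intelligent_tiering_tier(days_since_last_access: int) -> str:
    """Return the access tier an object would sit in, given its idle time in days."""
    if days_since_last_access < 0:
        raise ValueError("days_since_last_access must be non-negative")
    if days_since_last_access >= 90:  # assumption: threshold not given in the talk
        return "Archive Instant Access"
    if days_since_last_access >= 30:  # from the talk: 30 consecutive days
        return "Infrequent Access"
    return "Frequent Access"


def relative_storage_cost(tier: str) -> float:
    """Storage cost relative to the Frequent Access tier (1.0)."""
    infrequent = 1.0 - 0.40  # 40% lower than Frequent Access
    costs = {
        "Frequent Access": 1.0,
        "Infrequent Access": infrequent,
        # 68% lower than the Infrequent Access tier
        "Archive Instant Access": infrequent * (1.0 - 0.68),
    }
    return costs[tier]


if __name__ == "__main__":
    for days in (5, 45, 120):
        tier = intelligent_tiering_tier(days)
        print(days, tier, round(relative_storage_cost(tier), 3))
```

An access resets the idle counter, which in this model corresponds to calling the function again with zero days and landing back in the Frequent Access tier.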

Storage Lens

Next, Storage Lens. This is a capability we introduced back in 2020. Overall, Storage Lens helps customers gain visibility into S3 usage and activity trends, at the bucket level and all the way down to the prefix level. And then it makes actionable recommendations to implement data protection best practices, as well as to improve cost efficiency. So once you turn it on from the console, you just need to wait two weeks and you will get a nice dashboard with over 60 metrics that you can use to identify cost-saving opportunities, as well as best practices for security and data protection.

Reference Architecture

Okay, so far we have seen a lot of different services, which I’d like to think about like Lego blocks or building blocks. And just as with your favorite building toys, you can really assemble them in very different ways depending on your specific use case. So I just wanted to give you an example. For every data lake that you will see, you will have an ingestion layer with services like Kinesis or Kafka. Then you have a transformation and storage layer where you will find S3, the core of the data lake, and then your data catalog, your data governance and policies, as well as the transformations. And then you will have a consumption layer with all your analytic engines, like EMR and Dremio, your data warehouses like Redshift, as well as business intelligence, dashboards, and machine learning.

And that leads us to the end of the session. Some of the key takeaways: first, I will encourage you to use Amazon S3, especially Intelligent-Tiering, as the data fabric of your data lake. Then, obviously, follow security and performance best practices. Always try to block public access and use encryption by default. By the way, all new objects on S3 now have these characteristics. Always think about optimizing file size and using compression and columnar data formats, and think about Lake Formation. That will save you a ton of time when building, managing, and operating that data lake.