Subsurface Summer 2020
AWS Data Lake Architectures
Customers today analyze more and more data and are looking to modernize their analytics stack. Data lakes on AWS have become a popular architecture for massive scale analytics and also machine learning. In this session, we will take a look at the general data lake architecture on AWS and dive deep into our newly released analytics service, AWS Lake Formation, which can be used to secure your data lake. As part of this session, we will review some of our customers’ data lake architectures and use cases that are being solved.
Raghu Prabhu, Global Manager, Data Lakes, AWS
Raghu Prabhu is a global business development manager for data lakes on AWS, where he’s been for over four years. He helps organizations across all industry verticals around the world build data lakes on AWS that are secure, massively scalable and cost-effective. Before working at AWS, Raghu was the co-founder and CTO of a marketing tech firm where he developed marketing ETL applications and APIs into marketing and CRM products.
Hi everybody. Thank you for joining us for our session. Just a quick reminder that after the presentation is finished, we will be doing live Q&A. You'll need to ask for permission to share your camera and your video to ask a question live in the session. And, with that, I want to thank Raghu for taking the time out of his schedule to speak with us today. Raghu, go ahead and take it over.
Thank you, Melissa. Hi everyone. Thank you so much for joining me today. My name is Raghu. I'm a global data lake go-to-market leader. I take new products, our analytics products that help you build data lakes on AWS, to the market. So, let's get started. We've got a lot to cover. Here we go. So, this is our agenda today. Why do you need a data lake, and why should you build on AWS?
Then we have some architectures that we are going to go through: Vanguard, Epic Games, Asurion and Salesforce. And then finally, depending on how we do on time, we will spend a few minutes or seconds on AWS Lake Formation. That's one of our latest services that will help you build data lakes. So, I would like to introduce that service to you. So, that's the agenda. What I would love to do is to go through item one and item two rather quickly, so we can spend the bulk of the time looking at these architectures. Okay. So, let's get going. So, why do you need a data lake?
Well, I think this is something that we all understand really well. All the attendees here have probably experienced this, one way or another, which is that data is growing exponentially. You just have more data. And, you are ingesting data from more and more sources. There are just new sources all the time. And, you're onboarding new users and new applications all the time. And, we have more data scientists in the room, at least in my experience, in equal number to the data engineers, so we have more people making use of this data. So, all of these things are growing and changing, and therefore, data lakes may become a more relevant solution for you.
Okay. So, this is just something as a baseline. We understand this technology quite well, which is data warehouse technology. There is nothing wrong with it. I personally love it too, but there are certain limitations in data warehouse technology. So, we take those limitations, and we remove them in a data lake. So, use this as a baseline. Databases and data warehouses are a box. They have a limit with respect to either storage or throughput. But otherwise, these are great technologies, and what you will find is once you cross about five petabytes of data in a data warehouse, you're going to feel the pain of either maintaining it, or at least the cost of having to keep the data in the data warehouse.
So, this is a good solution, if you don't have a big data problem. But, if you do have a big data problem, I think you're going to find this solution to be quite limited. And therefore, you could make a case for data lakes on AWS. All right. So, cloud data lakes are the future. What do customers want? Customers want to eliminate data silos. Customers want a single data repo, a single storage for your data lakes in the cloud. Our customers want to keep their data securely, and use open file formats, not proprietary file formats, they want to use open file formats. They want to keep their costs low. All right. And, they want to analyze their data in multiple ways. They don't want to be constrained, or have any kind of rules in how you are going to be consuming the data. You choose how you're going to consume the data.
So, you need flexibility in the analytics that you're going to do on top of this data. You need the flexibility to do real-time analytics. You need the flexibility to do batch-type analytics. The best way to think about this is the following: the air around you is ambient, right? You need to surround yourself with good, clean data, just like air. Air should be cheap, or as close to free as possible. You need the exact same thing with data. Data needs to be ambient in your organization. That's very, very important. And, data lake technologies help you do that.
Okay. So, we are getting into specifics here. We are at 50,000 feet. Okay. So, pay attention to my mouse pointer. Okay. As you can see, this is a slide that I put together. So, it's not the prettiest of things, but that's okay, I'm going to walk you guys through it. So, when we build data lakes, we are going to start with the storage, because you start your data lake journey with the storage system; you are looking to bring all your data into a single data store. And, that data store on AWS is S3. So, S3 is your data lake data store. You're going to bring this data in two different ways. And, this probably just completely makes sense. You're going to either batch load your data into S3, or you're going to stream your data into S3.
And at the same time, what type of data are you bringing into S3? Structured data, semi-structured data and unstructured data. What's structured data? Structured data is schema-first data. It's your data that's in databases, in data warehouses. You define the schema, and then you pump data into them. All right. That's your structured data. Semi-structured data are the troublemakers, because it's application logs and any kind of log analytics, it's social media data, it's CSVs, JSON, XML, Avro, Parquet. The schema is not always explicitly defined. Parquet and ORC have the schema built into them. But in a CSV, which is an extremely popular data format, the schema is implicit in the data. Meaning, you can look at the columns and you can say, "Hey, this is a string, this is an integer." So, that's semi-structured data.
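To make the "implicit schema" point concrete, here is a minimal Python sketch that looks at the columns of a CSV and guesses a type for each one. The sample data, the column names and the three type names are assumptions made up for illustration; a real schema discovery tool handles far more cases (dates, nulls, nested data).

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    def all_parse(cast):
        try:
            for value in values:
                if value != "":
                    cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"


def infer_csv_schema(text):
    """Read CSV text with a header row and guess a type for each column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    return {col: infer_type([row[col] for row in rows]) for col in columns}


# Hypothetical sample file: nothing in it declares the column types.
SAMPLE_CSV = "user_id,state,amount\n1,WA,19.99\n2,CA,5\n3,NY,42.50\n"

schema = infer_csv_schema(SAMPLE_CSV)
print(schema)
```

The types come purely from inspecting the values, which is why the same file can be read differently by different tools unless the schema is recorded somewhere.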
And the majority of the time, that's what you're dealing with. And then, the third type of data is unstructured data. And that can be PDFs, Word documents, voice memos, videos, that kind of data. So, three types of data, structured, semi-structured and unstructured, brought in two different ways into a single data store, such as S3. Once you land your data in S3, that's when you get to monetize your investment in your data. You get to do analytics, which is mostly SQL, dashboarding, reporting, and/or machine learning, right. So, this is an end-to-end structure of data lakes on AWS. So, this is at 50,000 feet. I know you guys are technical. This is more of a technical slide. It has the same story, but this is more at 10,000 feet. It's the same story from left to right.
You're bringing data from various different data sources. Okay. That can be streaming, like I said, right, from databases, data warehouses, any kind of log analytics, that type of data being brought into a single data store, such as S3. When I say S3, and I've been doing this for quite some time, it's data that's being put in S3 across various different AWS accounts, various different buckets; and of course, within those buckets, various different prefixes or sub folders. Okay. So, you bring your data into S3 across all those different tiers. Once you land your data, you want to identify the schema of your data. That's when things become a little bit more real. Schema of the data can be deciphered easily by Glue crawlers. Okay. Glue crawlers can go, and read the schema of this data, and create the metadata tables for you inside Glue Data Catalog.
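As a rough sketch of the crawler step, the snippet below builds the request for a Glue crawler that scans a couple of S3 prefixes and writes metadata tables into a Glue Data Catalog database. The crawler name, IAM role ARN, database name and bucket paths are all made-up placeholders. The `create_crawler` and `start_crawler` calls are the boto3 Glue client operations; they are wrapped in a separate function because they need boto3 and AWS credentials to run.

```python
def build_crawler_request(name, role_arn, database, s3_paths):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start it. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All identifiers below are hypothetical placeholders.
request = build_crawler_request(
    name="raw-events-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="raw_zone",
    s3_paths=[
        "s3://example-data-lake/raw/events/",
        "s3://example-data-lake/raw/orders/",
    ],
)
print(request)
```

One crawler can take several S3 paths, which lines up with data spread across different buckets and prefixes.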
All right. What's Glue Data Catalog? Glue Data Catalog is a Hive-compatible metadata store. All right. It's a Hive-compatible, highly scalable, highly reliable metadata store. So, the metadata tables get created inside the Glue Data Catalog. And, because of the Hive compatibility, I'll start with EMR. EMR is our big data analytics solution. Apache Spark, Apache Hive, or even Presto can use this Hive-compatible data catalog for the schema. Amazon Athena is our interactive query service, which basically means that if you can write SQL, you can use Athena. That, again, uses the Hive-compatible Glue Data Catalog for schema. Redshift is our data warehouse. The nice thing about Redshift is we have built a lake house architecture, which means you can extend your Redshift data warehouse into a data lake solution. So, you can have these Glue Data Catalog metadata tables show up as external tables.
And, you can write a query joining tables that are native to Redshift, and you could do, maybe an inner join to an external table that is sitting outside. And, we have this engine in between called Spectrum that will do the heavy lifting in going and reading data from S3, applying the schema and giving this data to you. We also have an ETL solution, it's Spark and Python based. It's completely serverless. So, from ETL to interactive query system, EMR for big data solution, and also a data warehouse solution.
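The idea of joining a native warehouse table to an external table, with the schema applied when the data is read, can be illustrated with a toy Python sketch. This is not how Redshift Spectrum is implemented. The in-memory `CATALOG`, the fake S3 object and the `CUSTOMERS` table are stand-ins invented for the example.

```python
import csv
import io

# Stand-in for a Glue Data Catalog entry: column names and types for files in S3.
CATALOG = {"clickstream": [("customer_id", int), ("page", str), ("seconds", float)]}

# Stand-in for a headerless CSV object in S3; the file itself carries no schema.
S3_OBJECT = "101,/home,3.5\n102,/pricing,12.0\n101,/docs,40.25\n"

# Stand-in for a table stored natively in the warehouse.
CUSTOMERS = [
    {"customer_id": 101, "name": "Acme"},
    {"customer_id": 103, "name": "Globex"},
]


def read_external(table, raw_text):
    """Apply the catalog schema to raw rows at read time (schema-on-read)."""
    schema = CATALOG[table]
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def inner_join(native_rows, external_rows, key):
    """Inner join of a native table with rows read from the external table."""
    by_key = {row[key]: row for row in native_rows}
    return [
        {**by_key[ext[key]], **ext}
        for ext in external_rows
        if ext[key] in by_key
    ]


joined = inner_join(CUSTOMERS, read_external("clickstream", S3_OBJECT), "customer_id")
for row in joined:
    print(row)
```

The raw file never changes; only the catalog entry decides how its columns are named and typed, which is why several engines can share the same data.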
We have partners too, such as Dremio and many others, that build on top of this ecosystem. So, it's not just Amazon services, or AWS services. There are a whole bunch of partners, Dremio included of course, that fit into this type of architecture. Okay. So, a quick time check. We are coming up on a year. We released Lake Formation about a year ago. So, we're going to just quickly talk about the service. If you look at before and after, right, this is an architecture we have had for about, I think, three-plus years. And, we needed a security and governance solution on top of it. So, we added one. So, we released Lake Formation about a year ago, and it has three goals. Okay. Goal number one, to be the security and governance layer on top of your data lake. Goal number two, management of S3 policies. If you don't want to manage S3 policies for analytics, you can hand the job to Lake Formation, and Lake Formation does a fantastic job managing access to S3.
Third, it also helps you with ingestion. It's easy to say, "Hey, let's ingest your data from various sources into S3," but we do make that process easy for you. We have built some ingestion services called blueprints inside Lake Formation. Why should you build on AWS? Well, there are four pillars to building data lakes on AWS. Data movement, you've got to bring data from wherever into S3. Data lake infrastructure, so you've got to store and secure this data. Analytics, you're going to run SQL on top of this data. And you would want to be doing machine learning and data visualization on top of this data. So, there are four pillars to this. And, there are many different actions and activities that you're going to do as you build this massively scalable data lake. We realized that, and therefore, if these are the actions that you're going to engage yourself in, these are the services that you could use in building data lakes on AWS.
Compliance. So, we are probably the most decorated, most compliant, most certified cloud provider out there. We have some of the generic compliance certifications. We have U.S. and also other global certifications. So definitely, if there is something that you would want to know, we probably are certified. It may not be here, but definitely check with your account team. And again, we've got tens of thousands of data lakes on AWS across all industries. I'm very happy to tell you that I work with most of the ones here. So, lots of logos, and these are some of my favorite ones, because I work with them. All right. So, one of the goals of today's session is to walk you through some sample reference architectures. We can spend hours talking about this. So, what I'm going to do is I'll have you guys see a pattern across these architectures. It's really not that hard.
There are patterns to everything, and I have handpicked some of the ones that we can talk about, and there is an underlying pattern to them. So, that's the goal. Let's go. Vanguard. Well, who doesn't know Vanguard, a big finance company. And, what they wanted to do was something really, really interesting. They wanted to create a ready-for-analytics data platform. I am working with so many customers who are trying to do this exact same thing, a ready-to-query data platform. Which means data is ready for you, whether you're an analyst, you're a data scientist, or you're a data engineer. You would want to come, find your data, and start querying your data, because it's ready for you, for analytics. Okay. So, they wanted to ingest about a petabyte of data, and of course ever-growing, and they had more than a hundred different data sources. And, as for the way they wanted to do it, they liked Presto on EMR. That was their choice of analytics solution.
So, Presto on EMR, data on S3, about a petabyte to get started. And, what would be the best way to do this? They wanted it cheap, and they wanted it scalable, and they wanted it ready. All right. Okay. So, there are a lot of... Just pay attention to... I'll walk you guys through this. There is something that I want you guys to notice [inaudible 00:16:58] in Vanguard. From left to right, again. They're bringing data from hundreds of sources: databases, many files, many different third-party vendors, API sources, all being brought into S3. Okay. But, the part that I want you guys to pay attention to is the shaded area. Okay. There are three shaded areas. And, this is a raw layer in S3. Okay. So, there are three layers to their data lake: a raw layer, a cleansed layer, and a ready-for-analytics layer. All right. When we help our customers build data lakes, we see this pattern all the time.
The raw layer. You want to land data as quickly as possible with as few transformations as possible. So, you want to land the raw data into S3, get the data, okay. That's your raw layer. You may be landing data in JSON, Parquet, Avro, just anything the ingestion service is going to deliver, you're going to just land it in your raw layer. Then you want to take that raw data, and you want to make something out of it, ready for analytics. You want to take this to the next step. And, most people use a columnar data format such as Parquet or ORC. And, in that process from raw to cleansed, you're going to drop some columns, because they are either just completely irrelevant, or too sensitive, and you would want to drop them. And, you will clean things up and normalize, maybe the state names, the capitalization. You'll normalize the data.
But otherwise, you're going to carry the data pretty much as is from raw to cleansed, with some modifications, okay. So, this is mostly Parquet or ORC, a columnar data format. You can query this data, but this is, again, very detailed data. Most people create another layer of more aggregated data. Aggregated per day, aggregated by the account, aggregated for the quarter, aggregated for a particular customer. So, this is the more aggregated data. The details are sometimes moved along with the aggregated data, or the details are left in the cleansed layer. So, what you're going to offer to your customers is the aggregated layer, and the cleansed layer if they need the details. That's just kind of how it goes.
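The raw, cleansed and ready-for-analytics layers described above can be sketched in a few lines of Python. This is an in-memory illustration only: the records, the `ssn` column treated as sensitive, and the per-account-per-day aggregation are assumptions for the example, and a real pipeline would write the cleansed and aggregated layers back to S3 in a columnar format rather than keep them in lists.

```python
from collections import defaultdict

# Raw layer: landed as delivered, everything is a string, values are messy.
RAW = [
    {"account": "A1", "state": " wa ", "ssn": "111-22-3333", "day": "2020-07-01", "amount": "10.50"},
    {"account": "A1", "state": "WA", "ssn": "111-22-3333", "day": "2020-07-01", "amount": "4.50"},
    {"account": "B2", "state": "ca", "ssn": "444-55-6666", "day": "2020-07-02", "amount": "20.00"},
]

SENSITIVE_COLUMNS = {"ssn"}  # hypothetical columns to drop on the way to cleansed


def cleanse(record):
    """Raw -> cleansed: drop sensitive columns, normalize values, cast types."""
    out = {k: v for k, v in record.items() if k not in SENSITIVE_COLUMNS}
    out["state"] = out["state"].strip().upper()
    out["amount"] = float(out["amount"])
    return out


def aggregate(cleansed_rows):
    """Cleansed -> ready for analytics: total amount per account per day."""
    totals = defaultdict(float)
    for row in cleansed_rows:
        totals[(row["account"], row["day"])] += row["amount"]
    return [
        {"account": account, "day": day, "total_amount": total}
        for (account, day), total in sorted(totals.items())
    ]


cleansed = [cleanse(record) for record in RAW]
ready = aggregate(cleansed)
print(cleansed)
print(ready)
```

The cleansed rows keep the detail and the aggregated rows are what most consumers query first, which matches the idea of offering both layers.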
And my point with the Vanguard architecture is to show you the patterns of data ingestion and data transformation, and the layers of the data lake that you're possibly going to build when you build data lakes on AWS. Okay. And in this use case, they like using EMR, and they're using a ton of EMR in processing this data from one layer to another layer.
All right, Salesforce, Data Management Platform, DMP. This is popular with... This is an ad tech platform, actually. So, there is a real-time component to this. They ingested 40 petabytes of data, a lot of data, growing about 4 percent week-over-week, which still is a ton of data. Okay. And, to make all these things happen, they spin up... And, this is the point that I'm trying to make: they spin up 3,000 EMR clusters, transiently, meaning there are very few long-running clusters. They will spin up a cluster, do a job, and shut it down. Most of it is done on Spot. Transient EMR clusters are their popular choice for data transformation. And, like I said, there's a real-time component to this, because this is an ad tech platform. Okay. Again, left to right. Just follow me. I'm going to walk you guys through this.
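A transient cluster, one that comes up, runs a job, and shuts itself down, can be sketched as a request to the EMR RunJobFlow API. This is a sketch under assumptions, not Salesforce's actual configuration: the job name, script location, instance types, release label and role names are placeholders. The transient behavior comes from `KeepJobFlowAliveWhenNoSteps` being false, and the Spot usage from `Market` being `SPOT` on the core instance group.

```python
def build_transient_cluster_request(job_name, script_s3_uri, core_nodes=2):
    """Build keyword arguments for the EMR RunJobFlow API call."""
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-5.30.0",  # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Transient: the cluster terminates once its last step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": job_name,
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request):
    """Submit the cluster request. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


# Hypothetical job name and script path.
request = build_transient_cluster_request(
    "nightly-etl", "s3://example-data-lake/jobs/transform.py", core_nodes=4
)
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

Because each job gets its own short-lived cluster, there is nothing to keep running between jobs, and data has to live in S3 rather than on the cluster.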
On the left, there is data collection, many different sources of data. And, you will see the sources that they have. They will ingest data into S3. But for real-time, they are going to put data into Kafka. All right. They do three types of operations on top of this data. So, they have three pipelines. One is an event-based pipeline. Okay. It's more ad hoc, human-based. Next is a batch mode, and the third is real-time streaming. Okay. So, they've got three pipelines. For most of this, they use EMR, and like I said, about 3,000 clusters coming up and going down on Spot every day. Then, there is something else I want you guys to pay attention to, on top of the segmentation of pipelines and the transient nature of EMR. Although all data that gets processed goes to S3, for certain types of data, for all OLAP-type queries, they put that in Redshift, okay, immediately.
And this is another pattern: for data that they need almost instantaneously, because they are running an ad platform, they need a data store that responds in milliseconds; and for that, they put this data in DynamoDB. Three different types of data transformation pipelines, three different types of data stores, but everything goes to S3, because that's their single source of truth. Next, we've got to go quick. Fortnite, Epic Games. I'm sure we have people who love Epic Games, I'm sorry, Fortnite here. Fortnite is free to play, but they offer micro-transactions, okay. So, they're going to have to present opportunities to the user to make a transaction. So, this is a real-time use case, which means they're going to analyze user behavior, and offer users opportunities to make those transactions.
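Going back to the Salesforce DMP pattern of three data stores with S3 as the single source of truth, the routing decision can be sketched as a small Python function. The `olap` and `low_latency` flags and the sample records are hypothetical; the talk does not say how the real platform decides where each record goes.

```python
def choose_sinks(record):
    """Return the data stores a processed record should be written to."""
    sinks = ["s3"]  # everything lands in S3, the single source of truth
    if record.get("olap"):
        sinks.append("redshift")  # data needed for OLAP-type queries
    if record.get("low_latency"):
        sinks.append("dynamodb")  # data that must be served almost instantly
    return sinks


# Hypothetical processed records with made-up routing flags.
records = [
    {"id": "segment-summary", "olap": True},
    {"id": "user-profile", "low_latency": True},
    {"id": "raw-event"},
]

routing = {record["id"]: choose_sinks(record) for record in records}
print(routing)
```

Whatever else a record feeds, S3 always gets a copy, so Redshift and DynamoDB can be treated as serving layers that can be rebuilt from the lake.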
So, how does it work? Again, look at the pattern here. A whole bunch of data ingestion sources move through Kinesis. Okay. They have two pipelines: one is a near real-time pipeline, and one is a batch pipeline. The near real-time pipeline uses Spark Streaming directly with Kinesis. There is no time to put data on S3 with real-time. It's Kinesis, Spark Streaming, into DynamoDB, immediately. So, there are APIs waiting to use this data. So, that's how this one works. And in the batch mode, you can put this in S3, you can put this in databases, you can write a batch-mode ETL, process it back to S3, and then you would do BI and dashboarding. See, two different pipelines. I've got one more. Asurion, they are a leader in device protection and customer support. What they want to do is customer sentiment analysis.
So, they analyze calls from call centers and social data to understand customer sentiment. And, they've got about 300 million customers. So, a lot of data, and a lot of customers. How do they do that? Again, look at the patterns here: a whole bunch of data sources, and they've got a real-time component that goes through Kinesis, straight into DynamoDB. Okay. So, very little data transformation, and straight into DynamoDB. And, there are APIs that are feeding off DynamoDB. And the rest of it goes into S3. They have EMR do the transformation, and they put some data into Redshift. So, my point here is to show you patterns, batch mode, near real-time, and how some of these services are used by our customers to give you scale. Okay. A quick 30 seconds on Lake Formation, definitely check it out.
It's a security and governance layer on top of your data lake. We have an ingestion service. We have a security layer. So, you get to pick your user. You get to pick which database, which table, and what type of access you're giving. And, we also have data stewardship permissions in Lake Formation. Next, as you build data lakes, if you want to share a table or two, or a whole database, with another account, which may be a customer or a partner, go for it. We have external account support. You give the account number, you pick what you want to share, and send, and it goes. Definitely check it out. And of course, BI and dashboarding. So, that's it. I think I'm on time. Thank you so much.
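The Lake Formation permission model described above (pick a user, a database, a table, and a type of access) can be sketched as a request to the Lake Formation GrantPermissions API. The role ARN, account number, database and table names are placeholders, and identifying an external account by its 12-digit account ID is an assumption about how the cross-account case is expressed. The actual call is wrapped in its own function because it needs boto3 and AWS credentials.

```python
def build_grant_request(principal, database, table, permissions):
    """Build keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant(request):
    """Send the grant. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# An analyst role in the same account gets read access to one table.
analyst_grant = build_grant_request(
    principal="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="analytics_ready",
    table="daily_account_totals",
    permissions=["SELECT"],
)

# Sharing the same table with an external (partner) account,
# assumed here to be identified by its account ID.
partner_grant = build_grant_request(
    principal="210987654321",
    database="analytics_ready",
    table="daily_account_totals",
    permissions=["SELECT"],
)
print(analyst_grant)
print(partner_grant)
```

Both grants have the same shape; only the principal changes, which is what makes sharing with another account a matter of naming that account.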