Subsurface LIVE Winter 2021
Plan, Design and Build a Successful Data Lake on AWS
Data lakes are a popular approach for organizations to break down data silos and make data accessible to their users in a secure manner. There are also numerous ways a company can build a data lake to fit their business needs. However, there are just as many ways to fail. In this session, you will learn from the journey of others how to best plan, design and build a successful data lake that will scale and evolve as your business needs and demands increase.
Roy Hasson, WW Analytics Specialist Leader, AWS
Roy Hasson is a Worldwide Analytics Specialist leader at Amazon Web Services, where he helps transform organizations using data, analytics and machine learning. Roy serves as an expert advisor to customers across all industries to transform their business and become a data-driven organization by building a cloud-native modern data architecture on AWS. He is also a product leader driven by the voice of the customer to guide the development of new innovative services and user experiences for AWS. Prior to AWS, Roy spent 15 years working with tier-one service providers to design and deploy large-scale data systems used to serve today’s cable modem, voice over IP and wireless data services.
Well, welcome everyone. Thank you for joining us for Plan, Design and Build a Successful Data Lake on AWS, presented by Roy Hasson, a worldwide analytics specialist leader at AWS. Before we get started with the presentation, just a couple of housekeeping notes. As everybody knows, as you've gotten used to the Subsurface Conference, we use Hopin, which enables you to ask your questions [00:00:30] live at the end of the session. So at the end of the session, you'll see a button up in the top right-hand corner of your screen. You can use the button to ask to share your audio and video. We'll then go ahead and let you into the session and moderate a live Q&A. You can also ask your questions in the chat; we'll read them back, and I'll moderate those questions with Roy at the end.
The other thing that we want to remind you is that over on the right-hand side of the platform, you'll see a tab that [00:01:00] says slide-out. That's our quick session survey. We really appreciate it if you could click on that slide-out when the session is over and fill in your feedback for the session; that's how we can continue to improve Subsurface as we iterate in the months to come. And with that, I am going to turn it over to Roy. Roy, thank you so much for being here today and have a great session.
Thank you, Melissa. I appreciate it. So hi again, everyone. [00:01:30] My name is Roy Hasson. I'm actually a principal product manager at AWS, focusing on the AWS Glue and AWS Lake Formation services. My talk today will cover the reasons why you'd want to build a Data Lake and the benefits of expanding it into a Lake House architecture. We'll dive into three scenarios of building a Data Lake and shed some light on the considerations you should make along the way. [00:02:00] So as most of you may already be aware, data volumes are increasing at an unprecedented rate, exploding from terabytes to petabytes of data, and sometimes even exabytes of data. Traditional on-premises data analytics approaches can't handle these data volumes because they don't scale well enough and are simply too expensive. We hear from companies all the time that they're looking to extract more value from their data, but struggle to capture, store and analyze all the data generated [00:02:30] by today's modern digital businesses. Data is coming from new sources, is increasingly diverse and needs to be securely accessed and analyzed by any number of applications and users.
Against this backdrop of increasing data volumes and the desire to get the right data to the right people, 2020 has shown us that we need to be prepared to deal with change. Many companies are taking all their data from [00:03:00] various silos and aggregating it in one location, what we call a Data Lake, to do analytics and ML directly on top of it. Over time, these same companies are storing other data in purpose-built data stores, like a data warehouse to get quick results for complex queries on structured data, or a search service to quickly search and analyze log data to monitor the health of production systems. We see that more and more companies are using both [00:03:30] Data Lakes and these purpose-built stores, and they often need to also move data between these systems: for example, moving data from the Lake to purpose-built stores, from those stores to the Lake, and then between purpose-built stores. It's kind of how a successful basketball team needs to be able to play inside out, outside in and around the perimeter. So let's dig into each of these concepts, starting with the inside out.
[00:04:00] Think of storing your data in your Data Lake, and then moving a portion of that data to a purpose-built data store to do additional machine learning or analytics, as the inside out approach. Your data lives in a central Data Lake repository, and for specific use cases, you move a portion of that data out to the analytics services for additional processing. For example, you may want to collect web click-stream data from your applications directly in the Data Lake, and then move a portion of the data out [00:04:30] to the data warehouse or into Dremio, for your daily reporting and dashboards. So let's look at outside in. As mentioned previously, these purpose-built stores will continue to hold a portion of a company's data, and while you have data in purpose-built stores, you may need to move some of it into your Data Lake to do additional analysis and machine learning that may be difficult or suboptimal to do on top of the purpose-built [00:05:00] stores. For example, performing complex aggregation on a NoSQL database is less than ideal.
Think of this as an outside in approach: your data lives in purpose-built data stores that are optimized to give you the performance, scale and function you need, and you can move a portion of it into your Data Lake for additional processing when you need to. For example, you may want to move the query results of the sales of products in a given region [00:05:30] from your data warehouse into your Data Lake to train product recommendation algorithms using machine learning. Now let's look at movement around the perimeter, between the purpose-built stores. In many situations, you may want to move data from one purpose-built store, like your relational database, to another, like your data warehouse, or between multiple relational databases. Think of this as moving the ball around the perimeter. This means being able to easily [00:06:00] replicate and combine data across multiple purpose-built data stores. For example, your product catalog may be in a relational database and you want to move it to a search service to make it easy to search for products.
So let's put it all together. It's important and correct to use the right tool for the right job. And it's equally important to ensure the data can easily get to wherever it's needed with the right [00:06:30] controls to enable analysis and insights. As data in these Data Lakes and purpose-built stores continues to grow, it becomes harder to move all of it around, because data has gravity. To get the most out of their data at any scale, customers are rapidly modernizing their data architectures. To achieve this, customers are building a modern data architecture that allows them to rapidly build and scale [00:07:00] a Data Lake, using a broad and deep collection of purpose-built data services that provide the performance required for use cases like interactive dashboards and log analytics. It also enables them to easily move data between the Data Lake and purpose-built data services, setting up governance and compliance in a unified way to secure, monitor and manage access to their data.
Choosing a cloud service that gives them lower costs without [00:07:30] compromising performance or scale, we call this a modern cloud-based analytics approach to Lake House architecture. It's not just about integrating your Data Lake with your data warehouse or enabling ACID transactions on top of an object store; it's about connecting your Lake, your warehouse, Dremio and all of the other purpose-built services into a coherent system. Let's double click on this and see how to design a Lake House architecture. [00:08:00] So here you can see three common data architecture design patterns. First is a single account design that is simple to get started with, because data ingestion, storage and consumption are all managed within a single AWS account. Second is a hub and spoke design that designates an AWS account as the producer of data, also the data owner and domain expert.
There are multiple data consumer accounts that either [00:08:30] subscribe to data feeds from the producer or simply query the producer's account directly. This is useful when you have a central data team that is responsible for all of the data assets. Third is a Data Mesh design, popularized by the consulting firm ThoughtWorks, that brings together domain-driven architecture and self-service platform design to enable independence with global governance and oversight. Let's start with the single account design. [00:09:00] So first we need to ingest some data. Typically, it would be from an operational database, on premises or in the cloud. Most commonly you'd want to either take periodic snapshots of data or capture and stream changes in real time. You may also have streaming events from your microservices that need to be ingested in real time, such as ad impressions and e-commerce tracking. Having lots of these [00:09:30] connectors is a bonus, making it really easy to bring in data from SaaS providers and many other data sources. Using an engine like Apache Spark is a popular option because it gives you speed, scalability and flexibility. Spark runs SQL, Scala and Python, making it testable and also portable.
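The two ingestion modes mentioned here, periodic snapshots versus capturing a stream of changes, can be sketched in a few lines of plain Python. This is only an illustration of the logic; the record and change-event shapes are assumptions for the example, not any AWS API.

```python
def take_snapshot(source_rows):
    """Periodic snapshot: copy the source table, keyed by primary key."""
    return {row["id"]: dict(row) for row in source_rows}

def apply_cdc(table, events):
    """Apply an ordered stream of insert/update/delete change events."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["row"]["id"]] = dict(ev["row"])
        elif ev["op"] == "delete":
            table.pop(ev["row"]["id"], None)
    return table

# Start from a snapshot, then keep it current by replaying changes.
source = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
table = take_snapshot(source)
apply_cdc(table, [
    {"op": "update", "row": {"id": 2, "name": "bobby"}},
    {"op": "insert", "row": {"id": 3, "name": "carol"}},
    {"op": "delete", "row": {"id": 1}},
])
```

The trade-off the talk alludes to: snapshots are simple but re-copy everything, while CDC moves only the deltas in near real time at the cost of more moving parts.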
Putting it all together, having a workflow orchestration system such as AWS Glue workflows, or even Apache Airflow, makes it easy to automate, monitor [00:10:00] and alert on the pipelines you build. The data you just ingested needs to be stored somewhere. Amazon S3 is an object store that is ideal for this purpose because of its ability to scale almost infinitely, support high rates of read and write operations, and be extremely durable. You will typically see a three-tier design, starting with raw data being ingested and stored in standard file formats [00:10:30] such as JSON and CSV. This gives you a common baseline if you ever need to reprocess the data; it's common to define an S3 lifecycle policy to archive the raw data into S3 Glacier after some period of time. Next, you want to verify the quality and completeness of the data, comparing source record counts to what was ingested and validating that the data types match the schema. Then you can start to process the data, [00:11:00] cleaning up common glitches such as date and timestamp formatting, filling in empty fields with nulls or even default values, and flattening nested records.
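The cleanup steps described here can be sketched in plain Python: normalizing timestamps, filling empty fields with nulls or defaults, and flattening nested records. The record shape, the `_ts` suffix convention and the epoch-seconds source format are assumptions made for the example.

```python
from datetime import datetime, timezone

def flatten(record, prefix=""):
    """Flatten nested dicts into dot-separated top-level keys."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, f"{name}."))
        else:
            out[name] = value
    return out

def clean(record, defaults):
    """Fill empty fields with a default (or null) and normalize timestamps."""
    rec = flatten(record)
    for key, value in list(rec.items()):
        if value in ("", None):
            rec[key] = defaults.get(key)  # default value, or None (null)
        elif key.endswith("_ts"):
            # assumption: source sends epoch seconds; store ISO 8601 UTC
            rec[key] = datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    return rec

raw = {"user": {"id": 7, "country": ""}, "event_ts": 1612137600, "page": None}
row = clean(raw, defaults={"user.country": "unknown"})
```

In a real pipeline this logic would run inside your Spark job or Glue workflow; the point is that the refined zone holds consistently typed, flattened, null-safe records.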
The output of this process is typically stored in a refined zone, which can be a separate S3 bucket, or even just a folder within an existing bucket. Data in the refined zone tends to be stored in an optimized file format, such as Apache Parquet [inaudible 00:11:25]. At this point, the data is ready to be consumed. [00:11:30] So this is where the data scientists will access data for exploration and model training. Analysts will also come here to experiment and do their analytics. However, many times there is a need to curate some of this data and produce tailor-made datasets to serve specific needs, such as pre-computing popular dashboards, or even loading product data into Elasticsearch to update your site's search [00:12:00] index. Now that your data is clean, stored and optimized, you need to make it discoverable. There are many open source options today, such as the ones released by Netflix, LinkedIn and Uber, as well as a few commercial alternatives.
The AWS Glue Data Catalog is a serverless and cost-effective option. It allows you to automatically crawl your data sources, extract the schema or learn it dynamically from your real-time [00:12:30] stream, and update it in the catalog. Because it's compatible with the Apache Hive metastore APIs, you can easily connect many different query engines, such as Dremio. So we ingested and processed the data, and we landed it in an S3-based Data Lake for storage in a format that's easy to consume. We cataloged the schema to make it easy for users to find what they're looking for and to plug in their tool of choice [00:13:00] to query the data. But before we can open it up for users, we need to define owners and permissions. Being in a single AWS account makes it just a little bit easier. We can use AWS Lake Formation to define Data Lake admins that will delegate permissions to owners or stewards responsible for managing and auditing access to their respective data sets.
These dataset owners can configure fine-grained access permissions on databases, tables and [00:13:30] columns for user identities defined in either AWS Identity and Access Management (IAM) users and roles, or SAML-based identity providers such as Microsoft Active Directory, Okta and Auth0. So congratulations, you've just created your first Data Lake; hopefully it wasn't that complicated. What will turn it into a Lake House architecture is the ease with which you can offer to your consumers, [00:14:00] in a self-service manner, a query engine such as Amazon Athena or Dremio, a data warehouse such as Amazon Redshift, machine learning notebooks and automated training with Amazon SageMaker, and any other tool of their choice, all fed and governed by the Data Lake. Up until now, we covered the basics you really need to build a Data Lake within a single AWS account.
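The database/table/column-level permissions described here boil down to projecting away columns a principal has no grant on. Here is a toy sketch of that idea; the grant structure and names are illustrative assumptions, not the Lake Formation API, which enforces this for you inside the query engines.

```python
# principal -> {(database, table): allowed column set, or "*" for all columns}
GRANTS = {
    "analyst_role": {("sales", "orders"): {"order_id", "region", "amount"}},
    "admin_role": {("sales", "orders"): "*"},
}

def authorize(principal, database, table, row):
    """Return only the columns this principal is allowed to see."""
    allowed = GRANTS.get(principal, {}).get((database, table))
    if allowed is None:
        raise PermissionError(f"{principal} has no grant on {database}.{table}")
    if allowed == "*":
        return dict(row)
    # column-level grant: project the row to the allowed columns
    return {col: val for col, val in row.items() if col in allowed}

row = {"order_id": 9, "region": "EU", "amount": 42.0, "customer_email": "x@y.z"}
visible = authorize("analyst_role", "sales", "orders", row)
```

The value of doing this in a central service rather than in application code is exactly the point of the talk: one set of rules, enforced consistently no matter which engine issues the query.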
Authentication is fairly straightforward, [00:14:30] using either IAM or an external identity provider. You ingest, clean and verify the quality of your data before storing it into folders on [inaudible 00:14:41] for analytics and ML. You can expand your data pipelines to further curate the data if needed by the business, but that's not always necessary. The data is automatically cataloged, so it's easy for your users to find what they need, and in most cases you will need to define granular permissions [00:15:00] to control access to this data, which you can easily do in AWS Lake Formation. A typical evolution from a single account Data Lake design is to expand by offering to share data sets with a consumer account. This can be your dev/test account, another line of business, or even a business partner that needs access to some of your data. Lake Formation makes it super [00:15:30] simple to share data sets across AWS accounts in a tool-agnostic way. Once a data set is shared with a consumer account, it can be queried like any other table.
The consumer account can use their tool of choice, such as Dremio, to query the data directly or join it with other data sets. Lake Formation allows you to share an entire database, a set of tables, and even specific columns within [00:16:00] a table. At re:Invent 2020, we announced preview support for row-level permissions as well. Because we share the data at the catalog level, while automatically managing S3 permissions for you, the shared resources can be easily accessed from your choice of tools. You can eliminate the need for complex and slow views and simply define what users can access based on a set of permission [00:16:30] rules enforced in a consistent manner by Lake Formation. Right, so you no longer need external systems to define and enforce policies; it's now managed for you in one place.
Taking the producer-consumer model a step further, there are situations where you may want to separate the governance from the producer. This pattern is common in highly regulated environments, such as financial services and healthcare. In this case, your [00:17:00] Data Lake processing and storage is managed in a single AWS account. The data catalog, permissions and governance are in a second account, which then shares approved datasets with multiple consumer accounts. Again, the consumers may have their own data processing pipelines. They may have local sandbox storage for their own data, but are not really allowed to publish any data sets back into the main corporate Data Lake. If any data [00:17:30] sets should be published, they must be moved to the Data Lake account and follow the agreed-upon process. This design maintains separation of concerns and enforces a process with owners that can be easily audited. When considering the producer-consumer model, you're looking to enforce clear separation of concerns with a well-defined process to maintain a single source of truth for your data.
Data processing will be done centrally, [00:18:00] typically with a dedicated data engineering team, in a consistent manner that's simpler to audit and maintains a high bar for the quality of the data. Data ownership, access permissions and auditing will be handled centrally, typically by a dedicated governance and security team. A natural extension of the producer-consumer model is a Data Mesh. [00:18:30] So a Data Mesh combines the producer and consumer roles into a data domain. For example, the e-commerce store team will be responsible for their own domain. They understand the meaning of their data. They know what good quality means for their data and how to best present it, so other teams such as the recommendation team or search team can gain the most benefit from that data. Therefore, it makes sense for the e-commerce team in this example to create [00:19:00] their own Data Lake with their own technology stack and business logic.
Each data domain you create is like a microservice within a large data fabric or platform. As in a typical microservices architecture, the domains communicate over well-defined APIs. For analytics and ML, it's simple to use Lake Formation's data sharing capabilities, as we've seen in earlier designs. So each data domain will decide what data to expose and how to [00:19:30] secure it. They will also be responsible for fixing any quality or mismatch issues with the data. That's how each data domain is fully owned by a team that is responsible for the input and output of that data. But to really turn data into a product, we expand data sharing by exposing a set of APIs that can be integrated with any application. At re:Invent 2020, we announced [00:20:00] a preview feature for Lake Formation that exposes a new data API, allowing you to submit queries directly through Lake Formation.
With fine-grained access permissions and query acceleration, the APIs provide a powerful way to turn your data domain into a SaaS platform and enable new ways to monetize your data. This is really, really powerful, and this is why a Data Mesh architecture is really interesting. Using Dremio as an entry point into your data [00:20:30] domain is another approach that can provide fast access for your consumers' dashboards and analytics use cases. So a Data Mesh design has its appeal, but it's not for everyone. It requires the entire organization to think of and treat data as a product. It typically involves dedicated teams to manage the full life cycle of data and truly own it end to end, within each of these data domains. [00:21:00] By building on AWS, customers can standardize on a technology stack that's well integrated, easy to maintain and scale, and is cost effective for the organization. When applied correctly, a mesh design enables each team to innovate at their own pace, increasing overall business value.
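The "data domain as a product" idea above can be made concrete with a toy sketch: each domain owns its tables, decides which columns it exposes, and serves consumers through a small query API. The class, the column-sharing scheme and the sample data are all hypothetical illustrations, not the Lake Formation data API.

```python
class DataDomain:
    """A data domain owns its tables and exposes only what it chooses to share."""

    def __init__(self, name, tables, shared_columns):
        self.name = name
        self.tables = tables                  # table name -> list of row dicts
        self.shared_columns = shared_columns  # table name -> columns exposed

    def query(self, table, predicate=lambda row: True):
        """Return matching rows, projected to the columns this domain shares."""
        cols = self.shared_columns[table]
        return [{c: row[c] for c in cols}
                for row in self.tables[table] if predicate(row)]

# The e-commerce domain from the talk: it shares product data with other
# teams but keeps internal fields (here, cost) private.
ecommerce = DataDomain(
    "ecommerce",
    tables={"products": [
        {"sku": "A1", "title": "Lamp", "cost": 3.5, "price": 9.0},
        {"sku": "B2", "title": "Desk", "cost": 40.0, "price": 99.0},
    ]},
    shared_columns={"products": ["sku", "title", "price"]},
)
results = ecommerce.query("products", lambda r: r["price"] < 50)
```

Consumers such as a recommendation or search team call the API and never see the private columns, which is the mesh's contract: the owning team controls exposure, quality and fixes, end to end.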
Okay, so kind of recapping everything that we talked about. There are many ways to extract value from data, [00:21:30] but to do it in a sustainable, scalable and cost-effective way, you need to build a Data Lake. A foundational Data Lake is based on a decoupled architecture, including a scalable, highly durable object store; a central catalog to facilitate discovery and access; and governance to enforce fine-grained access permissions and the ability to audit usage. Your Data Lake should leverage open file formats and engines [00:22:00] to enable code reuse and portability between systems, and to reduce potential lock-in. Once your Data Lake is built, you can easily extend it into a Lake House architecture by plugging in purpose-built and best-fit tools to deliver on your use cases. So if there's one thing that you remember from this presentation today, let it be that building a foundational Data Lake is crucial to being successful with all of your data and ML initiatives. [00:22:30] Don't take shortcuts.
As you saw from my presentation, getting started is simple, and we're always here to help, both AWS and Dremio. I did want to give just one quick shout out to Excalidraw; I've used their open and free drawing tool for all my diagrams. So if you like it, definitely go check it out, pretty neat tool. With that I'm going to wrap up and say thank you again. My name is Roy Hasson. My [00:23:00] information is down below here. So please feel free to connect with me, send me any questions, comments, anything. I'd love to have a conversation. So with that, I'll pass it back to Melissa and we can go ahead [crosstalk 00:23:12].
Thanks Roy, we've got a couple of people in queue, and I'm going to go ahead and do this. If you want to go ahead and stop sharing your slides, we can bring people up on screen and start seeing if we can get any of these people going. In the meantime, don't forget to ask your questions [00:23:30] in the Slack. I have a question from Hamid: is sharing only SQL, or [inaudible 00:23:39] files to be shared? Files have lots of different records in them, belonging to different accounts' users, so control is not granular.
Yeah. So the way that data sharing works is, when you register a table into the AWS Glue Data Catalog, the data [00:24:00] set that's backed by that table is a bunch of files on S3, right, objects on S3. Now, typically those files are representative of a single table, right? So you're not going to mix and match different files to represent a single table. So that table definition, the schema and the objects themselves, when we share it through Lake Formation with another account, the other account just sees it as a normal table. They can use SQL on it, they can use Spark or Presto or Hive or Dremio [00:24:30] on that table. It doesn't really matter whether it's a Spark job running on it or a SQL query.
Great. Thank you so much, Roy. I have another question from a Cdev, does AWS Lake Formation allow for attribute level access control and masking?
So in the fullness of time, that is definitely something that we would like to provide for our customers. Right now we support database, table, column and, in preview, rows.
[00:25:00] Okay. I just want to remind people, if you do want to ask your question live and on camera with Roy, you can go ahead and push that button to share your audio and video up on the right hand side. And then you can… There we go. It looks like we have James Gosnell joining us in just a second. James, does your camera work? Can you hear us? In the meantime while he’s getting loaded, I’m going to go ahead and ask another question for you. Any other ETL tools preferred for AWS Data Lake build?
[00:25:30] It's really up to you. I mean, our recommendation, as I mentioned in my presentation, is to stick with tools that give you portability, both of your data and your jobs. There are plenty of commercial options out there that give you visual ETL and things like that. But then after years of using them, you kind of realize that your job and your code may be locked in. [00:26:00] So focus on tools like Spark and Hive and Presto that give you that portability. So if you write your code in SQL, or you write in Python or Scala, it's easier to take that code to something else if you feel like you need to change your tool.
Hi, Kenneth, welcome to the stage. Kenneth, why don’t you go ahead and ask your question to Roy?
Yes. The question I have is this: if you are not using Spark, is there any way of using just Dremio against the Data [00:26:30] Lake and then creating a secondary environment just for the analytics, reducing workload?
Yeah. Thanks for the question, Kenneth. Yeah, absolutely. So again, the Data Lake on AWS is data in S3 and the schema for the data in the Glue catalog. Dremio can plug into that and create datasets just like any other place.
Curious about taking, say, a month of data [00:27:00] from the Data Lake, because I want to deploy for a specific audience, say for instance for the executives. Then I will be able to extract the data I want, package it up, put it in a database of my choice and then use Dremio against that. Is that possible?
Yeah, exactly. I mean, you can use Dremio against the data in S3, or you can put the data into, I don't know, another database let's say, and have Dremio point at it, as long as [crosstalk 00:27:28].
Okay, so if I want to create an environment [00:27:30] that is more like a subscriber type of workload, meaning that I subscribe from the Data Lake to deploy to different audiences depending on security profile and other things. How would that work? How would I be able to do this using this tool, using Dremio?
Yeah, so in a producer-consumer model, I showed an example where you have one AWS account that hosts the data on S3 and exposes the [00:28:00] schema through the catalog, and then using Lake Formation, you share that table, right, the table definition, with another account. That second account is your consumer, right? That's your finance team, for example; they are using Dremio because that's what is driving their dashboards. Dremio will then be able to see the table that you shared, exposed to them in the Glue catalog in that account. And then from Dremio's perspective, it just looks like a table. They create datasets as they normally do, and [00:28:30] the end consumer doesn't need to know.
Okay. So in this type of integration, what is the security like? For instance, for a bank, you need a lot of encryption and some kind of security firewall, or whatever rules you need to create for that sharing. In this type of integration, are there any security concerns that we need to be aware of? Something [00:29:00] that you need to subscribe to?
Yeah. Honestly, I think we should take this offline because this gets pretty deep. There’s a lot of things here that we can discuss. Yeah, I think we should probably take it offline. If you want to reach out to me offline, I’d be happy to [crosstalk 00:29:17] conversation.
Sure, I will.
Okay. And I think we’re just about out of time for questions-
Okay, thank you.
But I did want to say I've pinned a chat message at the top that has a link to [00:29:30] Roy's Slack channel in the Subsurface Slack community. So if you go ahead and click that link, Roy will be over there for about the next 30 minutes, answering questions in the Slack channel. And then I also wanted to remind everyone to go ahead and click on that slide-out tab on the right-hand side so that you can give us some quick feedback on the session. Also a reminder, we've got two more great sessions coming up at 11:05: The 2021 State of Data Operations: Emerging Challenges and Expanding Cloud Data Ecosystems, and then we also have a great session presented by Gautam Kowshik of Adobe, Iceberg at Adobe: Challenges, Lessons and Achievements. So go ahead and make your way over to the next sessions. And again, Roy will be over in the Slack channel for the next 30 minutes answering questions. And Hamid, I'm sorry we didn't get to your question today-
But maybe in the next session. Thanks so much everyone, have a great day.
[00:30:30] Thank you.
Thanks everyone. I appreciate it.