Best Practices for Building a Scalable and Secure Data Lake on AWS

Session Abstract

In this session, Roy will share architectural patterns, approaches, and best practices for building scalable data lakes on AWS. You will learn how to first build a data lake and then extend it to meet your company's needs using the producer-consumer and data mesh architectural patterns. You will learn how AWS Lake Formation makes it simple to deploy these architectures by allowing you to securely share data between teams using their choice of tools, including Dremio, Amazon Redshift, and Amazon Athena.

Video Transcript

Roy Hasson:    … Arthur. Appreciate it. Again, my name is Roy Hasson, I’m a principal product manager at AWS. Thank you for this opportunity. I think this is a really exciting time, I saw there are a lot of really, really awesome sessions, and hopefully I get a chance to catch them as well. What I wanted to do today is talk about the best practices for managing, or at least scaling, a data lake on AWS. [00:00:30] So the way I want to go about this is to start by explaining what a data lake is. I’m sure lots of people here know what it is, but it’s a good way to level set. So the way that we think about it is that a data lake stores all of your data in a central and secure repository. It’s all cataloged and made discoverable to users so they can consume it.

The main benefit of a data lake is really to break down [00:01:00] these data silos that organizations have and encourage more collaboration to be able to extract insights and guide better business decisions. We also recognize that just creating a data lake is not enough, and we often see customers struggle to take full advantage of the data lakes that they’ve built. So to help customers benefit from their data lakes quicker, we introduced this idea of a Lake House approach. At a high level, [00:01:30] the Lake House approach starts with the core data lake at the center and builds around it. Using tools like Amazon S3 to store the data gives you that scalable data lake you need to be able to ingest more data into it. But you can also provide purpose-built data services around it, to consume the data in ways that make sense for the use case and for the consumers who are actually using that data.

Do [00:02:00] it all in an automated way, so that data can flow from the data lake, that single source of truth, to these purpose-built stores automatically. You don’t need to worry about standing up ETL pipelines for specific jobs; there are processes and tools already in place to help do that. But then also be able to centrally manage and govern that data in a way that offers security down to fine-grained access controls at the column and row level when users are trying to access that data. [00:02:30] And lastly, do it in a performant way so that no one tool gets an advantage over the other, and you have good performance across the board, regardless of what tool your users want to use. So again, it’s not a specific architecture, it’s more of an approach, a way to solve this problem.

So the main benefit that we see with a Lake House approach is that you can take full advantage of your existing investment in your S3-based data lake without having to build new ones [00:03:00] or even make any major changes to your existing one. The Lake House approach also follows common best practices and fits in with well understood design patterns like the producer-consumer model, and also Data Mesh, which you’ll hear lots about in this conference. And this allows you to create decoupled, distributed, and also highly scalable data platforms that can meet your business needs. The data platform can also easily provide [00:03:30] ubiquitous access to data from a wide range of tools like Dremio, Amazon Redshift, and Amazon Athena, and having that ubiquitous access is really what drives the self-service nature of analytics.

And then being able to raise the bar on security and privacy is really, really important. It’s not enough just to give access to data, you’ve got to do it in a secure way. And the Lake House approach allows you to define standard best practices and [00:04:00] processes to encrypt the data, make sure data is classified properly and is discoverable in a way that users can understand whether they have access to it or not. And then when they are accessing it, doing it in a secure way. So let’s just take a quick, super high level view of what an architecture inspired by the Lake House approach can look like. Starting off very simple, there are a lot of different data sources out there, [00:04:30] and you need tools, you need capabilities with connectors to bring that data in. So for that we want to bring it in through a data preparation service.

In this case, I’m showing AWS Glue as that engine, because it offers a large number of connectors, and it offers different capabilities for different users, whether you want to write code or you want a visual authoring tool, you can do all those things. But be able to take that data and prepare it before you land it [00:05:00] into the data lake. Commonly, if you look two, three years ago when we talked about it, it was really about bringing data into the lake first and then figuring out what to do with it. But we learned over the years that it doesn’t really make a lot of sense because it’s extra work. So can we bring the data in and already prepare it, already clean it, and put it into the data lake, so we have better data, better quality, and better accuracy for that data before it becomes available? And to make it available, we want to be able to catalog [00:05:30] it.
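
As a rough illustration of that prepare-before-landing step, a minimal PySpark Glue job might look like the sketch below. The database, table, and bucket names are hypothetical, and a real job would include more validation and error handling.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dataset that a crawler or connector has already registered in the catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db",          # hypothetical database
    table_name="orders_raw"     # hypothetical table
)

# Clean and prepare before landing: drop rows without a key, keep only the columns we need
prepared = (
    raw.toDF()
       .dropna(subset=["order_id"])
       .select("order_id", "customer_id", "amount", "order_date")
)

# Land the curated data in the lake as partitioned Parquet
prepared.write.mode("append").partitionBy("order_date").parquet(
    "s3://my-data-lake/curated/orders/"
)

job.commit()
```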

So a component of this architecture is a data catalog that can be populated with the data sets from the data lake. Once the data is cataloged, it’s easily discoverable, but we’ve also got to make sure that we’re securing it. So being able to set fine-grained access control on top of the data, and to define governance policies around auditing, classification, lineage, and things like that, is critical to being able to scale your data lake. So as you [00:06:00] add more data, as you catalog more data, and you get more users who want to access it, if you don’t have the proper guidelines and guardrails in place around security and governance, it becomes really, really hard to scale. And we’re seeing that with our customers that have taken the data lake approach and then tried to scale with it. They start to hit some of these roadblocks, and we’re helping them work through how to make that work.
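
To give a concrete flavor of what fine-grained access control looks like in practice, here is a minimal sketch using the Lake Formation GrantPermissions API to expose only two columns of a table to an analyst role; the role ARN, database, table, and column names are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on just two columns of a cataloged table to an analyst role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],
        }
    },
    Permissions=["SELECT"],
)
```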

And once the data is cataloged and secured, now you can start accessing it. So whatever that may be, this is typically analytics [00:06:30] and machine learning, we always talk about that. But what if I have some other purpose-built stores like a transactional database or a time series database or a graph database? That’s not necessarily an analytics use case, but it is a way for me to access and analyze the data in different ways. I want to have a mechanism to take the data from the data lake acting as the source of truth, using that data preparation service, that automated service, to move that data into a purpose- [00:07:00] built system that allows me to deliver the use cases that I need to. So beyond just analytics and ML, there are other use cases of course for the data. This environment allows you to leverage that without having to reinvent the wheel or create new solutions for it. That’s, at a high level, what a Lake House inspired architecture could look like.

At the core of that, as you saw, was the data catalog. The data catalog is a really important aspect of any type of data platform, [00:07:30] especially when you’re trying to scale. If you’re doing something small, having a self-managed catalog like a Hive metastore or something else may make sense, but as you scale and you have more producers and more consumers of the data who want to discover what exists, having that catalog is really important. So because of that, the Glue catalog really serves as a single place to register all of the data assets, but also to find and understand what’s available so you can enable access [00:08:00] to that data. There are a lot of catalogs out there on the market, commercial catalogs, open source tools, and they all do different things, but very few of them do all of it.
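
As a small example of registering data assets, a Glue crawler pointed at an S3 prefix will populate the catalog with table and schema information; the crawler name, role, database, and path below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that registers everything under the curated prefix in the Glue Data Catalog
glue.create_crawler(
    Name="curated-orders-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="curated_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/curated/orders/"}]},
)

# Run it; when it finishes, the tables are discoverable and queryable through the catalog
glue.start_crawler(Name="curated-orders-crawler")
```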

The Glue Data Catalog gives you the ability to catalog the data, annotate it and tag it and put information about it, but it also enables you to access the data from engines. So it’s not just about data inventory, it’s also about data access. And that’s an important aspect when [00:08:30] you’re trying to scale out your data platform. Around that, we’re still trying to manage the data lake, there’s still security and other aspects, and it’s still not as simple as it would be if I took all of my data and just shoved it into a data warehouse, right. Just say, “I don’t want to deal with this. I’m just going to shove it into a data warehouse.” That works. But it’s not the most scalable, and it’s not the best practice for scaling your platform.

But AWS Lake [00:09:00] Formation is one of our services that helps to make that simpler, by offering a way to automatically manage the changing data in S3, which traditionally has been a bit more difficult, securing it and also making it discoverable and shareable. So this makes it a lot easier to onboard new data sets and also expose them in a self-service, but also secure, way to consumers. So to make data easier to [00:09:30] onboard, to make ingestion and preparation of data simpler and easier, and also to be able to update this data in S3, we introduced at re:Invent last year what we call the Lake Formation Governed Table, and this is in preview today. And it offers you reliability through ACID transactions, to be able to create data, update data, and delete data within the context of an ACID transaction, giving you more reliability for that data, especially as there are multiple [00:10:00] writers and readers of the same data.

It also offers storage optimization to make the performance of the data in S3 better for all the different systems that need to consume it. Rather than optimizing for one engine, let’s just optimize the data store and give better performance for everybody. And then it also offers you the ability to do versioning of the data and time travel, going back in time to previous versions of that data. So just double [00:10:30] clicking into that a little bit more. Governed Table transactions offer you two modes. The first one is what we call table-level transactions. And this is really designed for bulk inserts or deletes, or modifications to datasets. Typically, you have a large data ingestion pipeline that’s reading hundreds of terabytes of data, and it simply needs to take that data and dump it into the lake, maybe into an existing table, maybe into a [00:11:00] new table.

This offers you a way to do that in a performant way, but also within the context of a transaction. So you open, start a transaction, read the data, do some transformations, write the data. When that completes, we commit that transaction. And if users are querying at the same time, they don’t see any changes until we commit the transaction and the commit succeeds, right. So there’s no impact to the [inaudible 00:11:26] consumers while you’re doing this work. And this has previously been a challenge for customers [00:11:30] who are trying to manage different versions, or have to create a table on the side, do all the transformation, and then copy it to the live table in order not to impact users. So this is a way to do it without having to go through all of those complexities.
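
The open-write-commit flow just described maps to the Lake Formation transaction APIs. A rough sketch of the control flow is shown below; the actual data write is performed by an integrated engine such as Glue or Athena, and how the transaction ID is passed to that engine depends on the tool.

```python
import boto3

lf = boto3.client("lakeformation")

# Open a transaction for the bulk write; readers keep seeing the last committed
# version of the table until the commit below succeeds.
txn_id = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

try:
    # ... run the ETL that reads, transforms, and writes the governed table here,
    # handing it txn_id so its writes are tied to this transaction ...
    lf.commit_transaction(TransactionId=txn_id)
except Exception:
    # Roll back so consumers never see a partial write
    lf.cancel_transaction(TransactionId=txn_id)
    raise
```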

The second one, and probably the most exciting one that most people care about, is row-level transactions. This is the ability to take individual rows and insert, update, or delete them in the data [00:12:00] lake. So instead of having to go to S3, download the file, modify it, insert, update, delete those rows, and then write it back into S3, you call the Governed Table APIs and you say, “I want to update this record or this row.” And based on that information, Lake Formation will automatically find the appropriate files in S3, modify them, and then make that data available for consumers. And we do it very, very fast, so the user doesn’t really see any kind of impact when this is happening. [00:12:30] Now, the main difference between this and some of the alternatives on the market is that the metadata, the manifest, or the transaction log, or whatever you want to call it, is not managed by a physical file in S3, it’s actually managed by the Lake Formation service.

It gives us the ability to have much higher concurrency for reads and for writes. So if you have multiple engines, multiple ETL jobs or systems writing to the same table, while you have hundreds of queries running against [00:13:00] that same table, we’re able to manage that because it’s all going through a service and is not bottlenecked by a specific set of files on S3. It also makes it easier to port that metadata. If you wanted to take that metadata and use it for something else, like auditing or some kind of disaster recovery replay, it’s all part of the service, so you can easily call the API and get that metadata back out. And then the third thing I wanted to call out is that because it’s a set of APIs, [00:13:30] you don’t need any custom libraries or any custom code to be able to use it.
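
Because the manifest lives in the service rather than in a file on S3, you can pull it out with a plain API call. A minimal sketch, assuming a hypothetical governed table named curated_db.orders:

```python
import boto3

lf = boto3.client("lakeformation")

# List the S3 objects that currently make up the governed table, as tracked by the service
resp = lf.get_table_objects(DatabaseName="curated_db", TableName="orders")

for partition in resp["Objects"]:
    for obj in partition["Objects"]:
        print(partition.get("PartitionValues"), obj["Uri"], obj["Size"])
```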

It’s an API, just like any API in AWS, and if you want to interface with a Governed Table through a Lambda function, or your CLI, or maybe a custom application, you can simply call the API to write data and read data, all very easily. So it makes integration into your data lake much, much simpler. Okay. The next one [00:14:00] is around automatic compaction. One of the things that’s been a challenge for customers is a lot of small files, and how you manage them over time; it becomes more difficult and performance degrades. Governed Tables allow you to simply turn on a flag and say, I want to automatically compact small files within this Governed Table, and Lake Formation just manages it for you behind the scenes. All right. So those are the data management features.

There’s lots more, we can definitely talk about it. [00:14:30] If you have any interest, we’d love to connect with you on this. But when we’re talking about scaling your Lake House architecture or scaling your data lake, data sharing is made easy with Lake Formation, and this is really how we scale. Being able to take an entire database and share that database and all of its tables, or just specific tables, or even just specific columns and rows, with any other service that’s integrated with Lake Formation makes it easier for you to scale your environment. So let’s take a quick look at what that looks like. So when we looked at the first [00:15:00] architectural diagram, the Lake House inspired architecture, it boils down to what this picture here shows.

So I have a few S3 objects, different data sets, maybe I’m using Governed Tables, maybe I’m not. I also have a Glue ETL job that does my transformation, my enrichment, my ingestion. I also have the Glue and Lake Formation data catalog that catalogs all my data sets. And maybe Amazon Athena is just running some queries on it, typically [00:15:30] in a single account. But now I want to scale it. I want to add more users, more consumers; maybe they live in their own account, they pay their own bill, they have their own stuff, and I want to be able to extend to them. So I’m going to add another consumer account. This is another AWS account, I’ve got Redshift there, maybe I’m even using Dremio to query my data. It has its own local catalog with Lake Formation. I’m not showing an S3 bucket here, but there’s probably a local bucket as well.

But what I can do is actually create a resource [00:16:00] share that shares tables or databases or columns, like I showed before, from the data lake account to the consumer account. By simply sharing that resource with the consumer account, the metadata for those data sets just becomes available in the catalog in the consumer account. When the user runs a query inside Dremio or Redshift, the data is accessed directly from the S3 bucket in the data lake account. But it makes the data easily accessible and discoverable, [00:16:30] and secure, because you can define fine-grained access permissions on this resource share. It also allows the data lake user, the producer, to continue to manage that data without having to copy or duplicate any of it.
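
A resource share like this boils down to a Lake Formation grant where the principal is the consumer's AWS account ID. A minimal sketch, run in the data lake (producer) account, with hypothetical account IDs and names:

```python
import boto3

lf = boto3.client("lakeformation")

# Share selected columns of a table with the consumer account; Lake Formation
# handles the cross-account plumbing behind the scenes.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "444455556666"},  # consumer account ID
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "order_date", "amount"],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],  # lets the consumer admin re-grant locally
)
```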

So now I can simply add more consumer accounts and create these shares, and I’ll show that in a little bit. Transitioning a little bit, I think there are other sessions in this conference that talk about Data Mesh, so I’m not going to go deep into what a Data Mesh is. [00:17:00] But again, think of it as a way to decouple the organization and give the organization more autonomy. Different lines of business, different teams, maybe different product teams, responsible for their own data platform, their own tech stack, to be able to deliver on a data product. Now, the Lake House approach is not a replacement for a Data Mesh; Data Mesh is a great architecture and a great solution for organizations to scale. The Lake House approach actually offers you a few things. The first [00:17:30] one is that it offers you a common tech stack.

Now, whether you come up with an AWS native tech stack or a mix and match or whatever that may be, it doesn’t really matter, but it gives you a common tech stack. So you can say, when you’re building a data node, as it’s called in a Data Mesh architecture, just use this common stack. Then we all know what it is, we all know how to work with it. We understand the scalability, durability, and availability capabilities of it, so we know it’s going to scale. And we know that we don’t have to fight [00:18:00] the scale problem with different implementations. We know that it’s secure and compliant. This is extremely important in financial services, where there are a lot of compliance regulations that apply. If you start creating different tech stacks across different data domains, your security team is going to be pretty mad at you, so a common stack is going to make things a lot easier.

It’s also simpler to manage. You have a common set of skills across the organization that you can share, you can learn from each other, you can share [00:18:30] best practices. So it’s quicker to ramp up new data domains, it’s simpler to manage, and you have standard operating procedures and monitoring tools that just make it easier across the board. And ultimately it’s more cost-effective if you’re doing the same thing: economies of scale, you get better costs, and hopefully better leverage on the roadmap as well. So, one thing that comes up with Data Mesh is that it’s typically thought of as [00:19:00] this peer-to-peer type of environment. But there are many cases where federated data governance is really important.

So not just having consumers and producers talking to each other, but having a single place, and it doesn’t have to be centralized or coupled, it can still be distributed, but a single place for users to come in and say, “Where is this data set? Who do I ask for permission to this dataset? And now that I’ve asked for permission, I want to define those permissions.” So have a common security [00:19:30] feature set where we can define those permissions across users and across data sets in a single place. It makes it easier to secure, it makes it easier to audit. There’s also a common identity, right. If I’m using different data domains, maybe I want to go from one domain to another, I want to access data sets in different domains. If they all have different ways of authorizing identities, it becomes difficult.

This one uses IAM, this one uses SSO with [00:20:00] OpenID Connect, this one uses SAML. Now it becomes really difficult. So how do I do that? How do I create a common identity provider, as well as be able to classify data, tag it, and apply those tags as permissions? Lake Formation allows you to create that federated central governance engine, or a layer that sits on top of your Data Mesh and your Lake House architectures. So let’s take a quick example of what [00:20:30] this looks like. Again, starting on the left side with the data lake account [inaudible 00:20:34], we just have our buckets, but you’ll see here that in the middle we have this federated data governance account. And we also have a consumer on the right side.
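
Classification-driven permissions of the kind described here can be expressed with Lake Formation tags (LF-Tags). The sketch below defines a tag, attaches it to a table, and grants access on the tag rather than on individual tables; the tag values, names, and role ARN are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")  # run in the central governance account

# Define a classification tag and attach it to a cataloged table
lf.create_lf_tag(TagKey="classification", TagValues=["public", "confidential"])
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "curated_db", "Name": "orders"}},
    LFTags=[{"TagKey": "classification", "TagValues": ["public"]}],
)

# Grant SELECT on everything tagged classification=public, instead of table by table
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::444455556666:role/AnalystRole"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```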

So the first thing we do is we actually register those S3 locations with Lake Formation in the central governance account. And by registering those locations, we’re saying, Lake Formation, you’re responsible for managing security [00:21:00] to those buckets. We’re also going to populate the metadata, and we can use the Glue crawlers for that, or we can use other mechanisms to populate the catalog information. So now we have schemas, we have metadata information about those datasets, and we’ve registered them for Lake Formation to manage. The next thing we do is we actually share those table resources back into the data lake account, into their own Lake Formation catalog. By doing that, we’re saying, “Hey, we’re [00:21:30] managing this data, these tables, but we’re going to share them with you because we want you to manage the data underneath it. We will manage the central account, we’ll manage the security and the discoverability, but we want the data lake account to have ownership over that data.” Modify the metadata, [inaudible 00:21:47] deal with the data itself, with partitions, et cetera, et cetera.

The central account does not want to manage that. So the data lake account then, inside of Lake Formation, creates a local database where those resources [00:22:00] live, and also creates some resource links to make the connection. And then it incorporates its own data processing engine, whether it’s Glue or other services, to transform the data, enrich it, update partitions, et cetera, et cetera. And when it does that, it automatically updates the schema and the partition information in the local catalog, which then propagates to the central governance catalog. So with all that, we have [00:22:30] in the central catalog the metadata, so we understand the table information. The data lake account, the producer, is responsible for producing the data and managing it and updating it. So now users on the consumer side can come and discover that. And the way they do it is by having the central data governance account share the table with the consumer.
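
Two of the steps just described, registering the S3 location with Lake Formation in the central account and creating a resource link in the data lake account, look roughly like this. The bucket, account IDs, and database names are hypothetical, and each call would run in its own account.

```python
import boto3

# In the central governance account: register the S3 location so Lake Formation
# manages access to it
lf_central = boto3.client("lakeformation")
lf_central.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake/curated",
    UseServiceLinkedRole=True,
)

# In the data lake (producer) account: create a resource link pointing at the
# database shared from the central account, so local jobs can reference it
glue_producer = boto3.client("glue")
glue_producer.create_database(
    DatabaseInput={
        "Name": "curated_db_link",
        "TargetDatabase": {"CatalogId": "111122223333", "DatabaseName": "curated_db"},
    }
)
```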

Now that the table is shared with that consumer, they can define permissions on top of it. They create their own database and table [00:23:00] resources, so they can give it a different name if they want to. And then they can grant different principals select permissions, or write permissions if they want to, on top of that data. So now that those permissions are defined in the consumer account, the consumer can discover the data sets that are available to them and go query them. And they can bring whatever tool they want that integrates with Lake Formation to query that data. So I know this is a lot of stuff here, but the simple [00:23:30] takeaway is that you have the producer of the data on the left side, you have the consumer of the data on the right side. And in the middle, you have the governance service that catalogs the data, makes it discoverable, and allows you to share data sets between these accounts while defining fine-grained access controls.
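
On the consumer side, that typically means granting a local principal SELECT on the shared table (referenced by the central catalog's account ID) and then querying it through whatever integrated engine you prefer. A sketch with Athena and hypothetical names:

```python
import boto3

# Grant a local analyst role SELECT on the table shared from the central catalog
lf = boto3.client("lakeformation")
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::444455556666:role/AnalystRole"},
    Resource={
        "Table": {"CatalogId": "111122223333", "DatabaseName": "curated_db", "Name": "orders"}
    },
    Permissions=["SELECT"],
)

# Query through a local resource link named curated_db_link
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT order_id, amount FROM orders LIMIT 10",
    QueryExecutionContext={"Database": "curated_db_link"},
    ResultConfiguration={"OutputLocation": "s3://consumer-athena-results/"},
)
```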

And to scale it, we simply add more consumer accounts or more producer accounts, and we follow the same exact process, and everything works exactly [00:24:00] the same. So again, this is a Data Mesh model with central governance. You can also remove the central governance and have a pure peer-to-peer Data Mesh environment; the sharing will just happen between producers and consumers. All right. The next thing I wanted to do is quickly talk about how we enforce permissions with Lake Formation. You’ll see here in the middle, and I know it’s kind of hard to read, but this is the Lake Formation unified data access [00:24:30] layer. It’s a new capability, part of some of the features that we have in preview today, but here is how this works.

So the user on the left side runs a query on one of the integrated services, and we’re working with many third parties to integrate there as well. That query gets translated by the service into a read API call to the Lake Formation unified data access layer. That layer then calls the Lake Formation policy manager to check [00:25:00] for authorization and to get temporary credentials to be able to access the physical data sitting in S3. The unified data access layer actually maintains those temporary credentials and then calls S3 to read the data. Those credentials are never passed back to the service or to the user. So it gets those objects, and those objects are returned back to the unified data access layer. It then applies the security policy, so whether it’s column-level filtering or row-level or cell-level filtering, it gets applied in [00:25:30] the Lake Formation unified data access layer in a consistent way, and then the data is returned back in Apache Arrow format, which is easy for the engines to consume.

The next step is, I want to update, I want to insert, maybe I want to delete data as well. So the user, maybe through a SQL interface like Athena where you do an INSERT INTO or something like that, or through the APIs, wants to call that operation. The engine then calls [00:26:00] the write API of the unified data access layer, which then writes the data back into S3 and merges all those deltas behind the scenes. So the user doesn’t really need to know about any of it, they don’t have to touch S3, everything is done through Lake Formation. So this is one capability that gives you a consistent way to read data in a secure manner, but also to write data. Now this works great, we have customers who are working with this in preview today, and we’re working with many partners [00:26:30] to integrate with it, but there’s also another mechanism that we’re exposing for partners, to make it easier for them to integrate with Lake Formation.

This is what we call credential vending, and this is how it’s going to work. The first thing is, you’ve got to go into Lake Formation to enable vending for the partner. And this is for the customer. So if the customer says, “Hey, I don’t want to vend credentials to partner X,” they don’t configure them in Lake Formation. If they say, “Yeah, we love Dremio, we want to give them the ability to do it,” they configure that inside of Lake Formation [00:27:00] so later, if something happens, they can pull the plug and say, “No, we’re not going to give you credentials anymore.” Not that that’s ever going to happen, but you have the option. The next thing is the user runs a query. Let’s say they run a query on Dremio. Dremio then calls Lake Formation, and Lake Formation needs to verify that Dremio is the service it claims to be, based on a tag that we defined.

Once it does that, we pass back trusted service credentials to Dremio. These are a set of IAM credentials that they can use to call privileged [00:27:30] APIs inside of Lake Formation. So they do that. They understand the query. They say, “Hey, I want to read data from S3. I’m going to call Lake Formation again and say, hey, I need you to authorize user Roy to be able to query this specific table.” So Lake Formation authorizes that calling principal and returns back a set of temporary S3 credentials that are scoped down to the query. So if you’re only querying a specific table, we give you credentials only to that [00:28:00] location in S3 for that particular table, and not to everything. Dremio gets those temporary credentials and simply calls S3 using the standard GetObject APIs to read the data with those credentials. It processes the data and returns the results back to the user at the end of that query.
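
For a sense of what the vending flow looks like from an engine's point of view, here is a very rough sketch. The API name, parameters, and response fields below are assumptions based on the credential-vending capability described (it was in preview at the time of this talk), and the table ARN, bucket, and key are hypothetical; a real integration like Dremio's would do this internally using its trusted service credentials.

```python
import boto3

lf = boto3.client("lakeformation")

# Ask Lake Formation for temporary S3 credentials scoped to one table
# (API name and response fields are assumptions, as noted above)
creds = lf.get_temporary_glue_table_credentials(
    TableArn="arn:aws:glue:us-east-1:111122223333:table/curated_db/orders",
    SupportedPermissionTypes=["COLUMN_PERMISSION"],
)

# Use the scoped-down credentials to read the table's objects with plain GetObject calls
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
obj = s3.get_object(Bucket="my-data-lake", Key="curated/orders/part-00000.parquet")
```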

That’s the vending API. We have it in preview. If there are folks out there interested, we would love to talk to you about how to integrate with that. So just to [00:28:30] summarize. Data lakes really allow you to break down these data silos and really scale the volume of data. As you add more data, you want to be able to scale, and a data lake gives you that option. The Lake House approach that we recommend offers a common tech stack and a way to simplify how you deploy your data platforms across your environment, whether it’s a single account, a producer-consumer setup, a Data Mesh, whatever you choose, the Lake House approach makes it easier [00:29:00] and gives you some guidelines and guardrails for how to best do it at scale.

And then Lake Formation really simplifies the way that you build these data products in the context of a Data Mesh; it offers you the catalog, the security and the access mechanisms, as well as the sharing, to make that data accessible and self-service for your users. And that’s how AWS native services and partner solutions, Dremio and others as well, give the user the [00:29:30] ability to go use the data for analytics, machine learning, or whatever they want to do in a simple way, without having to duplicate data or create new systems or ETL processes just to serve those use cases.

So with that said again, thank you very much. I know there’s a lot of stuff here. If you want to follow up with me, you can find me on LinkedIn, I’ll be happy to connect. You can sign up for the Lake Formation feature previews, if you want to learn more about it. There’s also a couple of really good blog posts here about [00:30:00] Data Mesh on AWS. So check them out. And with that, thank you so much. And I’ll pass it back to you, Arthur.

Arthur:    Thank you, Roy. That was an informative session. I hope everybody enjoyed that. I’m going to go through some housekeeping items here real quick. The first question that’s top of mind is, how do you get access to the presentation?

Roy Hasson:    Arthur, I can make my presentation available, and if Dremio can share that out no problem.

Arthur:    [00:30:30] Yeah. Certainly. Let me do this. If you email me at this email address here, I’ll put it in the chat, and I will make sure you get a copy of this presentation. But also know that this recording will be available after the session. So real quick here, we’re going to jump to Q&A and make sure that we get your questions answered. If you’d like to join our live Q&A, go ahead and click in the upper right, share your audio and video, and you’re automatically put [00:31:00] in the queue. And obviously we have some questions through the Q&A. So I’m going to go to our first live question. Let me see if I can cue it up here, in one second. [inaudible 00:31:08], can you hear us? All right. Well, as I’m trying to figure that out, let me go through some of the Q&A here. Do you have some examples of security and compliance guardrails for a data lake?

Roy Hasson:    I’m sorry, [00:31:30] say that again.

Arthur:    The question, from Steve Dotson, is: what are some examples of the security and compliance guardrails for a data lake?

Roy Hasson:    Yeah. Some examples, and I know we don’t have a lot of time, but one example is data classification, right. Being able to define classifications for your data, automatically classify the data as it’s coming in, and then define fine-grained access controls on top of that. So classification, [00:32:00] fine-grained access controls for the data, and then being able to audit not just the classification of the data sets, but also the usage and the access patterns of your users. Those are the kinds of guardrails that exist today in Lake Formation that you can automatically define. As new data gets ingested into the lake, it gets classified, those tags get populated in Lake Formation, and then security gets applied automatically on top of that.

Arthur:    I can tee up another live question. [00:32:30] We’ll see if technology is going to be on my side at this point. Give me one second. Let’s see. Nope. It did not come through. All right, we’re having some platform issues, let’s see what other… Do you want to take one more question? We’ve got about two minutes left. You can see the Q&A, which one do you want to pick out of here, Roy? Bring it all together for us.

Roy Hasson:    There was a question that I answered from Stefan: does Lake House refer to Databricks’ Lake House concept? I think [00:33:00] that’s a really good question, because there’s some confusion around terminology. So no, and I don’t want to speak for Databricks, but Databricks’ Lake House approach really talks about decoupling the data warehouse by using Delta Lake as a way to provide data mutability and ACID transactions in S3. That’s what they mean by a Lake House.

What we mean by a Lake House is not necessarily an architecture or a technology solution. It’s an approach, a way to solve the problem where [00:33:30] you have the core data lake in the middle, supplemented by purpose-built data stores to help you analyze the data in different ways. So it’s less focused on the storage and making the storage transactional, as it is in Delta Lake, and more focused on the central data lake and how we expose the data to more purpose-built systems so users can take more advantage of it.

Arthur:    It looks like we’ve got about 60 seconds. [00:34:00] Is there another one you want to pick out of here before we wrap things up?

Roy Hasson:    For the data lake, for the Lake Formation APIs that it exposes, a data connector that is queryable by SQL [inaudible 00:34:16], what version of data? Yeah. So let me try to answer this one, Isaac. The Lake Formation API that I showed you, in particular the unified data access layer, is [00:34:30] a new layer that exposes a few APIs. The first one, think of it as a data access API that allows you to submit a SQL query. Now, it’s a data scan query, so select star from table X; it doesn’t support joins or aggregation, it’s simply a way to read the data from S3 in a managed way while applying security on top of that.

So this is a new querying API that we’re exposing that you can use. The same thing will happen for writing the data. [00:35:00] You can simply call the API and say, “I want to insert this row. Or I want to update this row.” And we will manage it all behind the scenes. So all the metadata in the catalog will be managed for you in these Governed Tables, while this unified data access layer gives you those interfaces to easily interact with the data, whether it’s from Spark, or Dremio, or just a CLI, or a Lambda function.

Arthur:    Right. We are at [00:35:30] time. Thank you everybody for taking the time this afternoon with us. Roy is going to be in the Slack channel that I’m going to repost here right now for the next 30 minutes, taking some Q&A. So please feel free to join that Slack channel and continue to engage with Roy. And again, thank you so much for your time today, everybody. I hope you have a great rest of Subsurface. Thank you.

Roy Hasson:    Yep. Thank you everyone. Appreciate the time.