May 2, 2024

Solving Large Scale Data Access Challenges with Amazon S3

As you build a data lake or shared datasets on Amazon S3, managing access is essential. You need strong guardrails that protect your data. Within your organization, you require granular access control for your data with strong controls around authentication, authorization, encryption, and auditing. Come to this session to learn about common and successful patterns for implementing your access controls at varying levels of granularity and scale to maintain tight control over your data.

Topics Covered

Governance and Management

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Aalhad Kulkarni:

And hope you’re doing good this morning or afternoon. My name is Aalhad, and I’m a senior software engineer on Amazon S3 at AWS. And today we’re going to be talking about solving large-scale data access challenges with S3. A little bit of background about myself. I’ve been with Amazon for about five and a half years. Most of my time in Amazon has been spent on security controls and making them better for our customers at S3. So I’m super excited to talk about this topic with you all today. All right, let’s get started. So a quick agenda for today is I’ll do a quick permissions overview. We’ll talk about options for granting access to your data in Amazon S3, and that’s where a majority of our conversation today is going to be aimed. We’ll do a quick introduction of a new feature, S3 Access Grants, and how that solves for your use cases. We’ll do a quick recap at the end to kind of bring everything back together.

Permissions Overview

Cool. So let’s start with permissions overview. Security is not really a new concept, right? Like, let’s say you’re an organization, you have a bunch of users that need access to some kind of data, right? So the ability for you to express the right data access on the right resources for the right users is really, really important. Let’s not talk about this portrait here. Let’s talk about an S3 bucket. This is Mr. Bucket, by the way. This is our mascot. Really cool if you’ve seen that at re:Invent last year. But we’re not going to draw him on the presentation, right? So we’ll just use a simplified version of an S3 bucket. So if you have data in AWS, most likely you’re going to store it in an Amazon S3 bucket, right? So you have a bunch of users, you have different data sets, you store them in an S3 bucket. So it’s a pretty simple, straightforward workflow. Many customers use S3 today. So how does that relate to data lakes, right? So when you have large amounts of data, by definition, you need a centralized storage location for your data lake. So you have a bunch of producers, you have a bunch of consumers trying to access data in this one location. And typically, if you’re using AWS, your data is going to be in an S3 bucket, right? So you need ways to be able to share access to your data across the board, right? So shared with users from different accounts, different companies, different organizations, or even within your accounts. There can be lots of different ways you can actually share the data.

So quick reminder, before we get into it, is that S3 is obviously private by default, right? So if you create a bucket, you’re the owner of the bucket, and you can share the data with other folks, right? But you have to do something to be able to share the data outside. A few best practices that Amazon S3 always recommends to our customers are: encrypt your data, enable Block Public Access, and disable access control lists. S3 does this by default for newly created buckets starting April of 2023. But just a quick reminder for everybody listening, to make sure we have the right best practices.

Options for Granting Access to Amazon S3

All right, so you have an S3 bucket, you want to share the data, right? So what options do you have today? And this is the meat of our talk today, right? So we’re going to talk about a lot of different concepts, a lot of different patterns. We’re going to start right at the bottom with some IAM basics. We’ll talk about the first pattern using IAM techniques and S3 prefixes, kind of a straightforward standard pattern in AWS. Talk a little bit more about S3 Access Points when you start scaling this. We’ll take a hard right, go into structured data sets and how permissions are formed with S3 and with structured data sets. And then we’ll take a hard left and go talk about an IAM session broker pattern, which works for dynamic access at scale. And then we’ll talk about S3 Access Grants as another feature to kind of help you scale access controls. So if you see, from bottom to top, I’m going from simple scale and granularity to more and more complex as you go towards the top.

So let’s start with IAM basics, right? Most of you might already be familiar with it, but I’m going to run through it very quickly just to give everybody a background in any case, right? So let’s say you have an IAM principal who is your caller and needs access to do a GetObject call to a bucket, right? So retrieve an object out of an S3 bucket. You have an IAM policy, which is on the user side. And you have a policy on the resource, which is your S3 bucket policy, right? Another quick reminder, if you have already seen a policy like this: essentially you can write an allow statement to do GetObject and PutObject on a particular resource. Now, the important part here is that the resource can be as fine grained as exactly one object, so in this case yellow/submarine.json, or you can combine objects into what we call prefixes in Amazon S3. So you can say, hey, this person has access to do GetObject and PutObject for all of the objects within the yellow prefix, right? So they get access to dandelion.jpg or submarine.json.
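As a rough illustration of the kind of policy described here, the following Python sketch builds an allow statement scoped to a single prefix. The bucket name `example-bucket` and the helper name `prefix_policy` are assumptions made for this example and do not come from the talk.

```python
import json


def prefix_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing GetObject and PutObject
    on every object under one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # The trailing /* covers every key under the prefix, for
                # example yellow/submarine.json and yellow/dandelion.jpg.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }


# "example-bucket" is a hypothetical bucket name.
policy = prefix_policy("example-bucket", "yellow")
print(json.dumps(policy, indent=2))
```

To scope the statement to exactly one object instead, the `Resource` would name the full key (for example `arn:aws:s3:::example-bucket/yellow/submarine.json`) rather than ending in a wildcard.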

So what happens in cross-account access, right? So extremely basic, you have two authorizations, right? The caller’s account needs to allow the caller access to the resource. And then the resource owner needs to grant the caller access to the resource, right? If both the authorizations succeed, you’ll get access. If either one of them denies, you’ll not get access. A simple, straightforward cross-account access scenario in AWS today, right?

What happens in the same-account situation? So let’s say you are within the same account, and your bucket owner has shared the data with somebody else using an IAM role. Both policies do come into play, the user policy and the bucket policy, but they’re authorized in one go. An allow from either one is enough to get access, as long as the other one does not deny, right? So again, a simple authorization scheme for the same-account scenario. All right, so that was basics. Now let’s start with pattern number one. So I want to share data. What do I do? So we talked about two different aspects of policies right now.
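The two authorization rules just described can be written down as a small decision function. This is a deliberately simplified model, not the real IAM evaluation logic: it assumes each policy yields exactly one of three outcomes and ignores other policy types such as service control policies, permissions boundaries, and session policies. The names `Decision` and `is_authorized` are made up for this sketch.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"   # the policy explicitly allows the request
    DENY = "deny"     # the policy explicitly denies the request
    NONE = "none"     # the policy says nothing about the request


def is_authorized(identity: Decision, resource: Decision, same_account: bool) -> bool:
    """Combine the identity-side and resource-side (bucket policy) decisions."""
    # An explicit deny on either side always wins.
    if Decision.DENY in (identity, resource):
        return False
    if same_account:
        # Same account: one allow on either side is enough.
        return Decision.ALLOW in (identity, resource)
    # Cross account: both sides must allow.
    return identity is Decision.ALLOW and resource is Decision.ALLOW


print(is_authorized(Decision.ALLOW, Decision.NONE, same_account=True))
print(is_authorized(Decision.ALLOW, Decision.NONE, same_account=False))
```

The only difference between the two cases is the last step: same-account access needs at least one allow, while cross-account access needs an allow from both the caller’s account and the resource owner.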

Resource Policies

So let’s start with the resource-based policies. So I have a bucket, I need to share access for my bucket to another user, right? So in this case, I’m going to grant this blue role access to the blue prefix of my bucket, and note that this is the same-account use case, right? If you have used bucket policies, you have heard about limitations on, let’s say, the number of statements you can have, or that the max size of the bucket policy today is about 20 KB, right? So if you have a policy like this, a simple straightforward policy, this policy says, hey, the blue role can do a GetObject on the blue prefix, right? What if you want list permissions too, right? So same permissions, it’s about 554 characters. So we went from about 250 characters for the GetObject policy to about 554 characters for get and list. So why does this matter, right? This matters because of what happens as you try to start scaling this use case. If you have a simple use case, this works really well, but if you start scaling this, let’s say you can have up to about 30 prefixes with distinct access patterns that can fit in a bucket policy, assuming 300 characters for each policy statement here. So your primary constraint now becomes the number of prefixes that fit into the policy size. So super high level, right, but as you start to scale, we’ll build up as the scale goes up to see what different options we have. But if this works for you, great, right, that’s all you need to do.
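The sizing argument is simple division, and a short sketch makes it easy to redo with your own numbers. The assumptions here are that 20 KB means 20 × 1024 bytes, that the policy is plain ASCII (one byte per character), and that every prefix costs the same number of characters. The function name `max_prefixes` is hypothetical.

```python
BUCKET_POLICY_LIMIT_BYTES = 20 * 1024  # assumed reading of "about 20 KB"


def max_prefixes(chars_per_prefix: int, limit: int = BUCKET_POLICY_LIMIT_BYTES) -> int:
    """Upper bound on how many per-prefix rule sets fit in one bucket policy."""
    return limit // chars_per_prefix


# About 554 characters per prefix for get plus list, as quoted in the talk.
print(max_prefixes(554))  # 36
# Two statements (get and list) of about 300 characters each per prefix.
print(max_prefixes(600))  # 34
```

Both estimates land in the low-to-mid thirties, which lines up with the "about 30 prefixes" figure once longer role ARNs and prefix names are taken into account.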

Let’s talk about the other side of it, which is IAM principal policies, right? So you say, okay, I can only put 20 KB worth of statements in a bucket policy, what if I don’t want to use the bucket policy, right? I can have IAM-side policies, I can create a bunch of different IAM roles. So in this case, I’ll create a blue role that has access to the blue prefix, the yellow role has access to the yellow prefix, the red role has access to the red prefix. But what if I want access to both red and yellow prefixes, right? I need to go and create another role. Let’s call it the orange role, which has access to both prefixes. So in this case, if you start doing the math, you have 1,000 IAM roles per account, and you can go up to 5,000 max. So for your number of unique permission use cases, the primary constraint moves towards the number of IAM roles. And again, note that this is for same-account use cases only. So bucket policy size, in terms of prefixes, is your limitation on the resource side, and the number of IAM roles is the limitation on the user side. So if you’re within that realm of, you know, 30 distinct prefixes, and let’s say less than 1,000 IAM roles, you can still live within those boundaries, and that works for you. But a majority of the time in a data lake scenario, you have plenty of consumers and plenty of producers and need fine-grained access controls, right?

So how do you scale that up? So as I build this up, let’s talk about S3 Access Points. S3 Access Points is a feature that essentially will start helping you out in these scenarios where you start running into these limitations. So we talked about this simple use case, right? Red, yellow, blue. What if it’s this many different data sets and different users, right? You start running out of limits, as we mentioned. So what are access points, if you’re not familiar with them? Access points act as additional endpoints on a bucket. So each access point has its own access point policy. That’s 20 KB. And you can have up to 10,000 access points. It’s a concept that is associated with a bucket. So with that, you can scale quite a bit, right? So in this case, again, a simple example, you’ll create an access point for every single one of these use cases. Each of these access points can have fine-grained policies for each of these user sets. And then on the bucket side, you can essentially have a coarse-grained policy. You can set up data perimeters that say, only allow one particular TLS version, or secure traffic only, things like that. So you can have a coarse-grained policy on the bucket, but fine-grained access controls on the access point. And then you can scale really, really well across all of these different endpoints.
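A sketch of that split, assuming hypothetical names throughout (account `111122223333`, access point `red-ap`, role `red-role`, bucket `example-bucket`): one fine-grained policy per access point, plus one coarse guardrail on the bucket that rejects traffic not sent over TLS. In practice the bucket must also permit requests that arrive through the access point, which this sketch does not show.

```python
import json


def access_point_policy(region: str, account_id: str, access_point: str,
                        role_arn: str, prefix: str) -> dict:
    """Fine-grained policy attached to a single access point."""
    ap_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["s3:GetObject", "s3:PutObject"],
            # Objects reached through an access point are addressed
            # under <access point ARN>/object/<key>.
            "Resource": f"{ap_arn}/object/{prefix}/*",
        }],
    }


def secure_transport_bucket_policy(bucket: str) -> dict:
    """Coarse-grained bucket guardrail: deny any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


ap_policy = access_point_policy(
    "us-east-1", "111122223333", "red-ap",
    "arn:aws:iam::111122223333:role/red-role", "red",
)
bucket_policy = secure_transport_bucket_policy("example-bucket")
print(json.dumps(ap_policy, indent=2))
print(json.dumps(bucket_policy, indent=2))
```

Because each access point carries its own policy document, adding a new team means adding a new access point rather than growing the single bucket policy.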

Do S3 Access Points Fit Your Use Case?

So how do you know if access points fit your use case, right? So again, this is a pattern that we’ve seen customers use. It works most of the time if your patterns are static, right? So there’s not a lot of permissions that are changing, and you have applications that are pretty static in their permissions. So let’s say you’re a team, you’re not adding too many members all the time, you’re not removing too many members all the time. It works really well in these scenarios. Also, delegation with guardrails, which we talked about, with fine-grained access policies on access points and then coarse-grained on the bucket side, can help you scale out pretty well. Some of the considerations that you have to look into when you talk about access points are, first, the discovery mechanism. I’m the red user, how do I know I need to use that red access point to talk to my red data sets in S3, right? So as an administrator, you’re going to set that up, but then you have to tell those users, hey, go via this access point. So that’s something that you think about. It’s definitely doable, right? But you have to have some kind of mapping that tells you that. And then the rate of change of required access patterns. If your access pattern starts getting dynamic, it gets a little harder: permission changes, additions to teams, removal of people from teams.

So we’ll talk about the dynamic session broker use case in a bit. But some considerations as you look at S3 access points. All right, let’s take a little bit of pivot now and go talk about another pattern of access for structured data. So S3, right, is a blob store, right? So we store bytes. S3 does not know like what the content of the object is. But like a lot of customers use S3 to store structured data, right? So tabular data, like let’s say Parquet files or CSV files that are kind of like representations of like rows and columns in a traditional like SQL like table, right? S3 does store structured data as well. 

So let’s take a quick example. Let’s say I have, you know, a CSV file here in S3. It’s just some listings, you know, there’s a seller, there’s an ID, there’s a location, latitude, longitude, and then there’s a price of the item. So something very similar to like a row and column based in a table, pretty simple SQL interface. But it’s a CSV file that’s stored in S3. Like S3 does not know what the content is. So permissions in these scenarios are often tied to the schema, right? So I’m an administrator, I can see all rows and columns. That’s pretty easy. But if I’m like some data user, call it some data user, like I have an IAM role, I can only see the price and the ID of the item. Like I don’t see the seller name, I don’t see the location of the seller. And that’s kind of like tied to the schema. So if you want to represent that with let’s say S3 and IAM policies, like how would you actually do that, right? You cannot represent that today in an IAM or S3 policy because IAM and S3 cannot filter the output of an S3 object. So this kind of pattern is very common in data lake scenarios where you have structured data sets that customers need access to, and they need to use that for analytics and machine learning or anything else, right? So how do you actually grant fine grained granular access controls on these data sets?

Lake Formation: How and Why it Works

So you talk about schema-based permissions, right? So I’m showing you an AWS Lake Formation console view. Lake Formation is an AWS service that essentially allows you to grant these kinds of permissioning schemes on the rows and columns of tabular data that is stored in S3, right? So in this case, I’m saying, hey, some data user has access only to the ID and the price, right? So how does this work? So I’m some data user, I go to Athena. Athena is another service that I can run queries on, right? So I say, hey, select star from demo listings where price is less than 140. Athena goes to Lake Formation and says, hey, this person needs access to this data, goes to the Glue Data Catalog, gets some metadata, goes to S3, gets the data. And then Athena knows everything there is to know about this user and what it has access to. So it’ll filter the query and give you what the user has access to. So a very common pattern for structured data sets: if you have structured data sets in S3, you can register them with Lake Formation and have fine-grained access control schemes with Lake Formation.
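To make the effect of schema-based permissions concrete, here is a toy, in-memory simulation in Python. It is not how Lake Formation or Athena are implemented and it calls no AWS API. The sample rows, the `COLUMN_GRANTS` table, and the `query` function are all invented for illustration; only the column names and the `price < 140` filter come from the talk’s example.

```python
import csv
import io

# Invented sample data shaped like the listings file described in the talk.
LISTINGS_CSV = """id,seller,latitude,longitude,price
1,alice,47.61,-122.33,120
2,bob,40.71,-74.01,180
3,carol,37.77,-122.42,95
"""

# Hypothetical column-level grants: principal -> columns it may see.
COLUMN_GRANTS = {
    "admin": {"id", "seller", "latitude", "longitude", "price"},
    "some-data-user": {"id", "price"},
}


def query(principal: str, max_price: float) -> list:
    """Simulate 'SELECT * FROM listings WHERE price < max_price',
    returning only the columns the principal has been granted."""
    allowed = COLUMN_GRANTS[principal]
    reader = csv.DictReader(io.StringIO(LISTINGS_CSV))
    return [
        {col: val for col, val in row.items() if col in allowed}
        for row in reader
        if float(row["price"]) < max_price
    ]


print(query("some-data-user", 140))
print(query("admin", 140))
```

The same query returns different shapes for different principals: the data user sees only `id` and `price`, while the administrator sees every column. That filtering has to happen in a query engine that understands the schema, which is why it cannot be expressed in an IAM or S3 policy.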

Session Broker Pattern

Okay, the next pattern that we’re going to talk about is a session broker pattern, right? So session brokers are very similar to token vending machines. If you’ve heard about token vending machines, they basically vend tokens, right, for accessing resources. So a client goes to a token vending machine, says, hey, I need access to this resource. The token vending machine authenticates and authorizes that client and provides a temporary token. And then the client uses the temporary token to access the resource. Very straightforward architecture. In AWS, we can actually do this by using an IAM session broker. And in this case, the example I’m showing is a do-it-yourself use case, right? So let’s say I have built a service, which is an IAM session broker. I’m a client, I go to the session broker, say, hey, I need access to this S3 bucket. The session broker will authenticate you, authorize you, and go to AWS STS, which is going to provide you with temporary credentials, right? And then you can use those credentials to talk to S3. So how do I implement this, right? So if I’m building this out as a session broker, the session broker needs to authenticate the caller, right? Then authorize the caller. So should this caller be able to access the S3 location? If yes, go to STS, get some set of temporary credentials. You can scope the temporary credentials down to exactly what you want, right? And then return them to the client. The client then uses them to sign requests to S3.
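The broker steps above can be sketched in a few lines of Python. This is a minimal sketch, not a production design: it assumes the caller has already been authenticated, takes the authorization check as a caller-supplied function, and expects an STS client (for example `boto3.client("sts")`) to be passed in. The function names and the role ARN in the usage comment are hypothetical. The `assume_role` call and its `Policy` parameter (a session policy that scopes the credentials down) are part of the real STS API.

```python
import json


def scoped_session_policy(bucket: str, prefix: str) -> dict:
    """Session policy limiting the vended credentials to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        }],
    }


def vend_credentials(sts_client, role_arn, caller_id, bucket, prefix, is_authorized):
    """Authorize an already-authenticated caller, then ask STS for
    temporary credentials scoped down to the requested prefix."""
    if not is_authorized(caller_id, bucket, prefix):
        raise PermissionError(f"{caller_id} may not access s3://{bucket}/{prefix}/")
    response = sts_client.assume_role(
        RoleArn=role_arn,                 # a broad role owned by the broker
        RoleSessionName=caller_id,        # ties the session back to the caller
        Policy=json.dumps(scoped_session_policy(bucket, prefix)),
        DurationSeconds=900,              # shortest lifetime STS accepts
    )
    return response["Credentials"]


# Example wiring (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   creds = vend_credentials(
#       boto3.client("sts"),
#       "arn:aws:iam::111122223333:role/broker-data-role",  # hypothetical role
#       "alice", "example-bucket", "blue",
#       is_authorized=lambda caller, bucket, prefix: prefix == "blue",
#   )
```

The effective permissions of the returned credentials are the intersection of the role’s own policies and the session policy, so a single broad role can serve many narrowly scoped sessions. That is what keeps the number of static IAM resources small.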

So why does this pattern scale really well, right? The primary reason is that the STS service is built as a regional, high-volume service, right? So it can scale at really high TPS. You can use S3 for what it does best, right? Scale and data throughput. You can have a minimum number of static IAM resources. So you don’t have to worry about a lot of these different mappings. You don’t have to worry about IAM policy limitations or bucket policy limitations or running into limitations with the number of roles that you create, because your session broker component is doing the work of a token vending machine. And lastly, the session broker scales with the number of sessions and not the data or even the object requests. So you can make a request to a session broker once, get temporary credentials, and make a million requests to S3 out of that. And S3 would support it.

S3 Access Grants

So what if you don’t really want to build this yourself, right? And that’s where we talk about S3 Access Grants. So S3 Access Grants is a feature that is essentially a session broker, a managed session broker within AWS. Before we go into a little bit of detail, I want to give a reminder of why we actually built this, right? So managing bucket policies is hard, and you run into bucket policy limitations. We talked about IAM policy or role limitations as well. That’s challenge number one. Challenge number two is mapping users to data. Most of the time, our customers’ users are Active Directory users. So I wake up in the morning, I log on to my Amazon laptop as you know,, right? I need access to an application, let’s say, that talks to an S3 bucket, but I have to map that to an IAM role inside AWS that has access to the S3 bucket, right? So the more users you have, the more groups you have, you have to have the same kind of mapping in AWS for S3 to be able to understand and give you access. Most of the time, you don’t really want to do that. So that’s challenge number two.

Challenge number three is auditing access, right? So S3 gets the request, and you use AWS CloudTrail to log your event. Most of the time, it reflects the IAM identity and not the user identity. So it won’t really say who the user is, it will say my IAM role within a particular AWS account. So that’s challenge number three. And those are the three challenges that we were looking into, essentially, when building out a dynamic session broker, which is S3 Access Grants. So let’s start with introducing what the feature is, and then I’ll give you an example about how it works. So we have two things that Access Grants does. The first thing is management of grants, to say, hey, who has access to what. It can scale up to 100,000 grants, which is the soft limit today, and give you a high level of scale. And the second thing is it’ll act as a session broker and provide you temporary credentials that have scoped-down permissions for accessing the data. We also have integrations with different AWS services like EMR, you can run EMR jobs on.

But let’s go into a quick detail about how to use S3 Access Grants. So I’m showing a console view, it’s pretty straightforward. The first part is you have a bunch of grants here, who has access to what. There’s a bunch of S3 prefixes, and you have a simplified access level, which is read, write, or read/write. And then if you notice the grantee, we support two types of grantees, right? We support IAM principals, of course. And then we also support directory identities, which are your end user identities in your Active Directory. And we’ll talk about how we can get into that and which AWS service you can use to integrate that. Okay, so just to show you an example. So we have a grant scope: the blue role has been given access to the blue prefix, and an end user identity group in your Active Directory, called the yellow project group, has been given access to the yellow prefix. So once you have the grants, how does this work, right? So again, very similar to a session broker pattern. So let’s talk about IAM principals first. You go to the data application, you go to S3 Access Grants to get temporary credentials, and you talk to the S3 bucket. So this is my blue group role. I can see this from the who-am-I API in AWS. I go to S3 Access Grants and call GetDataAccess, saying I want permissions for this blue/data.csv, I want read permissions. Access Grants will match your request and give you temporary credentials. You’ll also see that you get temporary credentials for all the data under the blue prefix, because you have access to that. And then you can use those credentials to talk to your S3 bucket. Pretty straightforward.
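From the client’s side, that request can be sketched as below. This assumes the boto3 `s3control` client’s `get_data_access` operation with its `AccountId`, `Target`, and `Permission` parameters, and a response carrying `Credentials` and `MatchedGrantTarget`. Check the current boto3 documentation before relying on the exact shapes. The account ID, bucket name, and the wrapper function `get_read_credentials` are made up for this example.

```python
def get_read_credentials(s3control_client, account_id: str, target: str):
    """Ask S3 Access Grants for temporary read credentials on an S3 location.

    Returns the temporary credentials and the grant scope that matched,
    which may be broader than the single object that was requested.
    """
    response = s3control_client.get_data_access(
        AccountId=account_id,   # account that owns the Access Grants instance
        Target=target,          # e.g. "s3://example-bucket/blue/data.csv"
        Permission="READ",      # READ, WRITE, or READWRITE
    )
    return response["Credentials"], response["MatchedGrantTarget"]


# Example wiring (needs boto3 and AWS credentials, so it is left as a comment):
#   import boto3
#   creds, matched = get_read_credentials(
#       boto3.client("s3control"), "111122223333",
#       "s3://example-bucket/blue/data.csv",
#   )
#   s3 = boto3.client(
#       "s3",
#       aws_access_key_id=creds["AccessKeyId"],
#       aws_secret_access_key=creds["SecretAccessKey"],
#       aws_session_token=creds["SessionToken"],
#   )
#   s3.get_object(Bucket="example-bucket", Key="blue/data.csv")
```

Because the matched grant covers the whole prefix, the application can cache the credentials and reuse them for other objects under `blue/` until they expire, instead of asking for new ones on every object.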

Data Users

We’ll talk about authenticated data users now. So I was talking about end user identities as your Active Directory users, right? So your authenticated data users can be registered with AWS by using a service called AWS IAM Identity Center. So you have Active Directory users and groups in your organization, and you sync them into your Identity Center. And now AWS knows about them. So you can associate that with your S3 Access Grants. And basically, S3 Access Grants now understands your end user identities. So you can add grants for your end user identities. Once you add grants for your end user identities, the data path for your application remains the same session broker use case. So you log on to your data application, you go authenticate, you go to IAM Identity Center, you get a token to talk to S3 Access Grants, and you go to S3 Access Grants and say, hey, I need access to this particular resource in S3, using what we call an identity-bearing role. And then you can use those temporary credentials to talk to the S3 bucket. So that way, S3 Access Grants now supports end user identities as well.

We also simplify the audit part of this, which is challenge number three here, right? So you’ll notice here in the CloudTrail event that we have something called an onBehalfOf field, which now carries the original user identity. Some more basic information: it’s a regional feature, and grants are for buckets in the same region and same account. S3 Access Grants also supports cross-account use cases, using resource policies on the Access Grants instance, or AWS Resource Access Manager as well.

S3 Access Grants and Other AWS Services

So a quick overview here of other AWS services and Access Grants. So start thinking about, hey, how do I think about different use cases? And what can I use for which use case, right? So we draw the line saying, hey, S3 Access Grants is your unstructured data access use case, and Lake Formation is your structured data access use case. So in, let’s say, an EMR use case, if you do spark.read.parquet, you’re reading unstructured data, so EMR will go to Access Grants, get credentials, and talk to the S3 bucket. If you do spark.read.table, EMR will go to Lake Formation instead and then go talk to S3. That’s all for S3 Access Grants.

Then just to recap what we talked about today, we went through a lot of different concepts, right? And basically, if you want to share your data, you have a lot of different options. You can use IAM, S3 Access Points, or session brokers, you can use S3 Access Grants for unstructured data, and for structured data you can go to AWS Lake Formation. So a bunch of options as you grow in granularity and scale. And that’s all I have today. And please feel free to reach out to my email address if you have any questions. I know we covered a lot of different concepts in detail today. But I hope that you enjoyed the talk, and I can take any questions if you have any.