Making Cloud Big Data Platforms Open and Secure with Dynamic Data Authorization

Session Abstract

Cloud big data platforms like Amazon EMR are popular because they offer tremendous flexibility to use open source frameworks like Apache Spark, Apache Hive and Presto, and they efficiently provision compute resources as needed. Because organizations do not have to invest time and money in their own infrastructure, they can greatly reduce computing costs and accelerate time-to-answer. There’s a catch, however. With openness and flexibility comes the responsibility to handle confidential, personally identifiable and regulated data responsibly. Simply evaluating how to enforce data authorization policies across a variety of user activities on open platforms can be time-consuming, and for that reason some organizations do not even consider cloud compute for big data when working with sensitive data. Now you can.

Video Transcript

Itay Neeman [00:00:05] Hello everyone. It’s great to have a chance to speak with you. So today we’re going to talk about these big data platforms and how to keep them open and secure with dynamic data authorization. First, just a little bit about me. I lead the engineering team here at Okera, I’ve been here for a few years, and I’ve been in the data and big data space for many years now. I spent many years at Splunk building that capability out and helping scale it from a daily ingest of terabytes to petabytes, and then spent some time at Microsoft as well. And of course you can reach me any time, my email is just itay@okera.com. And then just a little bit about Okera. We help some of the world’s largest organizations keep their data secure and let them access the data that’s valuable to them while still complying with the regulations and any other security requirements that they have. And we’ll talk a little bit about the learnings that we’ve had working with some of the world’s largest organizations.

Itay Neeman [00:00:59] So what are we going to cover? When is this talk going to be most useful to you? First of all, if you have a sizable data lake. It doesn’t have to be hundreds of petabytes, but you probably have some data you want to protect. You want to be able to access that data with multiple engines on top of it, so maybe that’s Dremio and Spark, and maybe you’re using Snowflake as well, all these different types of tools. And you need different kinds of policies to authorize access to that data – and we’ll talk about what some of those are – and any other kind of sensitive data that you want to protect. So let’s talk about a real-world example. Take FINRA. FINRA is the Financial Industry Regulatory Authority. All the trades, all of the transactions that happen in the financial industry need to be regulated, and FINRA is the self-regulatory body for the industry. And they’re probably one of the largest data lake users in the world. Their lake is in the hundreds of petabytes, they’re using S3, and they run over 150,000 nodes of compute every day, which is just crazy if you think about it. And in terms of the workloads that they’re running on top of it, they’re looking at hundreds of billions of market events a day, and they’re doing this for things like fraud and abuse and insider trading, looking at all those transactions. So obviously, as they’re looking at this data, they need to make sure that on one hand they can scale to the level that they need, and that they’re able to use all the different types of tools to run these workloads at that scale, but at the same time do so securely: ensure that people are not accessing sensitive data that they shouldn’t, and abide by the regulations they’re subject to. Obviously, since it’s financial data, it can be very sensitive as well. And so companies like FINRA and others use what we call these open compute clusters. You can think of the cloud vendor offerings here, so this would be things like EMR on AWS, HDInsight on Azure, Dataproc on Google Cloud. And then there are ones that are not associated with the cloud vendors as well. So this might be just running Spark on top of Kubernetes, or using Domino Data Lab, or using SageMaker on AWS. And the goal here essentially is you have one cluster where you can run multiple compute engines, so you might have one cluster that’s running both Spark and Presto, as an example, or Spark and Dremio, and so on, and your goal is to be able to use the same underlying infrastructure to run different kinds of analyses, as well as be able to spin up different clusters for different needs. Right. So you might have one cluster for interactive and multiple clusters for batch and so on. So we’ll walk through some of these examples. And the reason that this is useful is that it helps you separate your notion of storage and compute. This is the innovation that birthed the modern data lake ecosystem. So you might have a data lake, say, that’s on S3 or Google Cloud Storage or ADLS on Azure, and you might have different clusters accessing the same data. You have one cluster, for example, that might be for a particular weekly batch job that takes a day to run, runs on the weekend, and computes some funnel analysis or anything like that; maybe that’s running on Spark.
They might have another cluster that’s used for interactive analytics, where users are just submitting SQL queries ad hoc to look at the data. And these different clusters are looking at the same data. Right. So it’s engine-agnostic, and you can leverage any innovation that’s coming out in the ecosystem as well as evolve what you’re using. So maybe today you’re using Spark and Presto, tomorrow you add another engine, and next you want to do some data science on top of it. You get the flexibility to go and do all of that, and you have multiple compute engines that are looking at the same dataset. And you have different access patterns running on the same infrastructure. If you think about this batch vs. interactive workload, maybe the batch job is submitted by Airflow and runs without any human interactivity, and that’s fine. But on the ad-hoc cluster, you might allow users to SSH into the machine to submit a PySpark job, or they’re going to use the Presto CLI to go and run some query, or any other kind of access where they can submit a custom script to run on the node in a distributed fashion. And this is needed because the business has all kinds of ways that it needs to run these analyses, and users are going to be comfortable with different things: some of them with end-user BI tools, some of them are going to want lower-level SSH access, some of them are going to want to submit custom code. And you want to be able to support all of these things on the same infrastructure without saying, “hey, each use case needs its own dedicated infrastructure with its own security.” And that’s really where the problem starts coming in. And obviously, some of the benefits that you’re getting here: you get massive scalability. I talked about FINRA as an example; they run hundreds of thousands of nodes a day, so you can scale this up and down as you need to and get that elasticity, which is both automated and easy. If you don’t need a lot of capacity, you can just scale it down. If you need more capacity, you can go and scale it up. You don’t need to pre-configure or pre-purchase these things; it’s very easy for you to do. And you’re leveraging market pricing to do all of this, but at the same time you also get savings by running multiple of these engines on the same infrastructure. Right. So you might have one cluster, say, one hundred nodes, that allows you to run both Spark jobs and Presto jobs at the same time without needing to say, hey, I need a one-hundred-node cluster for each of them, which would obviously double your cost. And all of this allows you to focus your attention and resources on things that are important to you, like the analyses you’re running, rather than on how to run data infrastructure at scale. And then obviously the common use cases are machine learning, ETL and so on, and these are the things that all of you listening to this talk are doing every day, that the organizations we work with are doing every day. These are the things that can actually provide value to the business.
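
To make that storage/compute separation concrete, here is a minimal, hypothetical sketch of two jobs, say a weekly batch aggregation and an ad-hoc interactive query, that would run on separate clusters but read the same S3-backed dataset. The bucket, paths, and column names are illustrative assumptions, not anything from the talk.

```python
# A minimal sketch of the storage/compute separation described above. In
# practice these would be two separate scripts running on two separate
# clusters; they are shown together only to highlight that both read the
# same S3-backed dataset. The bucket, paths, and columns are hypothetical.
from pyspark.sql import SparkSession

SHARED_DATASET = "s3://example-data-lake/sales_transactions/"  # assumed path

# "Cluster A": a weekly batch job that computes an aggregate and writes it out.
batch_spark = SparkSession.builder.appName("weekly-funnel-batch").getOrCreate()
transactions = batch_spark.read.parquet(SHARED_DATASET)
(transactions
    .groupBy("region")
    .count()
    .write.mode("overwrite")
    .parquet("s3://example-data-lake/reports/weekly_funnel/"))

# "Cluster B": ad-hoc interactive analysis over the exact same data.
adhoc_spark = SparkSession.builder.appName("adhoc-analysis").getOrCreate()
adhoc_spark.read.parquet(SHARED_DATASET).createOrReplaceTempView("sales")
adhoc_spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
```

Because the data lives in object storage rather than on the cluster, either cluster can be resized or torn down without affecting the other.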

Itay Neeman [00:07:21] OK, so we went over what these open compute platforms are, what they’re used for, this need to separate compute and storage to be able to run multiple types of access patterns, and so on. But what are the problems that people are running into? Well, what we found is people run into two core challenges. One is, often people have access to data they shouldn’t have access to. And at the same time, they actually don’t have access to data that they need. And that might sound contradictory, but it turns out that this can be very messy. And obviously, when we say “people” here, that can be a person running an interactive job, or a machine doing some automation, or an algorithm, like an ML algorithm doing training. And when we talk about sensitive data, this might be data that’s confidential for business reasons, it might be PII, or any other kind of regulated data that you can’t share for regulatory reasons. And our goal is to flip this: rather than a challenge, you want to turn this into a strength. You want to say that people can’t retrieve data they shouldn’t have access to, and people can access the data that they need for legitimate business purposes. And so how do we go and do that? The core of this is really applying the concept of fine-grained access control. So just to recap, coarse-grained access control would be: the Bob user either has access to the dataset or they don’t. Fine-grained access control allows you to do things like column-level, row-level and cell-level enforcement. So what are examples of this? At the column level, you might say Bob has access to the sales transactions table, but any column that is marked as sensitive, like Social Security number or address, should be masked, so replaced with Xs. That would be an example of a column-level policy. Row-level would be: you need to filter out data that the user shouldn’t see. Maybe for Bob that means he should only see data that’s coming from his region; he’s in the US, so he shouldn’t see any data from Europe, to meet data residency requirements. And similarly, for cell-level, it might be that you need to take into account other cell information, so it might be that Bob can only see data belonging to his org, and which org a row belongs to is in another column. So for data that’s from another org, he can still see the rows, but the sensitive data is masked. That’s an example of cell-level logic. And if you think back to the FINRA case from before, you can think of how this might appear. Maybe an analyst doesn’t need information about which financial institution a particular transaction came from, and so maybe that column is going to be masked. Or maybe they should only see data for a particular financial institution, so you use row-level enforcement, and so on. And the way that we want to achieve this fine-grained access control is what we call universal data authorization. What we’ve seen is that the organizations we work with that use this construct have been the most successful in actually allowing their users to get access to the data that they need. So the goal here is that you’re dynamically authorizing data access when a user is asking for it. And this is part of your overall data governance practice. And it has three core capabilities.
First is universal policy management, so you have one place where you define your policies and you don’t need to define them multiple times for the different ways that access happens. Second, you’re doing the policy decisioning on demand; that’s the dynamic part. You’re not making the decision once, storing it somewhere, and then reading it back. Every time someone accesses the data, you make a decision: can this user see it? What type of fine-grained access control do we need to apply? And third, you get intelligence about how people are using that data as the data access is happening. Obviously, this is going to be accessed from multiple places: BI tools, scripts, batch jobs, and so on. And it’s agnostic to the data platforms, so it doesn’t matter if you’re running this on Spark or Presto or a Python script or anything like that. In order to do that, there need to be multiple enforcement patterns. There’s no one-size-fits-all here; you want it to apply universally, obviously, and so there are different ways to go about that. For today we’re going to focus specifically on applying that fine-grained access control, FGAC, in these open big data frameworks.
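
To make the column-, row-, and cell-level examples above concrete, here is a minimal, hypothetical sketch in PySpark of what the enforced result for the “Bob” user could look like. The table, the column names (ssn, region, org, amount), and Bob’s attributes are assumptions for illustration; a real policy engine would generate equivalent logic dynamically from centrally managed policies rather than hard-coding it per user.

```python
# A hypothetical sketch of the three fine-grained controls described above,
# expressed as Spark transformations. Column names and user attributes are
# made up; this is not any vendor's actual enforcement code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fgac-sketch").getOrCreate()
transactions = spark.read.parquet("s3://example-data-lake/sales_transactions/")

user_region, user_org = "US", "retail-banking"  # attributes for the "Bob" user

secured = (
    transactions
    # Row-level: Bob only sees rows from his own region.
    .filter(F.col("region") == user_region)
    # Column-level: sensitive columns are always masked for Bob.
    .withColumn("ssn", F.lit("XXX-XX-XXXX"))
    # Cell-level: the amount is visible only for rows belonging to Bob's org.
    .withColumn(
        "amount",
        F.when(F.col("org") == user_org, F.col("amount")).otherwise(F.lit(None)),
    )
)
secured.show()
```

The point is that the same underlying table is presented differently to different users, without making physical copies of the data.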

Itay Neeman [00:11:32] And so what does this look like from a broad architectural perspective? On the right-hand side we have our end users: data scientists, analysts, application users. And they’re using all types of tools: big data tools, BI tools, programmatic tools like REST API scripts, and so on. And they’re accessing data sitting in various places: object stores like S3, data warehouses like Snowflake, RDBMSes like Postgres, and so on. The goal here is that as they run a query or some other type of data access, it goes to this universal data authorization platform, which checks against the policies what they’re allowed to do, makes a dynamic decision – for this user, what permissions do they have on this dataset – and provides the enforcement back to happen dynamically, so maybe that’s masking, row filtering, and so on. And then it stores the intelligence about what access happened, what sensitive data was accessed, and so on. These are obviously connected to things like your business catalogs, like Collibra or Alation, the technical metadata stores like the Hive metastore or Glue, as well as tied to your identity and authentication systems like Okta, Azure AD, Active Directory, and so on. And if we start zooming in on this architecture, you have this notion that enforcing these policies is not so easy, right? Because if users are accessing something from a script, that might look different from enforcing it when they’re accessing from a big data tool or from a BI integration and so on. So let’s take one example. Say you’re using something like Dremio or Presto, and you just want to say: hey, I’m going to submit a BI query, I’m the Bob user, and when I submit that query I want to ensure that I’m only accessing data that I’m allowed to access. So that query is coming in from the BI side, it should be enforced, and it’s being sent down to storage to actually get the data and then execute. And when we’re really dealing just with SQL, it’s easy to think of this as a logical rewrite, right? You have the original SQL, and then you rewrite that SQL to include the policy: masking, row filtering, any other transformations that need to happen. Now, that SQL transformation might happen by changing the query plan or by actually physically rewriting the SQL; there are multiple ways to do it. But this is a relatively easy way to think about how to actually do enforcement. And many other tools, for example Ranger and others, this is what they do. They basically tell the tool, like Spark, Presto, Dremio, etc., here’s how you can go and rewrite the query to do that. And for some cases, that works really well. If you’re just dealing with SQL and that’s your only entry point, that’s a good place to start. But what do you do when your engine doesn’t know how to enforce a policy that you have, maybe a sophisticated policy like differential privacy? Or what happens if the storage system doesn’t support the data model that you need? Or if you don’t actually trust the compute engine: what if somebody could use it to try and access data that they shouldn’t? These problems are actually quite prevalent on these open compute platforms.
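
As a hypothetical illustration of that logical rewrite, here is what a policy-aware rewrite of a simple query for the “Bob” user might look like. The table and column names are assumed, and real enforcement may operate on the query plan rather than on the SQL text, as noted above.

```python
# A hypothetical illustration of the "logical rewrite" enforcement pattern:
# the user's original SQL is rewritten so masking and row filtering are baked
# into the query before it ever reaches storage. Names are assumed.
original_query = """
    SELECT customer_name, ssn, region, amount
    FROM sales_transactions
"""

# What a policy-aware rewrite for the "Bob" user might produce:
rewritten_query = """
    SELECT
        customer_name,
        'XXX-XX-XXXX' AS ssn,             -- column-level masking
        region,
        amount
    FROM sales_transactions
    WHERE region = 'US'                   -- row-level filtering
"""

# Either string can then be handed to the engine as usual, for example:
# spark.sql(rewritten_query).show()
```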

Itay Neeman [00:14:42] So let’s think about this question: can you trust the client? Again, let’s take EMR as an example. On EMR, you have Presto, Spark, Hive, all running here. And this is all well and good when you’re just working with SQL, because you just submit a SQL query to Spark, Spark executes the query by accessing data in S3, and Spark leverages the cluster-level IAM access that it has to get to the data. But the problem is: what happens if you allow your users to SSH into this cluster? Well, now they can use that cluster-level access as well, right? IAM is very coarse-grained; it doesn’t really differentiate between different users, and it doesn’t allow you to do fine-grained access control. You really have policies that say you have access to everything in this bucket or everything under this path. And similarly, even if you don’t allow them to SSH in but just allow them to run custom code, how do you ensure that that custom code, that PySpark job or even a custom UDF that they can install, doesn’t access S3 directly and thereby circumvent the policies that you’ve applied at the SQL level? And really what is happening here is you’re taking a cluster and saying, hey, we trust this cluster to not do anything wrong, even though the users could do a wrong thing, whether maliciously or accidentally, if they wanted to. And that’s not a good place to be, because you’re now relying on people to do the right thing, and your security can very easily be circumvented. And so if you think of an organization like FINRA from the beginning, that’s really a problem, because they want to allow people to access these clusters and this compute in all kinds of different ways, including SSH, but if they do that, they can’t know that people aren’t accessing the data directly. So they end up having to limit what users can do, which limits the flexibility that they’re offering to those users.
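
To show why that trust assumption is fragile, here is a small, hypothetical sketch of the bypass being described: any code running on a cluster node, whether an SSH session, a PySpark job, or a UDF, can pick up the node’s instance-profile credentials and read the raw objects from S3, skipping whatever masking or row filtering was applied at the SQL layer. The bucket and key names are made up for illustration.

```python
# Sketch of the circumvention risk: code on a cluster node quietly inherits
# the cluster-level IAM role and reads S3 directly, with no policy applied.
# Bucket and key are hypothetical.
import boto3

s3 = boto3.client("s3")  # picks up the node's instance-profile credentials

obj = s3.get_object(
    Bucket="example-data-lake",
    Key="sales_transactions/part-00000.parquet",
)
raw_bytes = obj["Body"].read()  # the unmasked, unfiltered data
print(f"read {len(raw_bytes)} bytes of raw data, no policy applied")
```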

Itay Neeman [00:16:36] OK, so what do we want? We have our magic security wand here on the slides, so if we go and wave that security wand around, what are the things that are important to us? We think about this in three ways. The first is security. We want to take the zero trust approach: we don’t want to trust either the user or the compute engine to only do the right thing. We want to take into account that maybe it will do the wrong thing, whether, again, maliciously or accidentally. Second, we need to make sure that this works at scale. If we think back to the FINRA scale, 150,000 nodes, how do we ensure that when we add the security, we’re not ruining the scalability of the system? Because that’s one of the main selling points. And finally, we want simplicity. We want to make sure that we don’t need to manage these policies in all kinds of different places. So let’s zoom in on one node. Before, we had this IAM access at the cluster or node level, and all of these tools were accessing the storage directly. That’s insecure for all the reasons we mentioned – people can circumvent it, they can access the node directly or run custom code – but we want to change this to be secure. So now we’re saying there’s no IAM access at the cluster and node level, and we’ve introduced this out-of-process, secure enclave for data access that gets ephemeral access to storage when needed, and we’ll talk about how that happens. And now, instead of Spark or Hive or Presto or your SSH script accessing data directly from S3, it’s accessing it through the secure enclave, which does the actual access. And this enclave, in short, is what provides the fine-grained access control. Zooming back out to the bigger picture, we see that every node essentially has this secure area. And what happens is, whenever a query or data access request comes in, it goes to that policy enforcement engine that we talked about before and gets authorized, so it’ll be told: yes, you can go and run the query, but you need to do this row filtering and this data transformation. That comes back as an authorization, and it gives you time-limited, ephemeral access to storage. And that access is only usable by the secure data access enclave. So you don’t need to trust that Spark won’t use this access to go read the data directly, because it can’t: the access is encrypted and Spark doesn’t have it. Only this secured area has it, and it uses process isolation to ensure the enclave can’t be broken into. And at the same time, it actually scales automatically with your cluster, because you’re running this on the same infrastructure, on the same hardware the rest of your compute is running on. So the marginal cost of adding this security is essentially zero, but you’re significantly increasing your security posture. And now we’ve nearly eliminated the need to manage IAM at all, because all of the policies that we have only need to be defined inside the universal policy engine and policy store that we talked about as part of universal data authorization. And so if a user were to SSH into this cluster, they couldn’t do anything. They couldn’t access the data directly. They would only be able to do so when they ran a query that went through the normal data authorization, got authorized, and then the access would happen.
And so you’ve kind of solved this problem of different patterns needing to access the data in different ways.
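
One way to picture that time-limited, ephemeral access, purely as an illustration and not as a description of any specific vendor’s implementation, is short-lived, down-scoped credentials minted per authorized request, which only the isolated enforcement process ever sees. The role ARN, bucket, and prefix below are hypothetical.

```python
# Sketch of ephemeral, narrowly scoped storage access using AWS STS. This is
# only an analogy for the "time-limited access" idea from the talk; the role,
# bucket, and prefix are made up.
import json
import boto3

sts = boto3.client("sts")

scoped_creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/example-data-access",  # assumed
    RoleSessionName="bob-query-42",
    DurationSeconds=900,  # credentials expire after 15 minutes
    Policy=json.dumps({   # further restrict to exactly what this query needs
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-data-lake/sales_transactions/*",
        }],
    }),
)["Credentials"]

# Only the isolated enforcement process would hold these; Spark, Presto, and
# any custom code on the node never see them, so they cannot read S3 directly.
s3 = boto3.client(
    "s3",
    aws_access_key_id=scoped_creds["AccessKeyId"],
    aws_secret_access_key=scoped_creds["SecretAccessKey"],
    aws_session_token=scoped_creds["SessionToken"],
)
```

Because only the enforcement process holds the credentials and they expire quickly, a user who SSHes into the node has nothing durable to steal.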

Itay Neeman [00:19:59] So, tying all of that together, here are the important takeaways. You want to have universal data authorization; that part is critical so that you can provide secure access to the data that your users need. That’s what allows us to solve the initial challenge we laid out at the start. And the existing solutions that people use, like what folks do with Ranger as an example, are insufficient, because they really rely on the engine to do the enforcement for you. But maybe the engine doesn’t support the type of enforcement that you want to do, or you don’t trust the engine because it isn’t running in a totally secured environment, and you might have different patterns of accessing that data. So you need a mechanism that separates the secure access out into a secured layer, so that the access can’t be circumvented. And finally, you want to provide that flexibility to your users: if a user needs to access something over SSH, or by submitting custom code, or just by writing a SQL query, all of these are valid, and you want to support all of them.

Itay Neeman [00:21:09] So that’s it for our talk today. I’ll see if there are questions, both ones people want to ask over voice as well as over the chat; I’ll look at the Q&A and see the questions. Great, so one question that came up was “is metadata driven agent a plugin-based solution where possible access points are covered?” It’s a great question. I will say a plugin-based approach is definitely part of these different enforcement patterns that we do at Okera, and it is necessary to support these different types of tools, but it’s actually not sufficient for all access patterns. First of all, not every tool is going to support plugins. An example is query-as-a-service tools like Athena or BigQuery, where you can’t go and install a plugin. So you want to ensure that you have enforcement for those patterns too. And certainly metadata is a critical part of it, right? That’s part of this policy store and policy engine; they need to have the metadata for the system. And then there’s another question: does the enclave work when Spark uses its cache to answer the query? That’s a great question. I will say that caching and security have some tension between them. Typically, security can make caching somewhat problematic, because you want to ensure that you’re caching things at the appropriate level, which is the user level, and not using one cache to answer questions for multiple users; obviously, if you have row filtering or data masking, different users might be allowed to see different things. I’ll say the compute engines are becoming more adept at this, at being able to use caches for individual users and keeping that context, but this is still a real consideration. But it’s a great question. OK, any other questions? I’ll see here. And then, of course, if you think of a question later, or just want to chat, I’m available on Slack, on the Subsurface Slack, and we’ll have the speaker channel to answer questions later and I’ll be there too.
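
As a small, hypothetical sketch of that per-user caching point, nothing more than an illustration of the idea, a result cache can be keyed by the user and the policy version as well as the query text, so one user’s policy-filtered result is never served to another user.

```python
# Illustrative per-user result cache; all names are made up.
result_cache = {}

def run_with_cache(user, query, policy_version, execute):
    """execute is a callable that runs `query` with `user`'s policies applied."""
    # Keying by (user, query, policy_version) keeps Bob's masked/filtered
    # results separate from everyone else's and invalidates on policy change.
    cache_key = (user, query, policy_version)
    if cache_key not in result_cache:
        result_cache[cache_key] = execute(user, query)
    return result_cache[cache_key]
```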

[00:23:27] Oh, great. All right, well, thank you so much, Itay.

Itay Neeman [00:23:29] Great, thank you.

[00:23:32] Thank you. And again, please feel free to join the Subsurface Slack community and search for Itay’s dedicated channel if you want to ask him questions directly. And be sure to visit the Expo Hall as well to check out the booths, get some demos and win some great giveaways. Thanks, everyone.

Itay Neeman [00:23:50] And we have the Okera channel and booth as well. So please stop there too for more questions specifically about Okera.

[00:23:55] Fantastic. Well, thanks so much, everyone. Hope you enjoy the rest of the conference. Oh, wait there. We have more questions. Oh.

Itay Neeman [00:24:06] OK. Sorry, I just saw them come in. Yeah. So I think, Sudhakar, the question was about some of the major customers for Okera; certainly feel free to stop by the Okera booth for more detail there. FINRA would be a great example of a customer that we work with to help solve this problem at scale, across the different types of engines and compute platforms that they need. So that’s one example. And then we’ve got a follow-up question on whether it uses other access control tables for access control. I would say not necessarily directly, but you could think of it as: the policy store that exists as part of the enforcement agent, or this data authorization server that we saw in the diagram, is where the policy is stored and needs to be controlled. That said, some policy might say, hey, in order to decide who can see some data, that needs to be joined with some other table; typically you can think of consent management and so on. And so that’s a type of policy that you might want to express as well. And then that can be handed, say, to Spark to execute, and Spark will use the enclave to go and access both of those tables, as an example. Thank you, everyone, and please join the channel for any follow-up questions as well.

[00:25:15] Thank you. Have a good one. Thanks, everyone.