Subsurface LIVE Winter 2021
Scaling Data Access and Governance on Data Lakes: Challenges and Common Approaches
With the rise of data lakes and adjacent patterns such as lakehouse, data teams have been able to leverage the increased scale and agility in their use of data. At the same time, the call for responsible data use, privacy regulations and focus on security has only grown, requiring new approaches for the data lake world. This talk will discuss the challenges that teams face when trying to secure their data lake access, common approaches and trade-offs, as well as what the future holds.
Itay Neeman, VP of Engineering, Okera
Prior to joining Okera, Itay Neeman worked for Splunk, serving as both the engineering and site lead for the Seattle office while also personally contributing and leading several major projects, including a new product for Splunk’s cloud initiative. He has a master’s degree from the University of Washington and a bachelor’s degree from Brown University, both in computer science.
(silence). [00:01:00] I just realized that I was on mute. Hello, everyone. Welcome to this session. We are going to get started in approximately two minutes. Please grab a seat, get some water. [00:01:30] We will be right with you.
[00:03:00] Hello everyone. Once again, thank you so much for being here today. I want to welcome you to this session, Scaling Data Access and Governance on Data Lakes: Challenges and Common Approaches. Presented to you by Itay Neeman, VP of Engineering at Okera.
Before we start, there are a couple of housekeeping items that I would like to run by the audience. We’re going to have a live Q&A session at [00:03:30] the end. If you would like to participate, I will go ahead and prompt you. You will have to enable your camera and microphone to make it interactive. Otherwise, just ask your questions on the chat window that you will have on the right-hand side of your screen.
Also, please don’t forget to join our Slack community. And as well, visit the Slido tab on your browser. Where you can leave feedback on the conference and the presentation. Without further ado, [00:04:00] Itay, the time is yours. We have you on mute, Itay. I’m sorry. Sorry to interrupt.
No problem. Thank you for telling me. That would have been good to find out now rather than five minutes in. Thank you everybody for coming. Really appreciate you attending the presentation. Hopefully everybody had a chance to eat some food and is now in a good lunch lull to digest all of this information on this very exciting topic. We’ll talk today about [00:04:30] Scaling Data Access and Governance on Data Lakes. The challenges that we have, and common approaches that people take to solve them.
Just very quickly about me: My name is Itay Neeman, and I lead the engineering team here at Okera, where we help some of the world’s largest organizations solve their data access and governance needs safely, securely, and responsibly. Before Okera, I spent many years at Splunk solving the same types of challenges, and really helping people scale to their data needs. [00:05:00] As well as at Microsoft, building some operating systems there.
Before we start, it’s helpful to think about what the problem is. Securing data access and governance is really hard. I think we’ve heard this all throughout the sessions and keynotes during the conference, if you had the chance to go to those. And it’s only become increasingly important. It’s helpful to take a little walk back to understand how we got here. What led us to this place, [00:05:30] where we are now challenged in terms of data access and governance?
First, let’s go through the sequence of events, at a technical level, that has gotten us to today. If you think about the last 20 years and the data landscape, you really see the first part of the century, 2000 to 2010, focused very much on what we call single-engine, single-solution data platforms. Where you would have one RDBMS, or maybe you would have a data warehouse like Teradata. [00:06:00] And that would be where all of your data is. And so if you needed to secure something, you could just do it inside that tool.
And then that evolved into the single distributed platform. But still one platform. And this is really the promise of Hadoop, where everything would be stored in HDFS. And then you would launch Cloudera. And you would use Hive and Impala and then Kudu on top of it to go and access your data. And then it was very easy for you to say, “Hey, I’ll [00:06:30] just use the security that Cloudera or Hadoop gives me on top of my data.” Because everything was really in this one place. But what’s happened over the last five years is we now have what we are here to talk about today, really, which is the data lake. Where you have lots of different engines. You have Dremio, Presto, Spark, Snowflake, and so on.
And then you have many different places where the data is stored, like S3, Azure Data Lake Storage, Google Cloud Storage, HDFS. [00:07:00] And all of these are talking to each other. And you want to make sure that you’re able to have security and governance that works across all of these. And while we’ve seen this explosion on the technical side, we also now see this explosion on the privacy and regulation side, where data privacy has become ever more urgent. On one hand you have regulations that have come on board, like GDPR, CCPA, and LGPD [00:07:30] in South America, as well as the enforcement aspect of it. And so you can see how the number of fines has only increased over time and is really accelerating. Every week you now hear of a new company being assessed a fine, whether it’s in California or in Europe and so on. And you need to make sure that fundamentally you are not going to get fined.
And together with this complexity on the technical side and the privacy side, we’ve also increased the sophistication of the types of policies [00:08:00] that we want to write. Before, we would be very happy with a policy like: every sales analyst, or every person that’s in the sales analyst Active Directory group, should be able to access the sales transactions table. And everybody else should be rejected. A very binary yes-or-no answer. But today we see our customers and other users want to be able to define much more sophisticated policies. For example, here is a real policy that we’ve helped a customer implement. [00:08:30] Every sales rep in your organization should be able to access the sales transactions data. But they should only see data, or rows, that belong to their home region. And their home region might be defined in Active Directory, as an example.
And any PII data should be masked if that particular sales record belongs to another rep. But if it belongs to the sales rep that’s running the query, they should actually be able to see that data. And then [00:09:00] there’s another part of the policy on top of that. Which is to say, if you’re a sales rep, but also a sales director, then you should be able to see data for all the reps that report up to you. Now you’re bringing this hierarchical HR information into it. And then you also want to make sure that you’re seeing this across all regions. For those reps, the region piece doesn’t apply anymore. But for everyone in the system, ensuring that you can only see data for Swiss customers if you’re a Switzerland-based employee. This has to comply with Swiss data residency [00:09:30] and data access law.
And so you can now see how you have this very sophisticated layering of a policy. That is now taking into account user information: their region, their reporting structure, where they’re based. As well as data information: What data is sensitive? What data comes from what region? As well as environmental context: where are you running this query? And now we’re also no longer in this yes-you-have-access or no-you-don’t, very simple binary [00:10:00] decision. You actually have to make much more sophisticated decisions. You need to say, yes you can have access, but some data is going to be filtered out, or some data is going to be masked. Or some data is going to be joined in to give you some extra context. And so you now have this more sophisticated policy that you need to be able to apply, to address your modern needs, in your engine.
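As a rough illustration, that layered sales policy can be sketched as a single evaluation function. This is a minimal sketch with hypothetical field names (`rep_id`, `home_region`, `reports`, `customer_name`), not how any particular engine or Okera implements it:

```python
def mask(value):
    """Redact all but the last two characters of a PII value."""
    return "x" * (len(value) - 2) + value[-2:]

def apply_policy(user, rows):
    """Return only the rows this user may see, masking PII where required.

    user: dict with 'rep_id', 'home_region', and 'reports' (the set of rep
    IDs that report up to this user; empty for non-directors).
    rows: sales records with 'rep_id', 'region', and 'customer_name' (PII).
    """
    visible = []
    for row in rows:
        is_own = row["rep_id"] == user["rep_id"]
        is_report = row["rep_id"] in user["reports"]
        # Directors see their reports across all regions;
        # everyone else only sees rows from their home region.
        if not (is_own or is_report or row["region"] == user["home_region"]):
            continue
        out = dict(row)
        # PII is masked unless the record belongs to the querying rep.
        if not is_own:
            out["customer_name"] = mask(out["customer_name"])
        visible.append(out)
    return visible
```

The point of the sketch is that the decision is no longer allow/deny: the same query returns different rows, with different columns masked, depending on who is asking.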
And to then bring this all together, we took what one [00:10:30] organization that we work with has in their environment. And you can see this plethora of tools. Some of them quite overlapping. They have many compute engines that they’re using for different use cases in different departments, but looking at similar data. Like Dremio, Spark, Snowflake, and so on. They have data stored in a variety of places like S3. They’re using multiple data formats like Delta Lake or JSON files, and they have Mongo as well. They have multiple [00:11:00] ways to catalog the data. So they’re working with Glue as a technical catalog, Collibra as a business metadata catalog. And then they’re accessing the data from a variety of places: Tableau, RapidSQL, Python, MicroStrategy, Denodo. There are some that we didn’t list as well. And they’re also looking to add more with time. They’re looking to go multi-cloud, not just AWS.
They’re looking to add Databricks. And from their perspective, the meme is a bit of a joke here. This is actually fine. This is what a real enterprise looks like. [00:11:30] It’s easy to say, “Oh, I’ll just go and use Dremio or I’ll just use Presto, or I’ll just use Spark.” But really you’re going to be using all of these on different engines. With different tools of accessing the data. And you want to make sure that your security and governance is working on all of these. Not in different ways, but on all of these in the same way.
We see the problem that we have today, which is we have this explosion of tools. We have [00:12:00] this explosion of data privacy needs. We have these much more sophisticated types of policies that we want to be able to create. But the other problem is that the common approaches that we’ve taken until now don’t really work in production. They work well in a demo, or when you have a very, very small use case. But as you’re deploying it to the rest of your organization, those approaches no longer work. But before we start thinking through what those approaches are, and what [00:12:30] trade-offs they’re forcing you to make: what would be great? If we could just wave a magic wand and say, “Abracadabra, security be solved.” What would that look like?
And so we think about this along three dimensions. What we call completeness, compatibility and consistency. Completeness means that you’re able to define rich and sophisticated policies. The policy that you saw before for the sales rep should be the norm. Those are the types of policies that you’re going to be able to define. And you’re [00:13:00] not limited by the tool or by your governance platform on what types of policies you can create.
You want to make sure that you have compatibility. That it works with all the tools that you care about. Remember that organization that we saw before. And you want to make sure that it works with the tool you have today. As well as the tools that you’re going to add in the future. Which you obviously today don’t know what those are going to be. So you want to retain flexibility and plan for that future.
And finally, you want to make sure that you have consistency. That your policies, your governance, are only [00:13:30] defined once. You don’t have to repeat yourself every time you want to add a new tool. And wherever that policy and that security access is being applied, it’s always going to do the same thing. So that you don’t have to think about, “Hey, if I access this data from Spark versus from Dremio, I’m going to see different results.” Because then you’re going to drive yourself nuts, and your users nuts too. And your chief data officer is going to tell you that you can’t give anybody access because you don’t have good security. And [00:14:00] we call these completeness, compatibility, and consistency. The three Cs of secure data access. An easy way to remember that.
And as we go through these common approaches and their trade-offs, think in the back of your head: which C are they violating? The first example, or the first approach, which is very, very common, is just to say, “I’ll go and create secure copies.” I have some data, maybe it has customer names or social security numbers in it. I’ll just create a copy of the [00:14:30] data that is secure. Meaning that I’ve removed that sensitive data, or I’ve masked it. And then I’ll just give people, using an AWS S3 policy or [inaudible 00:14:41] and so on, access to that data specifically. And if somebody needs access to the raw sensitive data, I’ll just give them access via IAM to the raw one. And that’s the benefit. It’s relatively simple. But the trade-off that you’re making is that your definition of secure (what data is sensitive, what do I need to filter [00:15:00] out, and so on) changes over time.
Today you might say a user ID is not sensitive. But then in a month you go and say, “Hey, now we accept email addresses as user IDs.” And now it’s become PII. So you need to go and mask that data. Now you need to go and recreate all of those previously secured copies again, to re-mask the data that’s already there. And similarly, different users and different use cases are going to have different variants of what security is needed for them. If [00:15:30] you think back to that example policy, some users are going to need data from the EU and some from the US. And some are going to need it for just reps one, two, and three. And now you’re forced to create multiple copies of the same data just for different users, or for different groups of users. And you’re losing your mind keeping track of: what data have I given to whom? How do I make sure that somebody doesn’t have access to two different kinds of data that they can combine together? You see the limitations of this approach.
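The secure-copy approach boils down to a one-time job like the following sketch (field names are hypothetical). The weakness is visible right in the code: the definition of sensitive is frozen into `SENSITIVE_FIELDS` at copy time, so when that definition changes, say `user_id` becomes PII, every existing copy has to be rebuilt:

```python
# Today's definition of sensitive. If "user_id" is added tomorrow, every
# copy produced with the old set is now leaking PII and must be recreated.
SENSITIVE_FIELDS = {"ssn", "customer_name"}

def make_secure_copy(records):
    """Produce a static desensitized copy by dropping sensitive fields."""
    return [
        {k: v for k, v in rec.items() if k not in SENSITIVE_FIELDS}
        for rec in records
    ]
```

And this is per variant: the EU copy, the US copy, the reps-one-two-three copy are each a separate run of a job like this, each of which goes stale independently.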
A similar approach [00:16:00] is that instead of creating static copies, you will say, “Hey, I’ll go and create views in Dremio, or in Presto, or in Spark, on top of my data. And I’ll just define my policy in the view.” You can see a simplified example here. But you’re actually being forced to make very similar trade-offs. First, a typical organization easily has hundreds, thousands, tens of thousands, or even more tables, or what [00:16:30] we think of as catalog objects. And adding a view to each of those to go and encode a policy creates this incredible explosion that you now need to go and manage. As schemas change and evolve, I need to go and update and manage all of these new catalog objects.
Similarly, you now have to duplicate the policy differently for each table. If you want to say, “I just want to mask social security numbers everywhere,” you now need to go, in every table that has social security numbers, [00:17:00] and create a view with the mask function. But if you now need to change it, say instead of masking you want to null it, as an example, you need to go and change a hundred views. It becomes very difficult to evolve, and to be able to create the types of policies over time that you need. And finally, it can be difficult to use user context in these views. Because you’re defining the policy statically on the object, but not in terms of the access. How do you make sure [00:17:30] that different users can go and see different things? Some engines support this, but then you’re sacrificing consistency.
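To make that explosion concrete, here is a sketch of what encoding the policy in per-table views amounts to (the table and column names are hypothetical, and it assumes the engine has some `mask()` SQL function). The masking logic is baked into every generated view, so changing the policy means regenerating and redeploying every one of them:

```python
def masked_view_ddl(table, columns, sensitive):
    """Generate a per-table 'secure view' that masks sensitive columns.

    The policy (which columns get mask()) is duplicated into each view's
    text; it cannot be changed in one place.
    """
    select_list = ", ".join(
        f"mask({c}) AS {c}" if c in sensitive else c for c in columns
    )
    return f"CREATE VIEW {table}_secure AS SELECT {select_list} FROM {table}"

# With thousands of tables, this loop is the maintenance burden:
# every schema change or policy change reruns it everywhere.
def regenerate_all(catalog, sensitive):
    return [masked_view_ddl(t, cols, sensitive) for t, cols in catalog.items()]
```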
Next, you can go and implement this as native policies. You can say, “Hey, SQL Server or PostgreSQL have good security capabilities. I’ll just define my policy in each one of these engines that I’m using directly.” But you’re losing that single pane of glass: having one way to be able to see what policies you have across all of your data, who has access to what, [00:18:00] and so on. And you’ll need to go into each engine and reconstitute that global view. You’re losing consistency, because not every engine can express and enforce the same types of policies. Some can do masking, some can’t. Some can do referential-integrity-preserving tokenization, others can’t. And so now you’re forced to choose where to implement what, or choose the lowest common denominator. Which leaves nobody happy.
And you obviously have to educate your users [00:18:30] how to manage those policies in each engine. Sometimes different engines are going to look at the same data, so they have to create the same policies twice. Which they’re not happy about. And as you go and add new engines, new tools, new storage into your system, you need to go and duplicate all the policies you created. And any new policy you have to define one more time in that new engine. It becomes this major new effort. And your governance platform has become an impediment to actually [00:19:00] adopting new technologies that are important for your business.
Next, you can say, “Hey, I’ll just use open source, like Apache Ranger. Or I’ll just build something internally.” Let’s take Ranger as an example. The challenge you have there is the approach that Ranger, and the other open source technologies, take: a plugin approach to enforcement. They really say, “Hey, you have a plugin for Hive. You have a plugin for Presto.” To be able to do the security enforcement that you need. [00:19:30] But what happens if you don’t have a plugin? For example, there are no open source plugins for Spark. Or what happens if each plugin implements different parts of security? You end up having this challenge where you lose consistency, and you lose compatibility, by having just plugins. And some things don’t support plugins at all, like PostgreSQL, or SQL Server, or BigQuery, or the cloud SQL-as-a-service tools as well.
Ranger forces you to define policies per engine. If you wanted to define a policy for [00:20:00] accessing data from Dremio, versus accessing data from Spark or Hive, versus accessing the data as a file, you have to create those policies for each one of those use cases separately. Forcing you to repeat yourself. And making it harder to evolve those policies.
And finally, these tools like Ranger delineate between the part that is really about access control, policy definition and management; the auditing piece, ensuring that it is well integrated and centralized; and cataloging, [00:20:30] which is: what objects do you have? How do they relate to each other? How do they relate to the policies? And all of these things really need to work in concert together. You want to make sure that the Venn diagram of them is really one circle, not three non-overlapping circles. And so it’s critical that you do that, and Ranger doesn’t fundamentally work that way.
And then the two catchalls, I would say, that people use. They’ll just say, “Hey, I’ll just use IAM on S3. [00:21:00] I will create an IAM policy of who can access what. And I’ll just give people access that way.” It’s very easy to think about, but the challenge is that you can only create very coarse-grained policies. You can’t do filtering. You can’t do masking. You can’t do conditional filtering or masking, by definition. So you lose out on that completeness aspect. It becomes hard to map organizational identity, things like Active Directory or Okta and so on, and the user context that comes with them, [00:21:30] to your storage identities, things like S3 bucket policies and so on. I don’t know how many people here have had the pleasure of needing to define policies for, or sync, IAM users with AD, but it’s not fun. And then obviously it’s not possible for all engines. If you have integrated compute and storage like Snowflake, you’re out of luck, because your cloud storage doesn’t support controlling the access on Snowflake.
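Here is a sketch of why storage-level IAM is coarse-grained by definition. This is a deliberately simplified model, not real IAM policy evaluation: the decision unit is an object path, so the only possible answers are allow or deny for the whole file, and there is nowhere to attach row filtering or column masking:

```python
def iam_allows(user_statements, bucket, key):
    """Evaluate simplified prefix-based allow statements for one object.

    user_statements: list of dicts with 'bucket' and 'prefix'. The answer
    is all-or-nothing: if any statement matches, the entire object
    (every row, every column) is readable.
    """
    for stmt in user_statements:
        if stmt["bucket"] == bucket and key.startswith(stmt["prefix"]):
            return True
    return False
```

Contrast this with the sales-rep policy from earlier: "allow, but filter rows and mask PII" is simply not a value this function can return.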
[00:22:00] And similarly, you see the same problem with securing it at the ingest layer: removing sensitive data before it lands in your Lake. Your definition of sensitive changes over time, so you have to go and re-ingest data. It’s difficult to control every entry point in the system. It’s very easy for sensitive data to accidentally land in the Lake. And then it’s hard to know what you’ve missed. If something accidentally landed in the Lake, you might not even know that there’s actually sensitive data there now.
We’ve talked about all of these common approaches and the trade-offs [00:22:30] that they force you to make. And the reality is, trade-offs are okay. And every organization is going to have some mixture of each. But what approaches do work? What is going to scale with your organization? And so we found three things that work no matter what you throw at them.
The first one is attribute-based access control. And the goal there is really to be able to define rich policies that are based on all of the attributes of the user, the data, as well [00:23:00] as the environment. And so you can create those sophisticated policies that we saw before on top of those attributes, not on top of the actual data itself. Allowing you to create very, very broad policies that can apply very fine-grained control on individual tables, individual databases, and so on. You want to make sure that you’re doing dynamic enforcement. Because that allows you to take your Lake and easily add security to it in a very incremental way, without spending [00:23:30] a lot of money or time doing that. And as your Lake changes, in terms of new volumes of data, new use cases, and new tools, you can keep maintaining security without needing to go and redo all of your existing security investment because, excuse me, something changed.
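As a rough sketch of the attribute-based idea (the tag names are hypothetical): the rule is written once against an attribute, a `pii` tag, rather than against any specific table or view. Any column carrying that tag is covered automatically, wherever it lives, which is exactly what the per-table view approach could not give you:

```python
def abac_mask(schema_tags, row):
    """Apply a single tag-based rule: mask every column tagged 'pii'.

    schema_tags: mapping of column name -> set of tags, which would come
    from a catalog rather than being hand-written per table.
    """
    return {
        col: ("<masked>" if "pii" in schema_tags.get(col, set()) else val)
        for col, val in row.items()
    }
```

When a new table shows up, only its column tags need cataloging; the policy itself is untouched.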
And finally, you want to ensure that you have distributed stewardship. That you are able to delegate control to other people in your organization, to be able to manage parts of your governance posture. But still have the visibility into what they’re doing, [00:24:00] what their users are doing. And the oversight, in terms of being able to limit what they’re able to do. So that they can’t go and give access to everything. Or they can only give access to the part of the Lake that they own.
And really what you want to make sure is that you’re always prepared for scale on many dimensions. It’s not just the scale of is your Lake a hundred terabytes, or a petabyte, or 10 petabytes. It’s also how many tables do you have? How many different use cases do you have? How many users do you have? How many regions does your organization [00:24:30] work in? How many regulations does it have to handle? And so scale comes in all kinds of different ways. And governance needs to scale across all of them. Because if one of them fails, it’s going to be very difficult for you to have the agility that you want.
And then as you’re adopting that type of governance posture. There are things you want to make sure that you don’t forget to think about. First is prepare for the future. Ensure that as you know the organization is eventually going to adopt another engine, another [00:25:00] analytics tool, another storage platform. You want to make sure that your governance strategy is going to be able to adapt to them.
You want to make sure that you’ve removed bottlenecks on people. That things are automated, that you can delegate. It’s very common to say, “Hey, the data platform team is the one that’s in charge of provisioning access for any user who needs it.” But suddenly there are 1,000 users. And they have to go to two people who can’t handle that. You want to make sure that you can delegate that to the data stewards, the chief data officer, [00:25:30] to be able to define the governance policies and so on. Keep in mind all the personas that interact with your data across your organization.
And finally, you want to make sure that you have visibility into what is happening. What are people doing? Your use cases are going to change. Your regulations will change. Your tools will change. And if you don’t have data on who is running queries, what they are accessing, who is accessing sensitive data, what data sets are not being used, etc., you’ll be flying blind and very challenged to be able [00:26:00] to adapt to any change that you need to make. And that will make you very cautious of making that change. Again, reducing your agility, and making security a blocker rather than an enabler.
If we bring all that together: you want to make sure to adapt to changing times. There’s no one true way to implement data security. Sometimes you’re going to need to secure data on ingest. And sometimes you’re going to want to apply security at the edge with a plugin. And sometimes you’re going to want to use a more proxy-based approach. [00:26:30] All kinds of different approaches. But you should embrace that fact and not say, “Hey, one of these is wrong, one of these is right.” Really, different tools and different use cases require different approaches, and we should embrace that. You want to make sure that you have composition. That you’re able to compose different techniques, different tools, different strategies. And that the governance platform that you choose allows you to go and do that.
And finally, I hope that I’ve left you with this impression: duplication [00:27:00] is bad. If you have duplication in your policies, in your enforcement, then you’re going to be very, very sad. Because it’s going to be very difficult for you to adapt and change over time. And you want to make sure that you’re taking into account the entire life cycle of data. Sensitive data exists in ingest, it exists in storage, it exists in access, it exists in reports and dashboards, and so on. It exists in the metadata. How do you manage all these pieces and make sure that they’re working together? And your governance and secure data access is taking all of [00:27:30] that into account.
And finally, you really want to think in terms of “and” rather than “or.” You want security to be an enabler. Because you’re secure, you can give access to people more quickly. You can adopt new tools more quickly. You can bring online new use cases more quickly. Because you have a flexible, scalable governance platform to be able to do that.
That’s it for the presentation. Hopefully that gives you a sense of [00:28:00] the different approaches that people take, and how they do that. I’m happy to answer your questions now, as well as in the Slack channel or the chat. Please stop by the Okera booth and the Okera Slack channel to talk a little bit more about our approach and how we do that, and to see the demo. And then finally, for folks attending the demo, we’re giving out a $250 gift certificate, as well as doing some user research. If you’re interested in having an impact on a secure data platform, it’s a great way for you to be able to do that. Thank you very much.
[00:28:30] Excellent. Thank you so much. I think we have a couple of minutes if you want to take some of the questions. I am going to go ahead and try to queue in some of the users that have selected to participate in the interactive Q&A. That has not been working well today. Well, let me see, I’m going to click the room. If they disappear, that means that they are not there. There you go. We don’t have interactive questions. But let me see, Adam, did you want to…? No.
[00:29:00] All right. There is a question here that came in a little earlier. I don’t know if it has been answered or not. It says: is there access control applied at the schema or table layer for the data lake? If so, in the AWS case, is locking down access to the S3 bucket required before using the product?
Yeah. It’s a great question. And I would say that, generally, you want to think about defense in depth, where you have defense, from a security perspective, at different layers. So [00:29:30] if you have sensitive data, let’s say an S3 file with social security numbers in it, you want to make sure that nobody is able to access that S3 file directly. Because they could circumvent any other security that you have. Similar to saying, “Hey, I don’t want people to connect to PostgreSQL directly, only through an API.” Just as an example from an application perspective.
Typically, you would lock down access to the S3 bucket, and then you would provide access through a secure platform. Which can still give you access to that data as a file, but ensuring [00:30:00] that that fine-grained policy is still being applied too.
Excellent. I think that is our time for today. I want to mention that it looks like Roslyn’s cat jumped in on her keyboard. There was a bunch of a’s, f’s and s’s. A little bit of humor there today.
Great. Thank you so much Itay. Wonderful presentation. And as you mentioned, please go ahead and visit the booth. And to the rest of the audience. Thank you so much for being here today. I hope you enjoy the rest [00:30:30] of the sessions today as well as tomorrow. And we will talk to you soon. Thank you so much. Bye-bye.
Great. Thank you so much. (silence).