

Subsurface LIVE Winter 2021

Centralized Security and Governance in the Cloud

Session Abstract

Digital transformation is driving data migration to the cloud to leverage the benefits of elasticity and scale. Tools like Dremio speed up and simplify access to discrete data sources. However, managing data privacy and security is still a challenge in the new norm of heterogeneous and highly diversified cloud environments. Apache Ranger is the de facto security solution for big data. With new integration support, Apache Ranger is now integrated at the core of Dremio. This talk will highlight new capabilities for centralized access control and how they can be used to provide robust security and governance.

Presented By

Don Bosco Durai, CTO & Co-Founder, Privacera

Bosco is a seasoned entrepreneur and a thought leader in enterprise security. Through his first startup, Bharosa (acquired by Oracle), he built one of the first real-time fraud detection products. His next company, XASecure (acquired by Hortonworks/Cloudera), built security from the ground up for the Hadoop ecosystem. The XASecure product was donated to the Apache Software Foundation and became Apache Ranger. Apache Ranger is now the de facto security solution for big data and has been adopted by all major cloud providers. Currently, Bosco is the CTO and co-founder of Privacera, which is simplifying data access governance in the cloud.


Webinar Transcript

Anita Pandey:

Welcome, everyone. My name's Anita Pandey. I'll be your moderator for this session. And I want to welcome Don Bosco, who is the CTO and co-founder of Privacera, talking on a really important topic, centralized security and governance in the cloud, also one of my favorite topics. And I think we're going to learn a lot today.

But before we get started, I [00:00:30] just want to do a little bit of housekeeping. If you have questions, feel free to put them in the chat, and I'll be moderating and looking for your questions throughout the session. We will answer them in the 10 minutes before the end of the session. Alternatively, if you want to do live Q&A in the last 10 minutes, feel free to enable your video and audio in the top right-hand corner. [00:01:00] And so you can do it that way.

And then last but not least, I just wanted to remind you that … Excuse me there. Slack is always the thing that interferes. We wanted to remind you that we have an expo hall open for you to go and check out deep dives and demos. We have a virtual sandbox with Dremio. And we also have fun giveaways. So please remember to go to those. Any questions we don't get to in this session, I'll be [00:01:30] posting in Don's Slack channel so you can engage with him post-event. With that, let's get started. Thanks, Don. Take it away.

Don Bosco Durai:

Yeah. Thank you, Anita. Hello, everyone. I'm Don Bosco Durai. I am the CTO and co-founder of Privacera. At Privacera, we are doing security and governance in the cloud. I'm also a PMC member and committer on the Apache Ranger project. So today I'm going to be talking primarily about [00:02:00] security and governance in the cloud. We will see how Ranger can help in addressing some of these challenges, and we'll also have a quick demo of the new integration between Apache Ranger and Dremio.

So when it comes to data governance, the requirements are the same for both on-premise and cloud. We need to identify where we have PII and other sensitive data in our data sets. It is our responsibility to ensure that our customers' data is safe and secure. [00:02:30] This means we need to protect it with all the proper security controls. We are also responsible for complying with the various state, country, and industry compliance regulations. This means that in addition to access control, we need to make sure that we store audit records and are able to generate the required compliance reports.

In the on-premise world, things start out somewhat simpler than in the cloud. The main reason is that we [00:03:00] have a very good understanding of our ecosystem. Mostly it is our databases. We know exactly what is in which database. We know who has permission. And we also know what purpose they are using it for, right? But when it comes to the cloud, unfortunately it's not that simple, right? You are opening up the whole data set to a larger user group across multiple lines of business, right? And the cloud itself, while it gives you a lot of flexibility, [00:03:30] also adds a lot more complication. For example, you will have data stored in different formats, in different tools, and in different services.

And it will also be accessed by different tools and for different purposes, right? So if you look at the diagram out here: you have the bottom layer, which I'm calling the storage layer, where you have data in object storage like S3 and ADLS with different file formats like Parquet or ORC. [00:04:00] Then you also have data in RDBMSs like Redshift, Snowflake, and Synapse, right? In the layer above that, you have EMR, [inaudible 00:04:12] SQL, which provide you with a SQL layer; you can run SQL queries directly on that. And then there are other databases also, which provide their own SQL layer on top of this one, right? And in some ways that access is very similar to [00:04:30] databases when they access data from the underlying object store, right? Then on top of that, you may have virtualization tools like Dremio and Trino, which give you a good way of abstracting the data source and making it easier to use.

They can also help with performance, with caching and all those things. And you will have business users also trying to access the data, using different BI tools [00:05:00] like Power BI and Tableau, and they have access to so many data sets, dashboards, et cetera. Now, given the different roles and responsibilities within your enterprise, unfortunately, you have to give access at each and every layer. I wish you could just say, "Oh, everyone accesses through the virtualization layer." But unfortunately, that is not possible, right? Business users will primarily be using some of the BI tools, and in some cases, they may also have access to the virtualization [00:05:30] layer, while data analysts and others will have access to both the virtualization layer as well as the underlying SQL layer, right? Then data scientists and data architects, for some of their processes, need access to the underlying raw data; they need access to S3 and ADLS, and they run Spark jobs on it.

So in other words, you can't say, "Okay, I'm going to just make the users go through one layer and make it easier." And when it comes to the tools themselves, each [00:06:00] and every tool has its own way of managing policies and permissions, right? If you pick S3, it gives you bucket-level access control permissions. ADLS, on the other extreme, gives you fine-grained permissions, like execute on a given file, right? But it's very difficult to manage, right? Now, if you take tools like Apache Hive on EMR, you can actually integrate with Apache Ranger and get a pretty [00:06:30] decent level of access control, but Athena probably doesn't have the same granularity or the same way of managing the policies. So it just becomes much more difficult when it comes to the cloud.

So because of all the different tools and different ways of managing the policies, it becomes very fragmented, a very unmanageable situation. And since most of the tools are built for managing policies [00:07:00] manually on a [inaudible 00:07:02], it becomes really difficult to keep the policies consistent across different tools. This increases the risk to data security and can potentially lead to non-compliance. And because of that, from the business point of view, you'll be very hesitant. You may either restrict giving data to different groups, or it may just lengthen the process of providing access, which may in turn [00:07:30] lead to loss of value, because the data is very time-sensitive, right? And in addition to all this complication with just managing the policies, there are different parties involved. You have the data privacy group, which is more focused on requirements like state, country, and other industry regulations. So they have their own set of requirements, while the security team is much more interested [00:08:00] in unauthorized access, data leakage, SIEM, and other integrations. The data owner is more concerned about how the data is used, right? They want to know the purpose, they need to know who's using it and why they're using it, and they want to restrict access whenever possible.

So how can we solve this problem, right? The first thing, I think, is we need to make [00:08:30] sure that we have a centralized access control tool. I think there's no way you can manage all of them independently, right? Once you have the central management tool, you can consistently manage the permissions across all the tools, right? Now, you also have to be careful, because each and every tool has a different way of representing what I call a resource. They have a different way of representing the [inaudible 00:08:56] operation: in some places it's select and update, and in others it's read and write. [00:09:00] And then there is the other terminology, concepts like users and groups and roles, right? For example, an IAM role is not the same as a database role, and a database role is not the same as an AD group.
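To make that abstraction concrete, here is a minimal sketch, not from the talk, of normalizing each tool's operation names onto one canonical vocabulary; the tool and operation names below are illustrative examples, not an exhaustive mapping:

```python
# Illustrative sketch: map each tool's operation names onto one
# canonical vocabulary so a single policy engine can reason about them.
# Tool and operation names here are examples, not an exhaustive list.
CANONICAL_OPS = {
    "hive": {"select": "read", "update": "write"},
    "s3":   {"GetObject": "read", "PutObject": "write"},
    "adls": {"read": "read", "write": "write", "execute": "execute"},
}

def normalize(tool: str, operation: str) -> str:
    """Translate a tool-specific operation into the canonical verb."""
    try:
        return CANONICAL_OPS[tool][operation]
    except KeyError:
        raise ValueError(f"unknown operation {operation!r} for tool {tool!r}")

# The same normalization is needed for identities: an IAM role, a
# database role, and an AD group must resolve to one user/group model.
assert normalize("s3", "GetObject") == normalize("hive", "select") == "read"
```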

So you have to make sure that when you are building the abstraction, it's consistent across all of them, right? Next is data classification and attributes, right? While it's very easy [00:09:30] to manage the policies at the individual source, like taking a table and doing a grant or revoke on it, which is easier to understand, in the big data world in particular, whether it is Databricks or Snowflake or any other major platform, it is very difficult to manage at the source level, right? So you need to start using other ways of managing it. Some of them are going to be like, "[inaudible 00:09:57] I'll give you database access control," right? Can we [00:10:00] use tag-based policies, right? And even this has its own subjects to it: how do I discover the data, how do I tag it, right? And when it comes to attributes, what does attribute mean? Where am I going to get my attributes, right?

Are they going to be user attributes, which you can probably get from your AD/LDAP? If the user who's accessing the data needs to be certified, you may have to go to a third-party system, like an HR system, [00:10:30] to get that, right? And in some cases, the attributes could be in your data set itself; for example, the end user might have given consent on whether their data can be used for marketing or not. So you have to make sure that as you're building this solution, you can leverage all of this to make your life much easier, right? Then, you also have to make sure that you can automate this whole process, right? Because as I said, managing manually is difficult. Making it centralized helps you a little bit, but if you can [00:11:00] somehow integrate with your CI/CD, Jenkins, or some other mechanism, that makes your life much easier. Finally, auditing is a core requirement for almost all groups. Unfortunately, what each of them is looking for is slightly different. Some may be looking for anomaly detection, some primarily for compliance reporting. So you need to make sure that you are able to generate the different types of reports each of them needs.
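As a hedged illustration of that enrichment step, an access request might be decorated with attributes pulled from AD/LDAP, an HR system, and consent flags stored alongside the data before policy evaluation; every interface in this sketch is hypothetical, standing in for the systems named above:

```python
# Hypothetical sketch of enriching an access request with attributes
# from external systems before evaluating policy. None of these
# interfaces are real APIs; they stand in for AD/LDAP, an HR system,
# and a consent store as described in the talk.
def enrich_request(request, ldap, hr_system, consent_store):
    user = request["user"]
    request["attributes"] = {
        "groups": ldap.groups_of(user),             # e.g. from AD/LDAP
        "certified": hr_system.is_certified(user),  # e.g. from an HR system
        "consent": consent_store.marketing_consent(request["data_subject"]),
    }
    return request  # the policy engine can now condition on these attributes
```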

So let's see how Apache Ranger [00:11:30] can help you out. First of all, Apache Ranger is the de facto tool for securing big data. It was pretty much built ground-up to address the security and governance challenges that we face in the cloud, right? It provides a centralized policy management tool. From one console, you can now manage almost all the services in the cloud, right? It has built-in [00:12:00] support for [inaudible 00:12:01] access control. You can get attributes from AD/LDAP, but at the same time, it is also very flexible. If you have attributes in other systems, you can easily extend it and enrich the request with information from those systems as well. So it gives you flexibility; you're not dependent on the tool to be [inaudible 00:12:23]. When it comes to tag-based policies, it has first-class support for them. It was always built [00:12:30] with the intention that customers will use tag-based policies. In the open source world, Apache Atlas is a good, complementary product, but over a period of time, a lot of other vendors have integrated as well. So you can do tag-based policies directly in Ranger itself.

Another important thing is the support for multitenancy, right? As you are moving your data from on-premise to the cloud, [00:13:00] pretty much each and every line of business is doing the same, and they are pretty much coming into the same shared account, right? So as you're getting the data into the cloud, you want to make sure that each and every data owner can manage their own policies without worrying about someone else getting access, right? So you want to make sure the tool supports multitenancy as a first-class feature. With Ranger, we built it ground-up so that, with a concept [00:13:30] called security zones and others, you can isolate your data set: who can manage it, who can view it. And you pretty much have nice multitenancy within the same setup.

And then the next one is about APIs, right? Pretty much everything that you can do from the UI, you can do using the API. And the good thing about Apache Ranger is you have one single endpoint, [00:14:00] right? Meaning you have one endpoint where you can make a call to set the policy, regardless of what service you manage, and the [payload 00:14:09] structure is almost the same. So if you are keeping your policies in, say, Git or similar, you can store them as JSON and YAML, and you can have your own process to apply them. It's very simple and straightforward. And in open source, there are also others who have created utilities and Python scripts, which you can use [00:14:30] to further automate the process.
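As a rough sketch of that kind of automation, creating a policy over Ranger's public v2 REST API might look like the following; the host, credentials, service, and resource names are placeholders, and the payload shape should be verified against your Ranger version's API docs:

```python
# Minimal sketch of creating a Ranger policy over its REST API.
# Host, credentials, service and resource names are placeholders;
# verify the payload against your Ranger version before relying on it.
import requests

RANGER_URL = "http://ranger-admin:6080"   # placeholder host
AUTH = ("admin", "changeme")              # placeholder credentials

policy = {
    "service": "hive_service",            # the Ranger service definition name
    "name": "sales_select_for_bosco",
    "resources": {
        "database": {"values": ["sales"]},
        "table":    {"values": ["sales"]},
        "column":   {"values": ["*"]},
    },
    "policyItems": [{
        "users": ["bosco"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

resp = requests.post(f"{RANGER_URL}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print("created policy id:", resp.json().get("id"))
```

Because the endpoint and payload structure stay largely the same across services, the same script could target other services' policies by changing the service name and resource keys.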

Then coming to audits: the way Apache Ranger does audits is it collects them from almost all the services [inaudible 00:14:50] as is. So the structure is exactly the same, and mostly it's in JSON format. You can then send these audits to [00:15:00] any external SIEM system or Splunk. You can save them into S3 or ADLS and run a Spark job on them, or load them into a database and query them, or you can send them to a messaging system like Kafka and do a lot more real-time analysis on them. So all these flexibility features are available from the auditing point of view.
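For instance, a hedged sketch of tailing those audits from Kafka might look like this; the topic, broker, and exact field names are assumptions, since Ranger's JSON audit schema varies by version:

```python
# Sketch of consuming Ranger's JSON audit events from Kafka for
# real-time analysis. Topic, broker, and field names are assumptions;
# check your deployment's audit schema before using them.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ranger_audits",                       # placeholder topic name
    bootstrap_servers="broker:9092",       # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    audit = record.value
    # Alert on denials; allowed events could instead be archived to
    # S3/ADLS for compliance reporting, as mentioned above.
    if audit.get("result") == 0:           # 0 = denied in Ranger audits
        print(f"DENIED user={audit.get('reqUser')} "
              f"resource={audit.get('resource')} "
              f"access={audit.get('access')}")
```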

And finally, there's the advantage of open source, and it's also very extensible. [00:15:30] So if something new comes up, you don't have to wait for the community to support it. You can actually write your own plugin. It's really, really straightforward. I've seen many different enterprises do it; they all write plugins for their own internal services. And there are also other vendors who are supporting it. So let's go to the next slide, where I talk about the adoption of Ranger. As of today, [00:16:00] almost all the cloud providers have adopted Ranger. Take EMR: they made an announcement like two weeks ago that they actually package the Ranger plugin along with EMR. So Hive and Spark have first-class integration. All you need to do is set up one or two properties, and you can have Ranger in EMR today, right? HDInsight on Azure has always been using Ranger as part of its core offering. [00:16:30] Even Google Dataproc now includes Ranger as first-class support. Then in addition to that, other services like Dremio, Trino, and Presto have always been made to support Apache Ranger. Other vendors like Cloudera have thousands of their customers using Ranger on their platform. And even folks like us at Privacera [00:17:00] have extended Ranger to support cloud-native services like Redshift, Synapse, and [Athena 00:17:05], and other third-party services, like Snowflake, Databricks, and Starburst.

So, talking about Dremio and Apache Ranger: Dremio has supported Apache Ranger for a long time. But it was really only applicable to the Hive metastore, and it also had some limitations, [00:17:30] right? So what the Dremio team has done recently is build first-class support for the Ranger plugin into their core implementation. Now, Dremio does a very good job of caching. If you have two users, User 1 and User 2, and both are running a query against the same table, say table one: when User 1 runs the query, [00:18:00] the data is fetched fresh from the underlying data source. Then when User 2 runs it, it doesn't go to the data source; it gets it from the cache. So it's very highly performing.

The challenge comes when, let's assume, User 1 has access to the underlying data source, but User 2 doesn't. In this caching model, User 2 would get access to the data because it's coming from the cache. The underlying system, in this case Hive, may [00:18:30] not be able to enforce anything because the call is not even reaching Hive. So your security model is going to break, right? So what the Dremio team has done is they actually refactored their code and introduced the Ranger check before accessing the cache. In other words, when User 2 tries to run a query, before serving the result from the cache, Dremio knows what data is backing the cache, and it will call the Ranger plugin before [00:19:00] serving from the cache to see whether the user has permission to that data.
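In pseudocode, the idea is roughly the following; this is a simplified sketch of the check-before-cache pattern just described, not Dremio's actual implementation:

```python
# Simplified sketch of the check-before-cache idea: authorize the user
# against every dataset backing a cached result before serving it, so
# the cache can no longer bypass the underlying source's access control.

class AccessDenied(Exception):
    pass

def run_query(user, sql, cache, datasets_backing, ranger_is_authorized,
              execute_against_source):
    # The added step: the policy check happens even on a cache hit.
    for dataset in datasets_backing(sql):
        if not ranger_is_authorized(user, dataset):
            raise AccessDenied(f"{user} has no permission on {dataset}")

    if sql in cache:                         # cache hit, now safe to serve
        return cache[sql]
    result = execute_against_source(sql)     # cache miss: go to the source
    cache[sql] = result
    return result
```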

So only if the user has permission on the underlying data set will the result be served from the cache. That way, no matter how you're accessing Dremio today, whether through the portal or through JDBC or any other means, you will always get the same consistent result. And you're also not depending on the underlying data source anymore. That means if you are connecting to S3 and [00:19:30] running a SQL query on top of it, you'll get the same level of coverage, and I'll show that as part of the demo. So what I'm going to do is a quick demo of the new integration between Dremio and Ranger, okay? I'll go through a couple of use cases: first resource-based policies, which are pretty simple, then tag-based policies, and I'll also connect to S3 to see how we can run a query and do [00:20:00] access control. So I'm just going to switch over to the demo. I hope you can see. This is the Privacera version of Ranger. Our core is Apache Ranger, but we have built a lot on top of it, making it easier to use. We have entitlement and workflow approvals, discovery and compliance workflows, and quite a few additions that we have done [00:20:30] to make it more enterprise-ready.

I hope you guys can see it. We kept a few of the things as they are, because a lot of people are familiar with Ranger; we want to make sure they can continue using it without relearning the tool. I'm going to go to Dremio right now, and I'm going to just run a query against this database and table. So there's a sales database and a table "Sales". And as you can see, I [00:21:00] was denied. So, coming back to Privacera and looking at the audit, you can see that at this time, 1:36, I tried to access this database and table and was denied, right? Because there's no policy in Ranger right now which allows it, right? So I'm going to go... I already pre-created a policy just before. This is typical of [00:21:30] what we support: most of the [inaudible 00:21:33] EMR, Kinesis, Kafka, Snowflake, [inaudible 00:21:39], and the Google ones.

Dremio is one of them. So I'm going to enable one of the policies. We created a policy on the database "Sales", for this table, for user Bosco, with select permission. I'm going to enable it right now. So it is now enabled. [00:22:00] Now if I go to Dremio and run the same query again, it should be able to get the [inaudible 00:22:08] back, right? What is happening behind the scenes is that the Ranger plugin is part of Dremio, so it doesn't matter how we are running the query, right? When I try to access this database table, the Ranger plugin intercepts it, checks within Ranger whether the user has permission, and then allows it. So [00:22:30] in this case, I was allowed permission on this database table. And this was the policy ID, which is the one I just enabled, for this table and user…

Okay, so that was a resource-based policy. What I'm going to do now is switch over to tag-based policies. Obviously, as I mentioned, managing at the source level, while very easy to do, can sometimes be too painful. So you want to simplify it, right? [00:23:00] So I have another database and table. This query will also fail, because I don't have any policies. Now, instead of going and creating a policy at the source level, I'll create one policy, which is more of a global policy, saying that anything which has been tagged with this tag [00:23:30] (I'm just using one tag right now; it could be other tags also) is allowed for the user Bosco when accessing through Dremio, right? So now if I go... I think I saved this... So now if I go and run the same query, I should be able to retrieve the data. The same thing happens: it goes to the plugin, the Ranger plugin sees there's a [00:24:00] tag-based policy, and it allows it.
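For reference, a tag-based policy like the one in the demo might look roughly like this as a Ranger payload; the tag name "sensitive" and the service name are hypothetical, since the actual tag used in the demo was not captured by the transcript:

```python
# Hypothetical sketch of a Ranger tag-based policy payload. The tag
# name "sensitive" and the service name are illustrative only; the
# actual tag used in the demo was not captured in the transcript.
tag_policy = {
    "service": "tag_service",   # attached to the tag service, not Hive/Dremio
    "name": "allow_bosco_on_tagged_data",
    "resources": {
        "tag": {"values": ["sensitive"]},
    },
    "policyItems": [{
        "users": ["bosco"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}
# Because it targets a tag instead of a specific database/table, the
# policy applies to every resource carrying that tag, across all
# services linked to the tag service -- no per-table grants needed.
```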

Anita Pandey:

Okay.

Don Bosco Durai:

[inaudible 00:24:14] No, sorry, here it is. So this is the customer database, this is the customer data, and this is the policy, okay? As you may have noticed, right now I'm still connecting to data which is in Hive. What I'll do [00:24:30] next is try to run a query against a data source which is in S3. So this is the bucket and this is the file. I created a data source and a table representation on top of it. And obviously I got rejected, because I don't have a policy. We can quickly enable it.

Anita Pandey:

Hi, Don, just a quick call-out that we have five [00:25:00] minutes left. So-

Don Bosco Durai:

Thanks, Anita. I’ll just do this very quickly.

Anita Pandey:

No worries. Go for it.

Don Bosco Durai:

Thank you, Anita. Okay, I just enabled this one. And if I go and run the same query, it will allow me. So the underlying data source is S3, but as far as [inaudible 00:25:18] is concerned, it's a table; the users see it as a table. And from the admin perspective as well, it's just like another table. I think that is what I wanted to show. So we can now open up for [00:25:30] questions and answers.

Anita Pandey:

Appreciate it, Don. Okay. So is Apache Ranger available for Databricks?

Don Bosco Durai:

I’m sorry for what?

Anita Pandey:

For Databricks.

Don Bosco Durai:

Yes, it is. Okay, so Apache Ranger is available for Databricks. It may not be available in open source, though, but the Ranger plugin is available. Right now, it's provided by us, Privacera. So if you have a Ranger ecosystem and you're [00:26:00] using Databricks, you can use Databricks with our plugin.

Anita Pandey:

Got it. Is data virtualization a part of the governance and access management? How do you make sure that performance is not compromised?

Don Bosco Durai:

Actually, it's a good question, and that is exactly what we are trying to avoid in the Ranger world. We don't want to go through data virtualization, because we feel... If you already [00:26:30] have a tool like Dremio or Trino, they are doing all the heavy lifting of virtualization, so you'd rather just go with a real virtualization tool than a security company's, right? [inaudible 00:26:40] Our philosophy is to have a native integration with each one of them. So if it's Dremio, as you saw, we have a plugin within Dremio itself; you don't need one more virtualization layer on top of it, Dremio itself will do it. EMR is a [inaudible 00:26:59]. So if you're accessing [00:27:00] EMR directly, you'll have what you mentioned. There are some tools, like Snowflake and Redshift, where you can't put plugins in. In that case, we translate the Ranger policies directly into the database policies. And in that case too, you are accessing Redshift and Snowflake directly, so Ranger is not in the path. So you will not have any performance degradation. You don't need another [00:27:30] infrastructure setup for that, and the cost is the same anyway: if you had defined your policy directly in Redshift or Snowflake, you would have observed the same cost.

Anita Pandey:

Got it. Thank you for that. And just a clarifying point from Gerhard here is, “Does Ranger get the Dremio users?”

Don Bosco Durai:

Yes, so the way the Ranger plugin works is it lets the service do the [00:28:00] authentication, with whatever authentication mechanism that service supports. Whether it's a password, [or Kerberos or SAML 00:28:09], it doesn't matter. It will take it, but it's possible to do a translation of the user into the [canonical 00:28:20] one. So if, let's say, Databricks does a login using [inaudible 00:28:23], then you can translate that into a user which will be the same as the AD [inaudible 00:28:29].

Anita Pandey:

Got it. And can [00:28:30] you address Atlas and the Atlas-Ranger integration?

Don Bosco Durai:

So Atlas and Ranger, for whatever it's worth, were done by the same team. So essentially they were built to be complementary. Atlas has the metadata catalog and everything, and it pushes the tags to Ranger. And you can populate the data catalog either through auto-discovery [00:29:00] or manually. But Ranger is not dependent on Atlas; you can push the tags directly to Ranger. Atlas just gives you an out-of-the-box, easy integration.

Anita Pandey:

Interesting. And we have time for one more question. Can you integrate with Active Directory for Snowflake?

Don Bosco Durai:

Yes, it's just done a different way. What happens is you will be integrating [inaudible 00:29:27] into Snowflake. That is how you want to do it, [00:29:30] but it could be an email. Now what we do is... AD has an email-to-user mapping, so we map the email to the user. From the administration point of view, you still see the policy at the user level, not the email level, and the mapping is done via [inaudible 00:29:49].

Anita Pandey:

Got it. And just a request here to share your presentation, which we will be doing, folks. So I think that's [00:30:00] all the time we have for questions, but you can continue the conversation with Don in the Slack link that I just sent; that's his dedicated link. And also just a reminder to go to the Expo Hall, which is open now, for demos as well as fun giveaways, and to try our virtual lab.

Don Bosco Durai:

Yeah, thank you very much Anita. Thanks, everyone.

Anita Pandey:

Thanks again, everyone. And thank you, Don. Have a good day. Bye.