
Note: Dremio only supports one data governance policy manager at a time, so you can use either Dremio or Ranger as a policy manager, but not both at the same time.
In this tutorial, we will walk through the Ranger-based policy enforcement workflow, exercise the different permissions that you can grant to Dremio users when using Ranger, and, last but not least, demonstrate how to implement row-level security controls.
Keep in mind that you will need Hive as well as Ranger in your environment, plus an accessible AD/LDAP server. We’ve also made a video available if you would rather watch how the workflow unfolds.
The steps that we are about to show work for both the Enterprise and Community editions of Dremio.
To get the most out of this tutorial, we recommend that you first follow the Getting Oriented to Dremio and Working With Your First Dataset tutorials. In addition, we will be using Ranger 0.7 and the latest deployment of Dremio.
For this tutorial, we will be using the same AD/LDAP server for both Ranger and Dremio, since we will be utilizing the same users and groups. We are also going to use pre-defined Ranger policies as follows:
| User | Member of | Vehicles (Engineering) | Engines (Engineering) | Customers (Marketing) | Sales (Marketing) |
|---|---|---|---|---|---|
| cjohan | Engineering | Yes | Yes | No | No |
| bharri | Marketing | No | No | Yes | Yes |
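If you prefer to script the creation of policies like these rather than defining them in the Ranger admin console, Ranger exposes a public REST API. The sketch below shows how a policy granting the Engineering group SELECT on the Engines table might be created; the Ranger URL, admin credentials, service name (hive_site_1), and policy name are assumptions for illustration, not values taken from the tutorial environment.

```python
# Illustrative sketch: creating one of the Ranger policies above via
# Ranger's public REST API (POST /service/public/v2/api/policy).
# The URL, credentials, service name, and policy name below are assumptions.
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"   # assumed Ranger admin URL
AUTH = ("admin", "admin-password")                    # assumed admin credentials

policy = {
    "service": "hive_site_1",                # assumed name of the Hive service in Ranger
    "name": "engineering_engines_select",    # assumed policy name
    "resources": {
        "database": {"values": ["production"]},
        "table": {"values": ["engines"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["Engineering"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(f"{RANGER_URL}/service/public/v2/api/policy", json=policy, auth=AUTH)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```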
In addition, we will be using a fictitious ‘Production’ database composed of 20 tables located in Hive; however, for this tutorial we will be applying the policies only to the four tables listed above. You will also have the opportunity to see how we interact with Ranger’s audit screen, as well as how Ranger security is enabled in Dremio once you connect to a data source.
In this diagram, we see the Dremio cluster on the right and the Hive environment on the left. In a traditional Hive deployment, you have the metastore, which contains all the information about the data stored in HDFS (stats, files, rows, tables, and columns), along with Hive Server 2 instances, which serve ODBC and JDBC requests from above.
Each Hive Server 2 instance has a Ranger plug-in that talks to the Ranger server, which authorizes access to Hive resources according to the Ranger policies created and managed on that server. On the other side, the Dremio cluster [shown on the right] is running in a YARN deployment composed of Coordinator and Executor nodes. In this mode, Dremio integrates with the YARN resource manager to secure compute resources in a shared multi-tenant environment. This integration allows enterprises to more easily run Dremio on a Hadoop cluster, including the ability to shrink or expand resources on demand.
Coordinator nodes are responsible for query planning, web user interface and handling client connections. Executor nodes are largely responsible for query execution and most of the heavy lifting.
The new item for Dremio is the Ranger plug-in, which is installed on the Coordinator node. It allows the Coordinator node to communicate with Ranger and to allow or deny access to HDFS resources based on the policies defined on the Ranger server instance. In this mode, client queries come in to the Dremio Coordinator via JDBC or ODBC, and each query’s requested access is checked against the Ranger policies. Based on that authorization, the Executor node(s) are permitted to access the source and execute the query for the user. The results are then sent back to the Coordinator so they can be presented in the UI.
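To make this flow more concrete, here is a minimal sketch of a client issuing a query over ODBC. It assumes a “Dremio” ODBC DSN is already configured on the client machine, that cjohan’s AD/LDAP credentials are valid, and that the table path matches how the Hive source is mounted in Dremio; all of these names are illustrative, not values from the tutorial environment.

```python
# Minimal sketch of the client path described above: the query arrives at the
# Dremio Coordinator over ODBC, is authorized against the Ranger policies,
# and is executed by the Executor nodes.
# The DSN, credentials, and table path are assumptions for illustration.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=cjohan;PWD=secret", autocommit=True)
cursor = conn.cursor()

# The Coordinator checks this request against Ranger before the Executors
# are allowed to touch the underlying Hive/HDFS data.
cursor.execute("SELECT * FROM production.vehicles LIMIT 10")
for row in cursor.fetchall():
    print(row)

conn.close()
```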
In the Production database that we will be using for this tutorial, we are going to focus on the four tables described above.
Additionally, we have created a security profile, hive_site_1_policies, in Ranger that maps directly to the permissions that we have previously defined.
In this part of the tutorial, we are going to make sure that the Ranger policies are being enforced by Dremio. As we can see in the image above, Cjohan and Bharri should only have access to Vehicles and Sales, and the groups Engineering and Marketing should only have access to Engines and Customers respectively.
As you can see, here we are logged in as Cjohan, and that username should have access to the Vehicles table and also the Engines table, since he/she is a member of the Engineering group.
Sure enough, it seems like the policy was enforced correctly, and we can observe that he has been granted access to the Engines table since he belongs to the Engineering group. However, an extra step is due to double-check the enforcement of the policy: let’s try to access a dataset that Cjohan has not been granted access to.
While logged in as Cjohan, we tried to access the Customers table, and Dremio presented us with an access denied warning. At this point we can conclude that the policy was enforced correctly. The same behavior would be expected if the same username tried to access the Sales table.
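If you want to repeat these checks outside the Dremio UI, a small script can run the same probe against each table and record whether Ranger allowed it. This sketch reuses the assumed ODBC DSN, credentials, and table paths from the earlier example, so treat it as a template rather than a drop-in test.

```python
# Hypothetical test harness for the enforcement checks above: run a trivial
# SELECT against each table as a given user and record whether Ranger let it
# through. DSN, credentials, and table paths are assumptions.
import pyodbc

TABLES = ["vehicles", "engines", "customers", "sales"]

def check_access(user, password):
    conn = pyodbc.connect(f"DSN=Dremio;UID={user};PWD={password}", autocommit=True)
    cursor = conn.cursor()
    results = {}
    for table in TABLES:
        try:
            cursor.execute(f"SELECT 1 FROM production.{table} LIMIT 1")
            cursor.fetchall()
            results[table] = "granted"
        except pyodbc.Error:
            # Dremio surfaces the Ranger denial as a query error
            results[table] = "denied"
    conn.close()
    return results

print("cjohan:", check_access("cjohan", "secret"))
print("bharri:", check_access("bharri", "secret"))
```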
Now let’s go ahead and try the same procedure using the Bharri username, which in this case should only have access to Customers and Sales.
Let’s head back to the list of available tables and select Customers.
Effectively, we can see that Bharri has access to the Customers table. Now let’s try a different one to double-check the validity of this policy.
As expected, and based on the security policy already pre-defined in Ranger, Bharri does not have access to the Engines table. Now, if we head back to the audit screen in Ranger, we can observe that the access attempts and tests we ran from the Dremio interface are recorded, reflecting the result (access granted or denied) for each of the events. These results map directly to the policies originally defined in Ranger.
Each one of these policies can be edited within Ranger to grant or deny access to the available users and groups mapped from the AD/LDAP server.
Now, let’s switch tracks for a second and talk about Data Reflections before we move into the next section of this tutorial. Data Reflections are materialized views of a dataset; in essence, they are a way for Dremio to pre-compute a physical representation of the data that is optimized for various query patterns.
Reflections largely replace the cubes, extracts, and aggregations that many users have to create when trying to access and accelerate their data. Because they behave like an index on the data, Data Reflections are invisible to the user: they don’t need to be managed and they are not exposed as a physical copy of the data, so they reduce the costly overhead of having to move and manage copies of data. Since they are created and maintained in the background, Dremio users benefit from them immediately without having to know they even exist.
What does this mean for a scenario where users don’t have access to certain data that was used to create a Reflection? Data Reflections are a powerful shared asset in Dremio; however, this does not mean that users who have been denied access to the underlying data via Ranger policies can view those datasets through a Data Reflection.
In our demo scenario, we’ve made some changes to the policies to reflect the following permissions chart:
| User | Member of | Vehicles (Engineering) | Engines (Engineering) | Customers (Marketing) | Sales (Marketing) |
|---|---|---|---|---|---|
| cjohan | Engineering | No | Yes | No | No |
| bharri | Marketing | No | Yes | Yes | Yes |
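A change like this can also be applied programmatically. The sketch below covers one of the revisions, granting the Marketing group SELECT on Engines, by fetching the existing policy through Ranger’s public REST API, modifying it, and writing it back; the URL, credentials, service name, and policy name are the same assumptions used in the earlier sketch.

```python
# Illustrative sketch: updating an existing Ranger policy to reflect the
# revised chart (here, giving the Marketing group SELECT on Engines).
# URL, credentials, service name, and policy name are assumptions.
import requests

RANGER_URL = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin-password")
SERVICE = "hive_site_1"
POLICY_NAME = "engineering_engines_select"

# Fetch the existing policy by service and policy name
resp = requests.get(
    f"{RANGER_URL}/service/public/v2/api/service/{SERVICE}/policy/{POLICY_NAME}",
    auth=AUTH,
)
resp.raise_for_status()
policy = resp.json()

# Add a policy item granting SELECT on Engines to the Marketing group
policy["policyItems"].append(
    {"groups": ["Marketing"], "accesses": [{"type": "select", "isAllowed": True}]}
)

# Write the modified policy back by its id
resp = requests.put(
    f"{RANGER_URL}/service/public/v2/api/policy/{policy['id']}",
    json=policy,
    auth=AUTH,
)
resp.raise_for_status()
```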
Now we want to demonstrate that Data Reflections adhere to Ranger security policies. Let’s head back to Dremio logged in as Bharri and try to access the Customers dataset, which according to the chart he/she should have access to.
The flame icon next to the table name indicates that this dataset was accelerated by a Data Reflection. When we navigate to the ‘Jobs’ screen, we can see that this query was indeed accelerated by a raw Reflection rather than being pushed down to the data source.
Now, if we log in as Cjohan and try to access the same dataset, we can confirm that the user does not gain access to that dataset through the Data Reflection that was generated.
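The same check can be reproduced outside the UI with the assumed ODBC DSN from the earlier sketches: query the Reflection-accelerated Customers dataset as each user and confirm that Bharri gets rows back while Cjohan is denied, just as the Ranger policy dictates.

```python
# Quick check that the Reflection-accelerated Customers dataset still honors
# the Ranger policies: bharri should be granted, cjohan should be denied.
# DSN, credentials, and table path are assumptions for illustration.
import pyodbc

def try_customers(user, password):
    conn = pyodbc.connect(f"DSN=Dremio;UID={user};PWD={password}", autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM production.customers LIMIT 5")
        return f"{user}: granted ({len(cursor.fetchall())} rows)"
    except pyodbc.Error:
        return f"{user}: denied"
    finally:
        conn.close()

print(try_customers("bharri", "secret"))
print(try_customers("cjohan", "secret"))
```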
In this tutorial, we did a complete evaluation of how security policies defined in Ranger and inherited by Dremio are effectively enforced for users. We also verified that these policies continue to be enforced when Data Reflections are used. This exercise allowed us to demonstrate how easy, yet trustworthy, the process is to accelerate your BI analysis while keeping your data safe through robust security technologies like Ranger when using Dremio.