Dremio Jekyll


Querying Hive 3 Transactional Tables with Dremio

Jan 21, 2020
Mark Johnson

Part 1: Configuring Dremio

Introduction

Hive 3 with its advanced transactions is pretty awesome. A user can update a Hive table directly within Hive without having to create a new extract containing journal entries or to containing a full refresh. The problem with Hive Transactions (ACID) though is that many supporting tools are unable to read the delta files created by new Hive Transactions.

This blog post the first in its series introduces How to setup Dremio to leverage Hive 3 transaction data for highly performant queries with low data latencies.

Using Dremio to query Hive 3 Transactional tables enables high performance queries as well as the ability to link Hive 3 tables with other data sources without a data pipeline.


SECTION 1: Setup Hive to support Dremio


In this section we will take a very small file as an example and go through the steps to prepare it for Dremio reporting. Later in the series we will utilize the TPC-DS generated test data and test queries to examine how.

The Github containing the examples reviewed in this blog are available here.

Step 1: Set the Dremio user and the HDFS Proxy settings

Our first step is to ensure that the ‘dremio’ service account is defined to the Hadoop nodes. To do this we will execute the following command on each of the nodes to create the ‘dremio’ service account.

1
adduser dremio -G dremio

Then we are ready to setup the Dremio service account as a HDFS proxy user so that it can have access to the warehouse and metastore files within Hadoop. To enable the Dremio service account, we will need to add the following properties to the custom core-site section in Ambari or add them to the core-site.xml file with values specified below:

Key Value
hadoop.proxyuser.dremio.users *
hadoop.proxyuser.dremio.groups *

To see all the proxy users and verify that we are ready for the next step enter the HDFS configs in Ambari and search for the word proxy to confirm that all the proxy users are correctly defined as shown below:

image alt text

Once all the above values are confirmed, go and restart all of the Hadoop services required as specified by Ambari or your favorite Hadoop admin tools.

The last step before we move on to configuring Dremio is to confirm that the Dremio coordinator has access to the defined Hive Metastore Host and its active port. In most instances, you will find that the port is 9083. To confirm on your instance from Ambari or hive-site.xml go into the Hive Configs and check the assigned value of ‘hive.metastore.uris’ is assigned as follows:

image alt text

NOTE: In the above example hdp2.local is the Hive Metastore service host location.

Step 2: Setup the Hive 3 Datasource in Dremio

Dremio will use its service account (which on my cluster is ‘dremio’) to speak with the Hive3 metastore.  

As we covered back in Step 1, we are defining Dremio for our service account in this example. As such all that is required to select the Hive 3.x data source.

image alt text

Once selected, give the data source a name (in this case we are highlighting that the DOAS flag ssd set to ‘false’, and then specify the location of the Hive Metastore Host (Not the HiveServer2 host). You will also want to make certain your Dremio coordinator hosts file contains the Hadoop data nodes as well.

image alt text

The next part of the Dremio Hive3 datasource configuration

In most instances, we will want the hive.server2.enable.doAsOnce the General Hive Datasource 3.x is defined, then you need to click on Advanced Options to define the ‘hive.server2.enable.doAs=false’ property to Dremio.

image alt text

Press the ‘Save’ button once the General and Advanced Options are completely entered (we will cover Using Reflections and Metadata refresh in another blog). You know everything is successfully configured when you drill into the ‘Hive3’ datasource just created and the screen displays all databases currently defined to this instance of Hive (which in this case is just ‘default’).

image alt text

Step 3: Setting up Permissions for Dremio to access your Hive3 tables

Everything is looking great at this point, you can see the Hive3 databases in Dremio and you keep drilling down in the Hive3 hierarchy until finally you hit a table and a message like “Failure while retrieving dataset [“Hive3”.”default”.acidtbl] appears on the screen. Taking a look at your Dremio instances server.log file the error is expanded saying:

Plugin ‘Hive3’, database ‘default’, table ‘acidtbl’, problem checking if table exists.

1
org.apache.hadoop.hive.metastore.api.MetaException: org.apache.hadoop.security.AccessControlException: Permission denied: user=dremio, access=EXECUTE, inode="/warehouse/tablespace/managed/hive":hive:hadoop:drwx------

The file permissions indicate that the /ware/house/tablespace/managed/hive file is assigned to the hadoop group, but has no “—” permissions for any other group members. Having been told that the file does in fact have permissions for the group. While in earlier Hive versions a -chmod 770 command could work, in the default Hive deployment, you will need to execute the following command to enable the ‘dremio’ user access (defining ‘dremio’ service account using Ranger is a different process and will get covered in another blog).

1
hdfs dfs -setfacl -R -m user::rwx,group::rwx,user:hive:rwx,group:hadoop:rwx  /warehouse/

Now that HDFS ACLs are defined to enable the ‘dremio’ service account access, we can dry again and this time correctly Dremio displays all the records in this sample set.

image alt text

Comparing this output to the same query in beeline we see the same result, so all looks good and Dremio is now configured to query the Hive 3 databases and tables in our system. It is time now to example how Dremio handles active transactions.


Section 2: Examining Hive Transactions


In the last section we went through the steps necessary to configure Hive 3 as a Dremio data source and concluded querying a table which consisted of multiple delta files. In this section, let’s dive deeper and review all of the transactional actions in Hive which resulted in the final result we saw at the end of Section 1 of this blog.

Authors Note: If you really don’t care about how Dremio handles the Hive Transaction Delta files, you can skip this section and spend the time freed up enjoying the benefits querying Hive tables using Dremio.

So you decided to continue reading, so here we go examining how Dremio reads the Hive Transaction Delta files. Starting with an empty table from beeline we enter the command:

1
INSERT INTO acidtbl (a,b) VALUES (100, "oranges"), (200, "apples"), (300, "bananas");

Hive 3 of course adds one new delta partition as shown below, containing two inserts as each INSERT command with Hive represents a new transaction.

1
2
3
4
[hdfs@hdp2 hive]$ hdfs dfs -ls /warehouse/tablespace/managed/hive/acidtbl
Found 1 items
drwxrwx---+  - hive hadoop 0 2020-01-20 22:40 /warehouse/tablespace/managed/hive/acidtbl/delta_0000006_0000006_0000
[hdfs@hdp2 hive]$

Turning to Dremio, we see that the query accurately presents the newly inserted records without the troublesome effort required by other tools to build a new cube or copy the Hive data into another on-prem or cloud storage location.

image alt text

The next step we will want is to demonstrate the Hive ‘UPDATE’ with the following command:

1
UPDATE acidTbl SET b = "pears" where a = 300;

You will notice that when we re-run the select * from acidtbl query it does not appear as though the bananas have been corrected to show pears.

image alt text

This is because of metadata refresh intervals. Looking at the Dremio Metadata refresh in the datasource settings shows that the refresh interval is at 1 hour and we executed the query after only a minute. This is a property you have complete control dependent on your reporting needs. Also, if different reports demand different refresh intervals, you can define multiple Hive data sources following the description from section 1 of this blog.

But because I am impatient, going to override the default metadata refresh and change it from 1 hour to 1 minute (No, I don’t think it is a good idea for such a frequent refresh rate). So Dremio would refresh the tables in this Hive 3 Dremio Datasource, the Dataset Discovery and Dataset details Fetch every property was changed from “Hour(s)” to “Minute(s)” and the Dataset details Fetch mode was changed to apply this change only to recently queried datasets in Dremio. The changes are shown below:

image alt text

Now when the query gets re-executed it correctly reflections the update operation.

image alt text

Next we need to validate that the Hive 3 DELETE transaction command works. Executing the following command in Beeline and then re-querying acidtbl in Dremio will show only two rows:

1
DELETE FROM acidTbl where a = 200;

image alt text

We have now gone through the steps to configure Dremio to query Hive 3 transactional tables and validated using a simple table as an example. In the next blog, we will explore queries using relatively large Hive 3 transactional tables.

I hope you learned something useful today. To learn more about Dremio visit our tutorials and resources, also if you would like to experiment with Dremio on your own virtual lab, go ahead and checkout Dremio University, and if you have any questions visit our community forums where we are all eager to help.

Ready to get started?