
Running SQL on HBase with Dremio

Intro

HBase is a popular distributed key/value store based on the Google BigTable paper. It is typically deployed in Hadoop clusters and provides a scalable, low-latency system for operational workloads where performance and scalability are critical. HBase differs from traditional relational databases in a number of ways, but the two most important are the data model and the query language. In terms of the data model, applications store their data in HBase in a denormalized representation in a single table with many columns. In terms of the query language, HBase does not support SQL and provides its own means of querying the data.

These two differences - the data model and the query language - make using HBase with traditional SQL-based tools like Tableau, or data science tools like Python and R, very challenging. Dremio makes this easy, and adds a number of other benefits that open HBase data up to many new use cases.

Installing Dremio

To get a sense for Dremio’s capabilities, you can download Dremio to your laptop or local machine by following the Quickstart Guide. For production use, you would deploy Dremio in a cluster, either standalone or provisioned and managed as a YARN application within the Hadoop cluster where HBase is running. You can read more about using Dremio with YARN.

Once you’ve installed Dremio, we suggest reading the tutorials Getting Oriented to Dremio and Working With Your First Dataset.

Connecting to HBase

Dremio supports many data sources, including Hadoop, relational databases, MongoDB, Elasticsearch, and HBase. To connect to HBase, you can click the + button next to sources:

Creating a new source

You’ll then be prompted with a set of different data source types, including HBase:

New source dialog

Once you select HBase, you’ll name your data source and provide the IP address for your ZooKeeper quorum:

Configuring HBase for SQL access

Click Save and then you’re finished setting up your connection to HBase. Feel free to connect other systems so that you can join/blend data between your HBase cluster and other sources.

HBase Schema

HBase stores data in tables, but an HBase table is very different from a traditional relational table. Each row spreads data across many columns (potentially millions or more), and the number of columns can vary from record to record. Groups of columns are organized together in column families.

To use a trivial example, a simple HBase table storing data about customers might have the following structure:

Row Key | Customers                                                     | Orders
        | Name   | Address               | City          | State | Zip   | Order No | Price
878     | Dremio | 883 N. Shoreline Blvd | Mountain View | CA    | 94043 | ABC123   | $599

In this model, Customers and Orders are column families. Dremio treats the relationship between a column family and its columns as a nested data structure. When you connect Dremio to your HBase cluster, each HBase table is displayed with a column for the row key and one column per column family. Within each column family you will see key/value pairs for each column, similar to how JSON stores key/value pairs.
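To make this concrete, here is a rough sketch of how that example table could be queried directly in Dremio before any cleanup. The source name (hbase), the table name (customers), and the exact quoting of the nested field names are assumptions for illustration; STRING_BINARY is the conversion function used later in this tutorial, and the dot notation reflects how Dremio exposes nested fields:

-- Sketch only: assumes an HBase source registered as "hbase" containing
-- the customers table above. Each column family appears as a nested
-- column, and values come back binary-encoded, so they are converted
-- with STRING_BINARY.
SELECT
    STRING_BINARY(t.row_key)            AS row_key,
    STRING_BINARY(t."Customers"."Name") AS customer_name,
    STRING_BINARY(t."Orders"."Price")   AS order_price
FROM hbase.customers AS t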

Building Your Schema in Dremio

Here’s another example using some simple test data. I connect to this HBase cluster and see a table called ‘test’:

test physical dataset in HBase

When I open this physical dataset, I can see two columns, the row key and a column family called ‘cf’:

Viewing the physical dataset from HBase

This is the data “as is” returned from HBase. To make the data easy to query with Dremio, we’re going to go through a one-time setup where we create a virtual dataset that applies all transformations in a virtual context. Once this work is complete, any tool can take advantage of the new logical data model without copying the data out of HBase or moving it into a new environment.

Note: HBase support is currently in beta. We think the following steps are a little tedious at this time and plan to make the experience better. Keep in mind this is a one-time effort. Do let us know on community.dremio.com if you have suggestions about how to make our integration better.

The row_key column returns the binary values stored in HBase. To display a readable, non-binary value, we need to convert it to a string. We can do this in SQL or using Dremio’s graphical interface. Let’s start with the graphical interface; the SQL equivalent is shown after these steps.

Typically you can convert from the physical data type to a new type by clicking the data type menu on a column. Unfortunately, with binary types you cannot access this menu at this time, so we’ll use the calculated field option instead by clicking the data transformation menu on the right of the column:

Creating a calculated field in HBase

Now you’ll see a dialog for creating your new column. Start by double-clicking the column on the left to highlight its name, then go to the right and type ‘binary’ into the search box to find all functions whose names include ‘binary’:

Converting the binary data to string

You should see two functions. In this case we want to transform from a binary value to a string, so click the plus next to STRING_BINARY. This should wrap your column name with the function:

Applying the STRING_BINARY function

There’s no need to preserve the original binary value in our virtual dataset, so leave the check box for Drop Source Field checked. After you click Apply you will return to the dataset viewer to see the string value displayed - row1, row2, row3:

Finishing up the virtual dataset
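For reference, the same conversion can be written directly in SQL. This is a sketch that assumes the HBase source was named HBase and the physical dataset is ‘test’, as in the screenshots; adjust the path to match your own source:

-- Sketch: convert the binary row key to a readable string in SQL,
-- keeping the cf column as-is for now.
SELECT STRING_BINARY(row_key) AS row_key, cf
FROM HBase."test"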

The second column has a nested data type icon on the data type menu. You can see that for each of the rows there are nested columns for age and name, each with a binary-encoded value. We can use Dremio to extract each of these columns by clicking on the three dots on the right of one of the rows:

Viewing the value of the nested field

Then click on one of the columns and select ‘Extract’:

Extracting one of the fields from the nested value

Now you will enter the Extract dialog to configure your new column. Name it age and don’t drop the source field:

Configuring the name field with the extract dialog

If you’d like to preserve the column family name in your column names, you can prepend it with an underscore, such as “cf_age.” Follow the same steps for the name field. When you’re finished, your virtual dataset should look like the following:

Viewing the virtual dataset at this point

At this point we need to convert the binary values to strings, as we did for the row_key column. We can follow the same steps using the calculated field dialog, or we can edit the SQL manually. To edit the SQL, click the SQL Editor arrow:

Editing the SQL to convert from binary

We can take the STRING_BINARY function we used on row_key and add it to the other columns. At this point we can also delete the cf column by removing it from the SQL. Now you should see:

Configuring this virtual dataset for SQL access
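In SQL form, the virtual dataset at this point should look roughly like the following. This is a sketch, again assuming the source is named HBase, the physical dataset is ‘test’, and the column family fields are addressed with dot notation; your generated SQL may differ slightly:

-- Sketch of the virtual dataset after extracting the nested fields,
-- converting them to strings, and dropping the original cf column.
SELECT
    STRING_BINARY(t.row_key)     AS row_key,
    STRING_BINARY(t."cf"."age")  AS cf_age,
    STRING_BINARY(t."cf"."name") AS cf_name
FROM HBase."test" AS t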

Feel free to close the SQL editor if you’d like. One last step - let’s convert the age to an integer. We can do this by clicking the data type menu on the left of the column:

Converting the string to an integer
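If you prefer to stay in the SQL editor, the same change is a CAST around the age expression. Continuing the sketch from above, with the same assumed names:

-- Sketch: the same statement with the age value cast to an integer.
SELECT
    STRING_BINARY(t.row_key)                     AS row_key,
    CAST(STRING_BINARY(t."cf"."age") AS INTEGER) AS cf_age,
    STRING_BINARY(t."cf"."name")                 AS cf_name
FROM HBase."test" AS t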

Now you can save your virtual dataset in your favorite space and give it a name like ‘people’.

Running SQL Queries

Now you can run any SQL query on this data. For example:

SELECT name, age
FROM "@kelly".people
WHERE name IN ('brian','candace','jim','janet') AND age > 6
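Because Dremio provides the SQL execution engine, you are not limited to operations HBase supports natively; aggregations and sorting work as you would expect. For example, here is a sketch against the same ‘people’ virtual dataset:

-- Sketch: an aggregation HBase cannot run natively, executed by Dremio.
SELECT age, COUNT(*) AS person_count
FROM "@kelly".people
GROUP BY age
ORDER BY age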

Conclusion

In this tutorial we connected our Dremio cluster to an HBase cluster, built a schema in Dremio, and ran a simple query. Dremio makes it simple to connect any BI tool, Python, R, or other SQL-based analytics tool to HBase. In addition, Dremio extends the query processing capabilities of HBase to support the full SQL language - Dremio will “push down” the parts of a query that HBase can perform and augment those capabilities with its own SQL execution engine. Finally, HBase is optimized for low-latency, write-intensive workloads that touch a small number of records per operation. Dremio can accelerate query processing by optimizing data for analytics using Data Reflections, which can significantly improve the performance of analytical queries and offload that work from your operational HBase cluster.