Dremio Jekyll

High Performance Parallel Exports

Introduction

When using standard SQL, we can create a table by issuing the CREATE TABLE statement, however, this will create an empty table that we then have to insert data into. Starting in the 3.0 release, Dremio provides the formal ability to export virtual datasets (VDS) to their respective sources (S3, ADLS, HDFS, MapR-FS and NAS) using standard SQL-based commands. Create Table As Select or CTAS will allow users to create and populate a virtual dataset and write it to specific sources that they have WRITE access to in a single SQL statement.

In this tutorial, we will help you understand the use cases that this feature will help you with, then we will explore how you can enable it in any of the data sources that support this ability today. Additionally, we will execute the CTAS command and check out the results.

Assumptions

To get the most of this tutorial, we recommend that you first follow getting oriented to dremio and working with your first dataset tutorials. Also, please keep in mind that at the moment of writing of this tutorial, the only data sources that support CTAS are: Amazon S3, ADLS and HDFS, MapR-fs and NAS.

The CTAS Use Case

The use of high performance parallel exports allow users to create, reorganize, download and export large (>1 million rows) or small datasets from any source of data into any of the CTAS-supporting data sources within Dremio. When users employ the CTAS statement, Dremio will then store the results of the query into one or many parquet files (depending on the size of the source) on which users have full control on the naming, destination path and security rules. Like any dataset created within Dremio, the results of the high performance parallel exports will be cataloged and made searchable so users can easily find, share, and collaborate.

Enabling CTAS on a Source

To enable CTAS on the supported sources, first navigate to the “Sources” perspective and click on the “+” icon.

image alt text

Then, select one of the CTAS supported data sources (Amazon S3, ADLS, HDFS and NAS)

image alt text

For purposes of this tutorial, we will work with Amazon S3; once we select it, click on “Advanced Options”

image alt text

Then, select “Enable exports into the source (CTAS and DROP)” and click “SAVE”

image alt text

The steps we have performed up to this point will enable the functionality on the source connector using impersonation for permissions to allow WRITE access to the source via SQL-based commands.

image alt text

Once the datasource has been added, you will be able to see all the buckets available in your S3 environment

image alt text

Using CTAS on a Source

When creating a CTAS on a source, Dremio will generate a parquet file (or several parquet files, depending on the size of the datasource) that contains the results of the “SELECT” statement and will place this file in a directory inside the source that you will be working with. The steps to create this directory is somewhat different for each one of the supported platforms (S3, HDFS, ADLS, MapR-FS). In this tutorial, we will be using an S3 bucket; please refer to your data source’s documentation to verify the steps that you need to follow to create and grant permissions to the new directory where the CTAS results will be stored. A great advantage of this, is that you can leverage the file system permissions to ensure secure access to the data.

In this tutorial, I will be working with the San Francisco Police Department incidents dataset that I already have available in one my S3 buckets.

image alt text

After reviewing the data inside the VDS, I will go ahead and write the CTAS statement to create data export using the following syntax:

1
2
3
4
5
CREATE TABLE [S3 source].["bucket name"].[file_name]
[HASH PARTITION BY (column, . .) ]
[LOCALSORT BY (column) ]
AS
SELECT [statement]

In this case, I want to create a subset of 100 records from the Incidents dataset and store it into my “dremio.uploads” bucket using the following statement.

1
2
3
4
5
CREATE TABLE jeffs3."dremio-upload".incidents100_subset
as
SELECT *
FROM "@lucio".incidents
Limit 100

To do so, simply click on the “New Query” icon on the top of your working space

image alt text

Then type the SQL statement in the SQL editor field and click “Run”.

image alt text

Once the query runs successfully, Dremio will provide a brief report of the results. Here we can see that the query provided 100 records and the results have been stored in a parquet file in the path that was specified in the query.

image alt text

Visualizing the results

Now, wether we want to verify or make use of the results that the query generated, we will need to head out to the datasource that we specified in the query. In the list of “Sources” select the one that you are working with and then from the list of available buckets, select the one that you specified in the query.

image alt text

Here first the results we will presented in form of a parquet file, click on it.

image alt text

At this point you will be able to see a sample of the data that is contained in the file. Click the “Save” button.

image alt text

Dremio will immediately execute a query to preview the results. Notice that at this point we are working directly on the parquet file that was created using the CTAS statement.

image alt text

At this point we can curate the data, add catalog information, save the dataset somewhere in our local space and/or start analyzing the data using our BI tool of choice.

Joining Data from different sources

Using CTAS in Dremio still allows to enjoy the flexibility of connecting to and joining data from multiple and different data sources. In this next use case, I’ve been presented with two different datasets, one flat CSV file that contains the user ID number and name for a list of employees, and another file stored on S3 that contains the ID and biographical information for these employees.

Normally, in a situation like this, I would have to figure out a way to join these two datasets using a data integration tool. However, let’s see how easy we can join these two sets into a single file that we will create into S3 using the CTAS feature in Dremio.

Uploading the CSV file

First, I will upload into Dremio the CSV file that was provided to me. To do this, simply click on the purple grid icon on the top of your working space.

image alt text

Then navigate to the local directory where the file is located, or you can also drag and drop the file here. Click “Save”

image alt text

In the next screen, from the drop-down menu select the appropriate format for your file. In my case, I will select “Text(delimited)”

image alt text

Then, make the appropriate changes needed to the data is read correctly into Dremio from the file. I’ll change the line delimiter as well as select the “Extract field names” option. Click “Save”

image alt text

The new dataset should now be listed on your working space.

image alt text

Now for the rest of the data. I know that it is stored in a dataset called “pbsample” so I’m going to search for it in Dremio.

image alt text

And here I can see that this dataset contains the missing fields that I need to complete my employees dataset.

image alt text

I will save this dataset under the name of “id_noname” under my working Space.

image alt text

Now, to complete the task, I want to join “ID_name” and “ID_noname” datasets and save them into a new table that I will create using Dremio’s CTAS feature. To do that, click on the “New Query” icon.

image alt text

I will use the CTAS syntax and join the tables at the same time in the SELECT statement as follows:

1
2
3
4
5
6
7
8
9
CREATE TABLE jeffs3."dremio-upload".employees_list_final
as
SELECT nest.Gender AS Gender, nest.Age AS Age, nest."Height(in)" AS "Height(in)", nest."Weight(lbs)" AS "Weight(lbs)", nested.ID AS ID, join_ID_name.ID AS ID0, join_ID_name."""Name""" AS """Name"""
FROM (
  SELECT ID, Gender, Age, "Height(in)", "Weight(lbs)"
  FROM "@lucio".id_noname
) nest
 INNER JOIN "@lucio".ID_name AS join_ID_name ON nest.ID = join_ID_name.ID
 

Once the query is in, click “Run”

image alt text

Once the query runs successfully, you will be able to see the location of the new file under the “Path” field in the results grid.

image alt text

To see the results, navigate to the indicated path within Dremio and open the parquet file. Once you save it, Dremio will present you with the option to start working with the joined dataset.

image alt text

To make this dataset more presentable and usable for my team, I will make some changes as well as adding the email addresses for this list of employees.

First, I will drop the extra ID field by selecting the dropdown menu and selection the “Drop” option.

image alt text

Notice that in the results, there is a name that needs a bit of clean-up. I will go ahead and clean that fieldname and also the value ‘ “Alex ‘ so we don’t come across any string issues later.

image alt text

From the dropdown menu for the “name” field select “Replace Text”. Here I will be cleaning only one value so I can get away with replacing anything that contains double quotes with the value that I want. However for more complex transformations you can use any supported SQL function in the SQL editor or you could also use regular expressions. image alt text

Once I’m satisfied with the results, I can click “Apply.” Notice that Dremio automatically created the query used to alter the value and shows it on the SQL Editor.

image alt text

One last thing before this list is ready, we need to add email addresses to the names. This can simply be done by adding a new field. Click on the “add field” icon and then concatenate the “@email” string to the name field.

image alt text

Before applying the changes make sure that you preview the results to make sure that they are what you are expecting.

image alt text

Now we just need to save the dataset and it will be ready to be shared and used by other members of my team.

image alt text

The same way we have joined data from a local CSV file and an S3 bucket and export it to a different S3 bucket using the CTAS statement, you can join and export data from any of the supported data sources in Dremio into parquet files in S3.

In this new scenario, I have a dataset that contains Yelp reviews stored in Elasticsearch and the rest of the information for the same Yelp reviews is stored in a MongoDB source. Following the previous steps I can use a CTAS statement to create a parquet file on S3 that contains the joined data from these two sources, as follows:

image alt text

We can also verify the origin of the data using Dremio’s data lineage feature

image alt text

Navigating to the datasource allows us to verify that the joined data has been placed in the indicated S3 bucket

image alt text

If we have the proper access to our storage environment, we can also double check where the parquet files have been created.

image alt text

Last but not least, we can query the newly created dataset to explore its contents.

image alt text

Dropping a table

If you recall, at the beginning of the tutorial when we enabled the CTAS feature, enabling this feature also enabled the option to DROP the tables that were created in that source.

image alt text

Once you are done using the table that have been created using the CTAS statement, you can drop it by simply issuing the DROP TABLE statement against the same path where the table is located, as follows:

image alt text

Notice that I’m giving Dremio the path of the table that I originally created using the CTAS statement, and I’m not providing the path to the parquet file inside the bucket.

To verify that the table is dropped, we can navigate to the data source path given in the DROP statement and make sure the parquet files have been removed.

image alt text

In addition, if you have access to your AWS console, you can check that the bucket has been removed from your environment as well.

Conclusion

In this tutorial we walked through the steps of using Dremio’s new High Performance Parallel Export functionality. Users can use this feature to write back to sources where they have “Write” access to. In this case we used Amazon S3 and we learned how Dremio stored the results of the CTAS statement as a parquet file on the S3 bucket of our choice. Additionally, we were able to use the Create Table statement along with a Join statement to create a dataset composed by two different data sources and save the results directly into an S3 bucket for later use.

We hope that you enjoyed this tutorial, stay tuned to learn more about can you use Dremio to gain more value from your data, faster.