15 minute read · August 25, 2020

Dremio Benchmarking Methodology – How to Do It Yourself

Serge Leontiev

Serge Leontiev · Leading Technical Product Marketing and Enablement, Dremio

Recently, we announced the Performance and Efficiency benchmarks of Dremio versus Presto distributions. This blog post talks about the methodology that we used to execute benchmarking, provides guidance on how to conduct your own benchmark, and also makes available the tabular results of our benchmark tests.

Benchmarking Tools and Methodology

There are many ways to measure the performance and efficiency of general-purpose decision support systems, and TPC benchmarks are the de facto industry standard. We selected the TPC-DS benchmark to evaluate the performance and efficiency of Dremio and Presto distributions because it offers a wide variety of queries, from BI/reporting to ad hoc, and allows for the creation of datasets at a large scale that well represents data stored in cloud data lakes.

Our ultimate goal was to measure query performance at a large scale to determine the level of compute infrastructure and associated cost required to deliver a given level of performance and thus better understand the cost-efficiency of the query engines. It was also important to evaluate engine performance and efficiency for different types of queries, such as BI/reporting or ad hoc queries, at scale.

We executed benchmarks on two different scale factors, SF1,000 (1TB) and SF10,000 (10TB), to test the engines’ linear scalability. We used the same EC2 instance types for all benchmarking tests, gradually increasing cluster size from 4 to 20 nodes in 4-node increments to measure query execution time and cost of compute infrastructure with different node counts. Before each benchmarking run, we performed metadata refresh to achieve maximum engine efficiency.

The TPC-DS generated data resides in an AWS S3 bucket within the same cloud region as the EC2 compute instances to avoid network-related latency. We selected Apache Parquet as an ideal column-oriented file format for the cloud data lake store. It is the most pervasive open source columnar format in big data analytics and is broadly supported in the industry, including by technologies such as Dremio, Presto and Spark.

As a test suite, we selected Apache JMeter as an open and flexible framework that can run in a GUI and command-line mode. It allows the use of any JDBC driver and provides transparent and easily digestible test results. The output generated by JMeter is a CSV file that contains timestamp information for each query execution, query labels, elapsed time for each query, amount of sent and received data, latency, idle time and connect time. That provides plenty of data to calculate the total execution time, execution time by query type, and execution cost based on the hourly fee of an EC2 instance.

Now let’s explore how you can leverage those tools to build your own TPC-DS dataset and execute benchmarks in your environment.

TPC-DS Test Data Prep

TPC-DS offers a set of tools that can generate q99 queries and sample datasets. The tools can be downloaded from the TPC specification page and compiled for the target platform using the make command. The scale factors 1,000 and 10,000 are a good representation of small and medium-sized data lakes.

We recommend provisioning an EC2 instance with enough vCPU and memory to generate TPC-DS datasets. Before you start generating sample data, make sure that there is enough free space on your locally attached EBS volume. You need enough free space that can double the size of the selected scale factor because generated data needs to be massaged further.

The dsdgen utility generates 200-300GB/hour serially on a 2-3GHz x86 processor; therefore, you should consider running multiple parallel streams when generating large amounts of data. When parallel streams are used, the generated data needs to be merged. Note, the tool itself does not inject any column headers into the data output. To streamline the metadata generation process you have to add column headers to the datasets.

Below is a sample of the command lines that spawn data generation for an SF10,000 (10TB) dataset by using 8 parallel processes to speed up data generation. By default, dsdgen adds a column separator at the end of each row, making future processing more difficult, using -terminate n flag prevents adding a column separator at the end of the row.

./dsdgen -scale 10000 -f -dir <destination directory> -terminate n -parallel 15 -child 1 &
./dsdgen -scale 10000 -f -dir <destination directory> -terminate n -parallel 15 -child 2 &
…
./dsdgen -scale 10000 -f -dir <destination directory> -terminate n -parallel 15 -child 8 &

After sample datasets are generated, the following script should be executed to consolidate the outputs produced by the dsdgen utility into a single file per table with column names injected as the first row for each dataset. You can clone this GitHub project that contains the shell scripts and supporting files for this operation. Navigate under the Scripts > TPC-DS Data Prep folder to locate the data_conversion.sh file. Set the STAGE variable to point to the folder containing generated .dat files and set the DEST_STAGE variable to point to the folder where the script will store consolidated files. The script iterates over the list of tables stored in the table_list.txt file that represent TPC-DS schema, then processes each table by injecting the content of a predefined .txt file with column name headers, and then consolidates the content of all the .dat files from the stream into a single .csv file stored in the location defined by the DEST_STAGE variable.

#!/bin/bash
STAGE=/stage/tpc-ds/SF10000
DEST_STAGE=/stage2/SF10000
STREAMS=15
for i in `cat DatasetHeaders/tables_list.txt`
do
   echo 'Processing '$i;
   for (( n=1;n<=$STREAMS; n++));
   do
      filename=$STAGE/$i'_'$n'_'$STREAMS'.dat';
      csvfile=$DEST_STAGE/$i'.csv';
      if( test -f $filename)
      then
        echo 'Processing '$filename;
        if (( $n == 1 ));
        then
           cat DatasetHeaders/$i'.txt' $filename > $csvfile;
        fi
        cat $filename >> $csvfile;
      fi
   done
   echo 'Done processing '$i;
done

Data Conversion to Parquet

The sample dataset is now ready to be loaded into the cloud data lake. While storing data in the CSV format is highly efficient, the best practice is to convert data into a columnar format such as Apache Parquet. Dremio makes this easy with its native, CREATE TABLE AS (CTAS) command.

First, let’s move the generated TPC-DS datasets from your locally attached storage on your EC2 instance into the AWS S3 bucket. Use the AWS CLI tool to move CSV files from the destination folder defined in the previous step to the AWS S3 bucket, as shown in the following example:

aws s3 cp ./ s3://tpc-benchmarking/tpcds_sf10000/csv --recursive

After datasets are successfully moved to the S3 bucket, it’s time to connect the bucket as a source to your Dremio engine instance. If you haven’t provisioned Dremio yet, please follow these instructions. Log in to your Dremio UI, and on the left panel, click the “+” button next to the “Sources,” and select the new source type as “Amazon S3.” On the next screen provide a name for the new source, authentication information, and S3 bucket where the generated data resides.

image alt text
image alt text

You should be able to see the CSV files through the Dremio UI, and those files need to be promoted to physical datasets (PDS) in order to be able to use the CTAS command to create Parquet files. There are a couple of ways to do so: you can either do it through the Dremio UI and promote the datasets manually, or you can use Dremio REST APIs instead and automate the whole process. The following explores both options.

For manual PDS promotion, select individual CSV files from the list and navigate to the “Action” column on the far right, then click the “Format File” icon. Next, in the “Default Settings” window set the “Field Delimiter” as “Pipe,” the “Line Delimiter” as “LF - Unix/Linux,” and set the checkbox for “Extract Field Names” option. Then click the “Save” button and repeat the process for all CSV files.

image alt text
image alt text

If you decide to automate the PDS promotion process (the recommended approach), then refer to the Dremio REST API documentation to see the whole spectrum of capabilities. You can use the following script to execute sequential calls of the REST APIs that would promote CSV files to PDS and apply all necessary parameters. You also need to add your Dremio instance host, port, username and password to the authentication line, and update the URI path highlighted in red to reflect the actual path to the source location.

#!/bin/bash

# Authenticate and extract the token
TOKEN=$(curl -X POST  http://<host>:<port>/apiv2/login -H 'Content-Type: application/json'   -d '{"userName": "dremio", "password": "dremio"}'| jq -r '.token' )

# Iterate through datasets
for i in `cat tables_list.txt`
do

   DATAOUTPUT=$(curl -X PUT  http://<host>:<port>/apiv2/source/benchmarking/folder_format/benchmarking/tpcds_sf10000/csv/$i -H 'Authorization: _dremio'"$TOKEN"''  -H 'Content-Type: application/json'  -d '{"type": "Text", "fieldDelimiter": "|", "skipFirstLine": false, "extractHeader": true, "quote": "\"", "comment": "#", "escape": "\"", "lineDelimiter": "\n", "autoGenerateColumnNames": true, "trimHeader": true}');

   echo $DATAOUTPUT;
done

And finally, you need to convert the CSV files to Parquet file format. You can easily do this with the CTAS command described earlier. This sample SQL script contains the CTAS commands for each table in the TPC-DS dataset and performs all the necessary data type conversions for the successful execution of TPC-DS queries. You should replace the source and destination table locations in the script to match your environment and execute CTAS commands line by line.

Benchmarking Test Suite

As mentioned earlier in this blog, there are 99 queries in the TPC-DS benchmark that include a variety of BI/reporting and ad hoc queries and represent typical analytical workloads. We identified 58 queries that could be successfully executed on Dremio and Presto distributions without any modifications to query syntax. While 100% query success rate is achievable with logical query rewrites, we focused the benchmarking effort on the 58 unmodified queries because our primary objective was to have easily reproducible results.

The Apache JMeter testing suite is straightforward to use and allows you to execute benchmarking tests against any target system that supports the JDBC protocol. The tool also offers UI and command-line interfaces. We recommend running tests on a dedicated EC2 instance provisioned within the same region as Dremio. Use an EC2 instance type similar to Dremio with matching network bandwidth and enough vCPU/RAM for JMeter to run. We recommend a m5d.2xlarge EC2 instance type or higher.

After downloading and installing JMeter, you need to download the Dremio JDBC drivers and place them in the JMeter lib folder. We have provided the JMeter test plans that execute TPC-DS queries sequentially, and recommend performing metadata refresh before the benchmarking run or after the engine resize. You also need to modify the JDBC connection configuration in both JMeter test plans and provide a JDBC connection string to your Dremio instance, include the PDS path as a schema and set user credentials.

Due to the dataset scale factor, we recommend running JMeter in the command-line mode. Also, to capture the benefits of query acceleration with the Dremio columnar cloud cache (C3) feature, we recommend executing JMeter tests in the consecutive runs to capture benchmarks with the cold and warm C3. The following shell script can automate JMeter execution.

#!/bin/bash
WORKERS='twenty'
echo "First Run"
./jmeter -n -t ~/DremioTestPlanTPC-DS-metadatarefresh.jmx
./jmeter -n -t ~/DremioTestPlanTPC-DS.jmx
mv DremioSummary-tpc-ds.csv 'DremioSummary-tpc-ds-'$WORKERS'-workers-first.csv'
echo "Second Run"
./jmeter -n -t ~/DremioTestPlanTPC-DS.jmx
mv DremioSummary-tpc-ds.csv 'DremioSummary-tpc-ds-'$WORKERS'-workers-second.csv'
echo "Done"

Now you can simply execute the script above on your Dremio environment at the desired scale. At this point, you’re all set to execute TPC-DS benchmarks of Dremio by yourself using your cloud environment and leveraging the BI tool of your choice to analyze the test data. In our next blog, we will cover in detail the execution part of this exercise.

Here are the tabular data results containing the benchmarking data that we produced for Dremio, PrestoDB, PrestoSQL, Starburst Presto and AWS Athena at different scales. Download your copy of the “Dremio vs. Presto — Performance and Efficiency Benchmark” to learn more about how and why Dremio outperforms any distribution of Presto.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.