August 4, 2020
Creating a Cloud Data Lake with Dremio and AWS Glue
Dremio 4.6 adds a new level of versatility and power to your cloud data lake by integrating directly with AWS Glue as a data source. AWS Glue combines capabilities similar to an Apache Spark serverless ETL environment with an Apache Hive external metastore. Many users find it easy to cleanse and load new data into the data lake with AWS Glue, and the metadata store in the AWS Glue Data Catalog greatly expands the scale of data you can manage while automating time-consuming data management and preparation tasks.
AWS Glue crawlers discover the schema for datasets available in Amazon S3, and the AWS Glue Data Catalog provides a persistent and versioned metadata store, allowing users to focus on data analytics rather than the movement, cleansing and administration of data. Dremio integrates seamlessly with AWS Glue to provide automatic access to the full range of enterprise data without additional data preparation.
Why AWS Glue?
Many organizations already understand the benefits of using AWS Glue as the data transformation and metadata storage solution for the data lake. If your organization has committed to using AWS Glue, you may already have:
- An extensive collection of sophisticated ETL jobs and event-driven pipelines
- A long list of data crawlers for standard and custom data formats
- An extensive metadata library of databases and tables in the AWS Glue Data Catalog
For these organizations, Dremio integrates seamlessly with the existing AWS Glue framework to provide fast, interactive query performance. Dremio builds on AWS Glue to give a data lake user experience more like a data warehouse — enterprise data easily within reach for dashboards and reports. Dremio also provides integration with best-in-class analysis tools such as Tableau, Power BI, Jupyter and others.
For many other organizations, AWS Glue is new, and the cloud data lake journey is just beginning. The first versions of the cloud-based data lake may consist of Dremio integrating directly with Amazon S3 and other relational or HDFS sources. The data lake may not require recurring ETL or schema discovery across vast and diverse datasets. However, this is the time to plan for the common problems of data maintenance and organization that arise as the data lake grows in size and complexity.
AWS Glue Features
AWS Glue data crawlers perform the work of connecting to your source S3 buckets and collecting metadata. Built-in crawlers support Parquet, ORC and CSV files, so Dremio users need never spend time defining the low-level details of esoteric data file formats. You can run crawlers on demand from the AWS Management Console or trigger them based on an event, and then integrate a Dremio update using the Dremio REST API.
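As a minimal sketch, the same on-demand run can be scripted with boto3; the crawler name and region below are assumptions, and a real pipeline might wrap this in an AWS Lambda function triggered by an S3 event:

```python
import time
import boto3

# Launch a Glue crawler on demand and wait for it to finish.
# The crawler name is hypothetical.
glue = boto3.client("glue", region_name="us-east-1")
CRAWLER_NAME = "la-crime-crawler"  # assumption: a crawler already defined in Glue

glue.start_crawler(Name=CRAWLER_NAME)

# Poll until the crawler returns to the READY state.
while True:
    state = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)

# At this point the Data Catalog is up to date; a follow-up call to the
# Dremio REST API (shown later in this article) can refresh Dremio's metadata.
```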
As AWS Glue collects metadata, it discovers the data schema and converts types in mixed-type columns as necessary across the dataset. AWS Glue also maintains a comprehensive schema version history to permanently document how your data has changed over time.
The AWS Glue Data Catalog contains database and table definitions with statistics and other information to help you organize and manage the cloud data lake. By delegating the collection and maintenance of metadata to AWS Glue, Dremio can query massive cloud-based datasets, giving you the power to create cloud data lakes on par in size and scope with on-prem environments supported by an external Hive-based metastore.
Setting Up AWS Glue
AWS Glue organizes metadata into tables within databases. Users start by defining databases, and AWS Glue adds the tables to these databases automatically when running crawlers; alternatively, users can create tables manually using the AWS Management Console. Each table exists in only one database.
The crawler definition includes the S3 bucket hosting the dataset. Dremio reads the S3 bucket from the resulting Data Catalog and refers to the S3 data files when running queries.
The crawler needs an IAM policy that grants permission to access the host S3 bucket and the contained data. Users can define a role that includes an IAM policy created explicitly for the crawler, or the crawler creation wizard will update a standard IAM policy during crawler definition.
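For teams that prefer to script the setup, a hedged sketch of the equivalent boto3 calls is shown below. The database name matches the example used later in this article, while the crawler name, IAM role and S3 path are placeholders:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the database that will hold the crawled tables.
glue.create_database(DatabaseInput={"Name": "la-crime-db"})

# Define a crawler that scans a hypothetical S3 prefix and writes its
# discoveries into la-crime-db. The IAM role must grant the crawler read
# access to the bucket (for example the AWSGlueServiceRole managed policy
# plus S3 permissions for the bucket).
glue.create_crawler(
    Name="la-crime-crawler",                                 # assumption
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # assumption
    DatabaseName="la-crime-db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/la-crime/"}]},  # assumption
)
```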
The AWS Management Console shows the status of each crawler. From the console, crawlers can be launched on demand to trigger the discovery of recent or unexpected changes in the dataset, possibly occurring between scheduled crawler executions. The console also provides a pointer to Amazon CloudWatch, where the logs for each run are available for review.
After the crawler completes, the AWS Management Console shows metadata and statistics for the dataset, such as the number of records and the average record size. In the example below, the crawler has completed discovery in the mult_4k table in the la-crime-db database. The Parquet files of this dataset include about 32 million records. As you will see later, the number of Parquet row groups is a critical factor in dataset scaling.
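The same statistics can also be read programmatically from the Data Catalog. The sketch below uses the la-crime-db database and mult_4k table from this example; the parameter keys written by crawlers can vary by classifier, so treat them as assumptions:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Read the table definition the crawler produced. Crawlers typically store
# record counts and sizes in the table's Parameters map; use .get()
# defensively because the exact keys are not guaranteed.
table = glue.get_table(DatabaseName="la-crime-db", Name="mult_4k")["Table"]
params = table.get("Parameters", {})

print("records:", params.get("recordCount"))
print("avg record size:", params.get("averageRecordSize"))
print("location:", table["StorageDescriptor"]["Location"])
```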
After fully configuring AWS Glue and populating the Data Catalog, configure Dremio to read the catalog and query these datasets.
Configuring Dremio
Dremio supports the AWS Glue Data Catalog as an out-of-the-box data source, and Dremio administrators can configure a connection from Dremio to AWS Glue in minutes using the Dremio UI. Dremio users will immediately recognize the simple steps to configure an AWS Glue Catalog source because they are nearly identical to those for other Dremio sources. With a few clicks, users can configure metadata and reflection refresh, along with permissions to access data.
Once connected, Dremio reads the metadata from the AWS Glue Data Catalog and automatically creates a folder for each Glue database and a physical dataset (PDS) for each contained table. No additional configuration is required to begin querying data within Dremio.
Querying Massive Datasets
The Apache Parquet project provides an excellent overview of the Parquet file format. The format is adjustable in several dimensions, including the size of its row groups. A row group is a logical horizontal partitioning of the data into rows, and it consists of a column chunk for each column in the dataset. Dremio also provides guidance and best practices for tuning row group size for performance.
Many existing ETL tools provide the ability to configure the size of row groups in their output data files. For example, the default Spark Parquet row group size is 128MB. If your ETL framework does not include information about the row group size, standalone tools are available. Parquet-tools (MIT license) and parquet-metadata (Apache 2.0 license) are two open source tools which read Parquet files and provide schema or other metadata.
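If you prefer a scripting approach, the pyarrow library can both control row group size on write and report the row group layout of an existing file. The snippet below is a small illustration with made-up data; note that pyarrow expresses row group size in rows, while Spark's default is expressed in bytes:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table with an explicit row group size (rows per row group),
# then read the file metadata back. File and column names are illustrative.
table = pa.table({"area_name": ["Central"] * 1_000_000})
pq.write_table(table, "example.parquet", row_group_size=128_000)

pf = pq.ParquetFile("example.parquet")
print("row groups:", pf.num_row_groups)
for i in range(pf.num_row_groups):
    rg = pf.metadata.row_group(i)
    print(f"  group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
```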
Most query processing tools limit the total number of Parquet row groups contained in a dataset. For a dataset hosted in a cloud object store like S3 or a traditional file system, Dremio provides an error diagnostic when the PDS contains more than 60,000 row groups. Most use cases never hit this limit, but some users may need to support massive datasets, especially when rehosting data to S3 from legacy data warehouses or on-prem HDFS clusters.
By adding an external metadata catalog, either AWS Glue or a Hive data catalog, Dremio supports datasets with up to 300,000 row groups — a dramatic increase from the size of a standard file system dataset. The Dremio query profile confirms the number of row groups processed by each thread of query execution, which is very useful in troubleshooting situations.
For this dataset integrated with AWS Glue in the la-crime-db database above, the sum of the NUM_ROW_GROUPS in the query profile shows that the dataset contains 63,435 row groups, which cross-checks with standalone Parquet tools.
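A quick way to reproduce that cross-check is to sum the row group counts over every Parquet file in the dataset, for example with pyarrow. The local path below is an assumption; S3-hosted files could be listed with pyarrow.fs.S3FileSystem or a local copy instead:

```python
from pathlib import Path
import pyarrow.parquet as pq

# Sum the row groups across every Parquet file in a dataset directory.
total = sum(
    pq.ParquetFile(p).num_row_groups
    for p in Path("/data/la-crime/mult_4k").rglob("*.parquet")
)
print("total row groups:", total)  # compare with NUM_ROW_GROUPS in the query profile
```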
On the jobs page, Dremio also shows the total number of records in the dataset, correlating to the statistics reported by the AWS Glue crawler.
Using Dremio Reflections
A well-designed Dremio raw or aggregation reflection can dramatically improve query performance. Dremio supports the creation of reflections on datasets from AWS Glue, precisely like any other data source. Typically, Dremio reflections are highly beneficial with AWS Glue data sources in several situations:
- Needle-in-haystack queries on CSV sources. Raw reflections preserve row-level detail in an optimized Parquet representation of the data, which is dramatically faster to read and query than the original CSV dataset.
- Compute-intensive business logic in virtual datasets. Upper-level datasets of the semantic layer may contain business logic required by dashboards and reports, where performance is critical.
- OLAP workloads on any source. Aggregation reflections precompute measures for all dimension combinations selected in the reflection definition and significantly improve the efficiency of GROUP BY, SUM, AVG and similar queries. Aggregation reflections also optimize distinct value expressions and store approximate count-distinct measures on selected columns.
This example illustrates a GROUP BY query against the la-crime-db dataset integrated with AWS Glue.
The Dremio reflection UI was used to create an aggregation reflection on the count measure of the time_occ column when grouped by the area_name dimension.
This aggregation reflection reduces query times from several minutes to less than 1 second.
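To make the example concrete, a query of this shape could be submitted from any SQL client. The sketch below uses pyodbc with an assumed ODBC DSN and an assumed name for the Glue source in Dremio:

```python
import pyodbc

# "Dremio" is an assumed ODBC DSN configured with the Dremio ODBC driver;
# "glue-source" is an assumed name for the AWS Glue source in Dremio.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# The aggregation reflection on area_name / COUNT(time_occ) can accelerate
# a query of this shape without any change to the SQL.
cursor.execute(
    'SELECT area_name, COUNT(time_occ) AS crime_count '
    'FROM "glue-source"."la-crime-db".mult_4k '
    'GROUP BY area_name'
)
for area_name, crime_count in cursor.fetchall():
    print(area_name, crime_count)
```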
Dremio recommends that you survey the last few weeks or months of query history to determine the dimensions and measures typically used in queries, and include them in the reflection. Dremio has documented several best practices for the creation and use of reflections, such as maximizing the coverage of each reflection, in its Data Reflections Overview and Best Practices white paper.
Managing Dataset Evolution
“Data that is loved tends to survive.” – Kurt Bollacker
Most datasets are indeed long-lived and updated frequently with new time periods or events. Unfortunately, updates often bring errors or changes in the types of data elements within the same column of the same table. Mixed-type columns can ruin aggregations or comparisons, and some tools completely ignore values in a mixed-type column that differ from the majority type.
To make users aware of mixed types, Dremio provides an error message that identifies the incompatible types and the affected column. In this example, DECIMAL(19,8) and DECIMAL(18,6) are incompatible because they differ in both precision and scale.
In most use cases, mixed types are discovered and automatically resolved by the AWS Glue crawler during the metadata discovery phase. In the example below, AWS Glue automatically promotes both decimal types to a third, compatible decimal type. With both types promoted to a common unifying type across the schema, Dremio queries the dataset quickly.
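If you want to confirm what the crawler decided, the resolved column types can be read back from the Data Catalog; in the sketch below the column name is hypothetical:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Inspect the column types the crawler resolved for the table.
columns = glue.get_table(DatabaseName="la-crime-db", Name="mult_4k")[
    "Table"]["StorageDescriptor"]["Columns"]
for col in columns:
    if col["Name"] == "fine_amount":            # hypothetical column
        print(col["Name"], "->", col["Type"])   # e.g., the unified decimal type
```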
Updating the Dataset
As datasets evolve, they also usually grow in records or number of columns. Both Dremio and AWS Glue accommodate data changes over time with easily configured metadata discovery. The AWS Glue crawler runs on demand, on a predefined schedule ranging from hourly to monthly, or on a custom cron expression. After each run, the metadata summary screen shows the date of the last metadata discovery.
Dremio also requires a metadata refresh to recognize changes in the dataset and make them visible in your physical and virtual datasets. After running the AWS Glue crawler, run the command:
ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA
The dataset path is the sequence of source name, database name and table name defined by the AWS Glue metastore and made visible in Dremio. As with AWS Glue, users can run this refresh on demand or on a regular schedule configured on the metadata refresh page of the source configuration UI. When running the ALTER PDS command from the Dremio UI query window, Dremio confirms successful metadata refresh.
Users can submit the ALTER PDS command to Dremio through the Dremio query window, a JDBC or ODBC client, or the SQL interface of the Dremio REST API. More information on Dremio SQL commands is available in the Dremio documentation.
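As a hedged illustration, the sketch below submits the refresh through the SQL interface of the REST API. The host, credentials and dataset path are assumptions, and the endpoint paths should be checked against the documentation for your Dremio version:

```python
import requests

DREMIO = "http://localhost:9047"  # assumption: Dremio coordinator URL

# Log in to obtain an auth token.
token = requests.post(
    f"{DREMIO}/apiv2/login",
    json={"userName": "admin", "password": "example-password"},  # assumptions
).json()["token"]

headers = {"Authorization": f"_dremio{token}"}

# Submit the metadata refresh; the source name in the dataset path is an assumption.
requests.post(
    f"{DREMIO}/api/v3/sql",
    headers=headers,
    json={"sql": 'ALTER PDS "glue-source"."la-crime-db".mult_4k REFRESH METADATA'},
)
```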
Conclusion
Dremio now utilizes AWS Glue to expand your success in the cloud data lake. AWS Glue data crawlers automatically discover the schema of your datasets, and the AWS Glue Data Catalog presents a persistent metadata store easily shared with Dremio and other tools. Dremio leverages AWS Glue to manage massive and evolving datasets, allowing you to focus on data queries and analytics rather than data movement, cleanup and administration. Dremio data reflections accelerate datasets available through AWS Glue or other sources and provide sub-second query performance for many use cases.
If you face similar challenges querying massive and evolving datasets, deploy Dremio 4.6 to discover how easy and fun it can be to query your cloud data lake with Dremio.