22 minute read · December 17, 2020
Exploring Cloud Data Lake Data Processing Options – Spark, EMR, Glue
· Senior Solutions Architect, Dremio
Overview
Data processing is a critical part of the *data pipeline* — the flow of data from its collection point to the point where it lands in a data lake. When designing a data lake, particular attention needs to be paid to the proper processing of data, for a host of reasons. Data processing tends to be an expensive operation, both in terms of manpower and compute cost. It is also critical to the consumability of data: as described below, raw data is very different from data that can be consumed by end users such as business intelligence analysts and data scientists. Further, a data processing operation that lacks data integrity is useless (or potentially disastrous) to the consumers of that data; business users can make wrong business decisions, and AI and ML algorithms can produce inaccurate results — unbeknownst to the data scientists relying on them.
It is important to understand that processing data is not simply moving it from point A to point B (e.g., from the IoT device where data originates to the data lake where it lands as its final destination). Rather, it is a complex “heavy lifting” process that transforms data, at large scale, so that it is morphed and joined with other data into a format that can be queried.
This article explores the following three technologies: Apache Spark, Amazon EMR and AWS Glue. Each is involved and complex in its own way, and there’s also overlap among the three and no concrete right or wrong way in which to use them. Your decision on which technology to use for your own needs depends on a multitude of factors: cost, effort, familiarity with the technology or a specific stack, corporate guidelines and more. However, it is good practice to have an overall understanding of what is out there and how the three compare with one another.
Real-World Scenario
Let’s explore the problem in the context of something familiar: a self-driving car. In general, a self-driving car understands its environment through sensors; for example, the 2018 Tesla Model 3 has a forward radar, eight cameras and 12 ultrasonic sensors that together produce roughly 40TB of data per hour.
In contrast, Apollo 11’s computer had 32K RAM, so data generated by a single Tesla car in a single hour of driving could fill up Apollo 11’s storage over a billion times!
Let’s assume we solved the obvious storage problem, and all data has been piped into a data lake (e.g., Amazon S3, Microsoft ADLS (Azure Data Lake Storage), Google Cloud Storage, etc.). Unfortunately, we have to address another challenge: how to make sense of all this disparate data and make it consumable so that it can answer real business questions.
To begin, identify what the data looks like. It may arrive as a stream of raw binary values.
Or it could be XML:
<Placemark>
  <name>Floating placemark</name>
  <visibility>0</visibility>
  <description>Floats a defined distance above the ground.</description>
  <LookAt>
    <longitude>-122.0839597145766</longitude>
    <latitude>37.42222904525232</latitude>
    <altitude>0</altitude>
    <heading>-148.4122922628044</heading>
    <tilt>40.5575073395506</tilt>
    <range>500.6566641072245</range>
  </LookAt>
  <styleUrl>#downArrowIcon</styleUrl>
  <Point>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-122.084075,37.4220033612141,50</coordinates>
  </Point>
</Placemark>
Alternatively, it could be JSON:
"type": "Feature", "properties": {}, "geometry": {"type": "LineString", "coordinates": [[-110.390625, 42.8115217450979 ], [-111.796875, 27.371767300523047 ], [-94.21875, 30.14512718337613 ], [-86.1328125, 40.97989806962013 ], [-118.47656249999999, 34.59704151614417 ], [-100.1953125, 31.952162238024975 ], [-15.1171875, 47.754097979680026 ], [-9.84375, 53.54030739150022 ]
Finally, it could be CSV, TSV, ORC, Apache Parquet, gzip-compressed files, any database storage format, etc.
All of this data is of little value — in its raw form — to an executive decision-maker who wants to see a simple, high-level business metric.
The goal is to find a way to make all of this disparate data analytics-ready.
Enter ETL
So how do we morph and stitch disparate forms of data into a data structure that can be analyzed by business users, data analysts, data scientists and the like? A common approach is via an ETL process.
ETL stands for extract, transform, and load, and ETL tools move data between systems. If ETL were for people instead of data, it would be akin to public and private transportation. Companies use ETL to safely and reliably move their data from one system to another.
Some notable ETL tools are: Informatica, Talend, Panoply, Fivetran and Stitch. A more exhaustive list would include dozens if not hundreds.
A modern big data problem requires a high degree of heavy lifting, whereas a more traditional ETL workflow may look like this (see the sketch after the list):
- Extract “opportunities” from a CRM system (e.g., Salesforce)
- Extract “employees” from an HR system of record (e.g., Workday)
- Filter the employees data to keep only employees who work in sales
- Join opportunities and employees into a single table (or multiple tables in a star-schema)
- Upload the joined result set to a data warehouse (e.g., Amazon Redshift, Snowflake, etc.)
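To make this concrete, here is a minimal PySpark sketch of that traditional flow. The file paths, column names and warehouse connection settings are hypothetical placeholders, not details taken from the steps above.

```python
# A minimal PySpark sketch of the extract-transform-load flow outlined above.
# Paths, column names and the JDBC target are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("traditional-etl-sketch").getOrCreate()

# Extract: read CSV exports assumed to come from the CRM and HR systems
opportunities = spark.read.option("header", True).csv("s3://exports/crm/opportunities.csv")
employees = spark.read.option("header", True).csv("s3://exports/hr/employees.csv")

# Transform: keep only sales employees, then join them to their opportunities
sales_employees = employees.filter(employees.department == "Sales")
joined = opportunities.join(
    sales_employees,
    opportunities.owner_id == sales_employees.employee_id,
    how="inner",
)

# Load: write the joined result to the data warehouse over JDBC
# (the Redshift URL, table and credentials below are placeholders)
(joined.write.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")
    .option("dbtable", "sales.opportunity_owners")
    .option("user", "etl_user")
    .option("password", "********")
    .mode("overwrite")
    .save())
```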
At this point, a business user can use a business intelligence (BI) tool, such as Tableau, Power BI, Qlik, Looker, etc., to build charts and reports on top of that data.
It is worth noting that in a data lake the problem is often exacerbated. While a traditional data warehouse imposes some constraints on data that can be loaded (for example, a command to upload a table to Redshift will fail if the data does not match a predefined schema), in a data lake there are no natural guardrails that protect against data that is not conducive to analytics.
Apache Spark
Spark is an open source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast queries against data of any size. It is faster than previous approaches to working with big data, such as MapReduce, because Spark processes data in memory (RAM) rather than reading it from disk. It can also be used for many tasks: running distributed SQL, building data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs or data streams, and much more.
Schematically, Spark includes:
Apache Spark Core – Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
Spark SQL – Spark SQL is Apache Spark’s module for working with structured data. The interfaces offered by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed.
Spark Streaming – This component allows Spark to process real-time streaming data. Data can be ingested from many sources like Apache Kafka, Apache Flume, and HDFS (Hadoop Distributed File System). Then the data can be processed using complex algorithms and pushed out to file systems, databases and live dashboards.
MLlib (Machine Learning Library) – Spark is equipped with a rich library known as MLlib. This library contains a wide array of machine learning algorithms: classification, regression, clustering and collaborative filtering. It also includes tools for constructing, evaluating and tuning ML pipelines. All of these capabilities are designed to scale out across a cluster.
GraphX – Spark also comes with GraphX, a library for manipulating graphs and performing graph-parallel computations. GraphX unifies ETL processes, exploratory analysis, and iterative graph computation within a single system.
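As a small illustration of the Spark SQL component, the snippet below registers a DataFrame as a temporary view and queries it with plain SQL. The sample data and column names are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A tiny DataFrame of sensor readings (values invented for the demo)
readings = spark.createDataFrame(
    [("cam_front", 42.1), ("radar", 17.3), ("cam_front", 39.8)],
    ["sensor", "value"],
)

# Register the DataFrame as a view and query it with plain SQL
readings.createOrReplaceTempView("readings")
spark.sql(
    "SELECT sensor, AVG(value) AS avg_value FROM readings GROUP BY sensor"
).show()
```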
The following are real-world use cases for Spark. In the e-commerce industry, real-time transaction information could be passed to a streaming Spark clustering algorithm such as k-means, or to collaborative filtering such as ALS. The results could then be combined with other unstructured data sources, such as customer comments or product reviews, and used to improve and adapt recommendations over time as new trends emerge. In the finance or security industries, the Spark stack could be applied to a fraud or intrusion detection system or to risk-based authentication. It could achieve top-notch results by harvesting huge amounts of archived logs and combining them with external data sources, such as information about data breaches and compromised accounts, and with information from the connection/request itself, such as IP geolocation or time.
Spark could fit into your data lake in two places: as part of your ETL pipeline and for large-scale data processing that usually involves machine learning.
In your organization, you may be faced with a very large amount of data that is being ingested through various pipelines. Spark is ideal for running jobs against that data, performing ETL transformations and writing the resulting processed data into your data lake — in a format that is friendly to running future queries against. Format is the file structure in which data is written to the data lake as well as the structure of the data itself.
Keep in mind that data lakes are not particular about file types; they will accept movie files as easily as they would text files. With future querying in mind, you would want to make sure that the data Spark writes into your data lake is written in a physical format that is readable by business intelligence tools and other data analytics tools. While CSV is common, for analytical data the Parquet file format is preferred because it is more performant when queried. Parquet is an open source file format available to any project in the Hadoop ecosystem. It is designed as an efficient, flat columnar storage format, in contrast to row-based formats such as CSV or TSV. Because it is columnar, Parquet readers can fetch only the columns a query needs, greatly minimizing I/O. As a result, one preferred usage pattern is for Spark to ETL your raw data and output it to your data lake as Parquet-formatted files.
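For example, a Spark job might read raw CSV from a landing area and write it back to the lake as partitioned Parquet. The sketch below assumes hypothetical bucket names, column names and a date-based partitioning key.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

# Read raw CSV files from the landing zone (path and schema are placeholders)
raw = spark.read.option("header", True).csv("s3://my-data-lake/raw/telemetry/")

# Light transformation: derive a date column to partition by
cleaned = raw.withColumn("event_date", to_date(col("event_time")))

# Write analytics-ready Parquet, partitioned for efficient future queries
(cleaned.write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3://my-data-lake/curated/telemetry/"))
```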
A second common use case for Spark is large-scale data processing that usually involves machine learning. Apache Spark is a powerful framework for distributing computations across a cluster in an easy and declarative way, and it is becoming a standard across industries, which makes it a natural home for the advances of deep learning. Parts of deep learning are also extremely computationally heavy, and distributing these processes may be the solution to this and other problems; Apache Spark is one of the easiest ways to distribute them.
In terms of your data pipeline, this process would often happen “on the other end” of a data lake. While Spark ETL places data into your data lake, Spark’s ML/AI-type jobs typically extract data from your data lake for processing.
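On that other end, a job might pull curated Parquet out of the data lake and train a model with MLlib. The sketch below uses k-means clustering as a stand-in for a heavier ML workload; the paths and feature columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("lake-ml-sketch").getOrCreate()

# Extract curated Parquet data from the data lake (path and columns are placeholders)
df = spark.read.parquet("s3://my-data-lake/curated/telemetry/")

# Assemble numeric columns into the single feature vector MLlib expects
assembler = VectorAssembler(
    inputCols=["speed", "accel", "brake_pressure"], outputCol="features"
)
features = assembler.transform(df)

# Train a simple k-means clustering model on the distributed data
model = KMeans(k=5, seed=42, featuresCol="features").fit(features)
model.transform(features).select("features", "prediction").show(5)
```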
Amazon EMR
Amazon Elastic MapReduce (EMR) is an Amazon Web Services (AWS) offering for big data processing and analysis. EMR offers an expandable low-configuration service as an easier alternative to running in-house cluster computing. EMR is based on Hadoop, a Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
Because EMR is provided by Amazon, the underpinning hardware and services are fundamentally AWS. EMR processes big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The elastic in EMR’s name refers to its dynamic resizing ability, which allows it to ramp up or reduce resource use depending on the demand at any given time. EMR is used for log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, bioinformatics and more. EMR also supports workloads based on Spark, Presto and Apache HBase — the latter of which integrates with Apache Hive and Apache Pig for additional functionality.
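For a sense of what this looks like in practice, here is a hedged boto3 sketch that launches a small EMR cluster with Spark installed and submits a single Spark step. The release label, instance types, IAM role names and S3 paths are assumptions that would differ per account.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small cluster with Spark and submit one Spark job as a step.
# Release label, instance types, roles and S3 paths are placeholders.
response = emr.run_job_flow(
    Name="data-lake-etl-demo",
    ReleaseLabel="emr-6.2.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "etl-to-parquet",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/raw_to_parquet.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",
)
print(response["JobFlowId"])
```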
Because EMR can be leveraged in so many different ways, there is no single architecture that encompasses all use cases. A common one centers on a data lake built on S3 and its various buckets, with EMR clusters reading from and writing to it.
According to Amazon, EMR benefits include:
- Ease of use – Using EMR notebooks, individuals and teams can easily collaborate and interactively explore, process and visualize data. EMR takes care of provisioning, configuring and tuning clusters so users can focus on running analytics.
- Low cost – EMR pricing is simple and predictable: pay a per-instance rate for every second used, with a one-minute minimum charge.
- Elasticity – EMR decouples compute and storage allowing users to scale each independently and take advantage of tiered storage of Amazon S3. Users can provision one or thousands of compute instances to process data at any scale and use auto scaling to increase or decrease instances automatically.
- Reliability – EMR is tuned for the cloud and constantly monitors clusters. And with multiple master nodes, clusters are highly available and automatic failover occurs in the event of a node failure. EMR provides the latest stable open source software releases, so IT doesn’t have to manage updates and bug fixes.
- Security – EMR automatically configures EC2 firewall settings controlling network access to instances and launches clusters in an Amazon VPC. Server-side or client-side encryption is available with the AWS Key Management Service or your own customer-managed keys. EMR works with other encryption options as well and offers authentication with Kerberos and other fine-grained data access controls.
- Flexibility – Organizations have complete control over their cluster, with root access to every instance, the ability to reconfigure clusters, and the option to install additional applications with bootstrap actions. You can also launch EMR clusters with custom Amazon Linux AMIs.
In terms of the data lake, EMR is often used to ingest data into your data lake. Some of that ingestion happens via jobs that read your data from various data sources, manipulate it and store it in your data lake. More and more EMR usage is dedicated to variations of live streaming, for example, real-time streaming and clickstream analytics. The former refers to analyzing events from Kafka, Amazon Kinesis or other streaming data sources in real time with Spark Streaming and Apache Flink, creating long-running, highly available and fault-tolerant streaming data pipelines on EMR. Data sets are then persisted to S3 or HDFS (i.e., your data lake).
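As a hedged illustration of that streaming-ingest pattern, the Spark Structured Streaming sketch below reads events from a Kafka topic and continuously persists them to the data lake as Parquet. The broker addresses, topic name and paths are placeholders, and the Kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-to-lake").getOrCreate()

# Read a stream of events from Kafka (brokers and topic are placeholders)
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "clickstream")
    .load())

# Kafka keys and values arrive as binary; cast them to strings for downstream parsing
parsed = events.selectExpr("CAST(key AS STRING) AS key",
                           "CAST(value AS STRING) AS value",
                           "timestamp")

# Continuously persist the stream to the data lake as Parquet
query = (parsed.writeStream
    .format("parquet")
    .option("path", "s3://my-data-lake/raw/clickstream/")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/clickstream/")
    .start())

query.awaitTermination()
```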
The latter, clickstream analysis, is often implemented on the “other end of your data lake”: reading terabytes of logs and data pertaining to website usage from your data lake, analyzing it and pushing the output to other types of storage (e.g., Elasticsearch for indexing, Redshift for analytics). So from the perspective of EMR, your data lake is both a data source and a data destination. Some EMR tools are used to feed data into your data lake, whereas others use your data lake as a data source from which data is obtained for further processing and analysis.
EMR vs. Spark
As discussed earlier, not only is there a great deal of overlap between Spark and EMR, but Spark is actually a tool within EMR’s toolset — so the relationship gets a tad confusing.
To disentangle this quandary, let’s start at the top: EMR is classified as a “big data-as-a-service” solution, whereas Spark is classified as a “big data tool.” Basically, Spark as a technology can be used either within the EMR ecosystem or, more traditionally, on a standalone Hadoop cluster in your own data center.
So a lot of your choice may not be as much about whether to use Spark, but how to use it. If you are generally an AWS shop, leveraging Spark within an EMR cluster may be a good choice. Netflix, Medium and Yelp, to name a few, have chosen this route. On the other hand, if you are a strong fan of open source, and would like to maintain independence in your choice of clouds, running Spark on a Hadoop cluster, or some other non-AWS platform, may be a good option. Uber, Slack and Shopify are some examples of companies that have chosen this route.
AWS Glue
Along with EMR, AWS Glue is another managed service from Amazon. In the context of a data lake, Glue is a combination of capabilities similar to a serverless Spark ETL environment and an Apache Hive external metastore. Many users find it easy to cleanse and load new data into the data lake with Glue, and the metadata store in the AWS Glue Data Catalog greatly expands data scalability while automating time-consuming data management and preparation tasks.
Since this article focuses on processing, Glue is recommended when your use cases are primarily ETL and when you want to run jobs on a serverless Spark-based platform. Streaming ETL in Glue enables advanced ETL on streaming data using the same serverless, pay-as-you-go platform that you currently use for your batch jobs. Glue generates customizable ETL code to prepare your data while in flight and has built-in functionality to process streaming data that is semi-structured or has an evolving schema. Use Glue to apply both its built-in and Spark-native transforms to data streams and load them into your data lake or data warehouse.
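A typical Glue job script looks something like the following sketch, which reads a table already registered in the Glue Data Catalog, remaps a few columns and writes Parquet back to S3. The database, table, column names and output path are assumptions made for illustration.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a source table that a crawler has already registered in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="telemetry_db", table_name="raw_events")

# Rename/retype a few columns on the way through (mappings are illustrative)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("event_time", "string", "event_time", "timestamp"),
              ("sensor", "string", "sensor", "string"),
              ("value", "double", "value", "double")])

# Write analytics-ready Parquet back to the data lake
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/events/"},
    format="parquet")

job.commit()
```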
Beyond plain ETL, Glue provides a cataloging service for your data lake data. In a traditional database or data warehouse, you would use a schema to define tables. In a data lake, or in an ecosystem where data is distributed among various forms of storage (e.g., some data in S3 and some in a Redshift data warehouse), Glue provides a catalog that defines the data: where and how it is stored and how it relates.
The means by which Glue catalogs data is called “crawlers.” A Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
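Setting up such a crawler can be done in the AWS console or programmatically. The boto3 sketch below creates and starts one; the crawler name, IAM role, database, S3 path and schedule are placeholders.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans a data lake prefix and catalogs what it finds.
# Name, role, database, path and schedule are placeholders for illustration.
glue.create_crawler(
    Name="telemetry-crawler",
    Role="AWSGlueServiceRole-datalake",
    DatabaseName="telemetry_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/curated/events/"}]},
    Schedule="cron(0 2 * * ? *)",  # run nightly at 02:00 UTC
)

# Run it once immediately; later runs follow the schedule
glue.start_crawler(Name="telemetry-crawler")
```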
Summary
In summary, the concept of “processing” relates to your data lake in two general ways: processing data that goes into your data lake and processing data from your data lake. The former uses your data lake as a destination, whereas the latter uses your data lake as a data source. The process of ingesting data into your data lake often involves some form of an ETL process. Spark is commonly a major component of that — be it in its own Hadoop cluster or as part of EMR. In the latter case, other EMR technologies may be used to ETL data into a data lake. AWS Glue can help with both ETLing and cataloging your data lake data for future analysis by BI tools, via products that run fast queries against your data lake, such as Dremio. Spark and EMR can also play a vital role in reading and processing data from your data lake, often in data science-related endeavors such as machine learning and artificial intelligence.
Looking for more info or help from a Dremio expert? Feel free to post on the community forum or reach out to us at [email protected].