Azure Data Lake Analytics
What is Azure Data Lake Analytics?
Azure Data Lake Analytics (ADLA) is one of the three main components of Microsoft’s Azure Data Lake. It is an on-demand job service built on Apache Hadoop YARN, designed to simplify big data analytics by eliminating the need to deploy, configure and maintain hardware environments for heavy analytics workloads. This not only lets data consumers focus on what matters, but also lets them do so cost-effectively thanks to ADLA’s pay-per-use pricing model.
ADLA is designed to let users perform analytics on data up to petabytes in size. Users only pay for the processing power used, and have the ability to scale up or down the number of resources needed for their analytical jobs.
Work Across all Cloud Data
ADLA is a fundamental element in the Azure Data Lake stack, which means users can seamlessly process their data regardless of where it is stored in the Azure cloud:
- Azure SQL Data Warehouse
- Azure SQL Database
- Azure Data Lake Storage (ADLS)
- Azure Storage Blobs
- SQL in an Azure virtual machine (VM)
Easy and Powerful Data Processing with U-SQL
One of the fundamental pillars of ADLA is its powerful data processing language U-SQL.
U-SQL combines C# data types and functions with SQL statements such as SELECT, FROM and WHERE, and is based heavily on Microsoft’s internal analytics language SCOPE. U-SQL’s scalable distributed runtime enables users to efficiently analyze data stored across SQL Server instances in Azure, Azure SQL Database, Azure Blob storage and Azure SQL Data Warehouse.
U-SQL enables users to process any type of data, from security vulnerability pattern logs to media metadata for machine learning, and is designed to make the user comfortable from the get-go.
It allows users to schematize unstructured data for analysis and to easily share that data with other users. U-SQL also lets developers integrate their own code seamlessly, which is especially useful when proprietary logic is difficult to express in other languages.
Last but not least, U-SQL offers a way to process big data from where it lives, eliminating the need to move data to central locations within the Azure ecosystem for further processing.
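As a minimal sketch of how C# expressions mix with SQL-style statements, consider the script below. The file path and column names are illustrative assumptions loosely modeled on ADLA’s well-known SearchLog sample data, not taken from this document:

```sql
// Hedged sketch: the input path and schema are assumptions.
@searchlog =
    EXTRACT UserId   int,
            Region   string,
            Query    string,
            Duration int?
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// C# expressions (ToUpper, the ?? null-coalescing operator on a nullable int)
// used directly inside SQL-style SELECT and WHERE clauses.
@cleaned =
    SELECT UserId,
           Region.ToUpper() AS Region,
           (Duration ?? 0) AS DurationMs
    FROM @searchlog
    WHERE Query != null;

OUTPUT @cleaned
TO "/output/cleaned.tsv"
USING Outputters.Tsv();
```

Note that comparisons such as `Query != null` follow C# semantics rather than SQL’s `IS NULL`, which is part of what makes the language feel familiar to .NET developers.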
U-SQL Query Lifecycle
One of the most valuable features of U-SQL is that it allows users to query data where it resides without having to copy or move it to a centralized place. For external systems such as SQL Server in Azure, Azure SQL Data Warehouse and Azure SQL Database, this is achieved through federated queries against the data sources. Using query pushdown, the query is executed at the data source and only the results are returned to the user.
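A hedged sketch of such a federated query is shown below. The data source name `MySqlDbSource` and the remote table `dbo.Orders` are assumptions for illustration, and registering the data source with its credential is a one-time setup step not shown here:

```sql
// Assumes a data source named MySqlDbSource has already been registered
// against an Azure SQL Database, along with an associated credential.
// The remote query is pushed down and executed inside the Azure SQL Database;
// only the result rows are returned to the U-SQL job.
@orders =
    SELECT *
    FROM EXTERNAL MySqlDbSource
         EXECUTE @"SELECT OrderId, Amount FROM dbo.Orders WHERE Amount > 100";

OUTPUT @orders
TO "/output/orders.tsv"
USING Outputters.Tsv();
```

Because the filter runs at the source, only matching rows travel over the network, which is the point of query pushdown.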
Another benefit of U-SQL is the ability to write a query in a language similar to SQL, which is then executed by the number of compute units specified within the job. These units are referred to as Azure Data Lake Analytics Units (AUs), and if you’re familiar with YARN, you can think of an AU much like a YARN container.
The following illustrates what happens to a U-SQL query once it is written and the role AUs play in the process:
The process begins when the user authors a U-SQL script, and defines the following three parameters:
- 1. The U-SQL script
- 2. The input data that the script will use
- 3. The number of AUs
When the user submits the script, it goes through the “Preparing” phase, where the script is compiled, translating the U-SQL into C# and C++ code along with associated metadata items.
Once the script is compiled, a “Plan” is created and optimized for execution.
Once compiled and optimized, the script progresses to the “Queued” stage. Several factors can keep a job in this stage; the most common are insufficient resources available for the job and exceeding the maximum number of jobs you can run simultaneously in your Azure account. You can set the priority of queued jobs; however, this doesn’t guarantee that a job will execute right away unless there are enough resources to process the workload.
Once the job leaves the queue, it moves into the “Running” stage. In this stage, the plan is divided into tasks called “vertices.” The job manager assigns vertices to the allocated AUs, using YARN to orchestrate the allocation process.
When a vertex finishes, its AU is assigned another vertex, so each AU completes the job one vertex at a time. Multiple vertices can run in parallel, one per AU, as long as enough resources are allocated. Once processing is complete, all the AUs are released so the user can assign them to other jobs. At the end of this stage, the user can either consume the processed data locally or store it in ADLS.
Vertices and AUs
When a vertex runs, it uses the resources of 1 AU. The number of AUs a user selects to execute a script therefore determines how many vertices can run at any given moment.
It is important to take into consideration the cost associated with the number of AUs selected. You might be inclined to think that more AUs will get the job done faster, however, this is not always the case.
To illustrate this point, imagine a large job that is divided into many vertices, but the user assigns only 1 AU. In this case, the job will run only one vertex at any point in time. Alternatively, if a job is divided into 10 vertices and you allocate the maximum of 32 AUs, Azure will recognize the overallocation and issue a warning. In this scenario, the job can use at most 10 AUs at any given moment, meaning the other 22 AUs sit idle, yet you still have to pay for them.
Anatomy of a U-SQL Job
This graph illustrates a completed U-SQL job, and contains the following elements:
- 1. Name of the Stage
- 2. Number of vertices used for this stage, as well as the average time per vertex to execute it
- 3. Number of rows written
- 4. Data read
- 5. Data written
In the ADLA environment you can also review the AU analysis to see what other scenarios can be chosen to execute the same script.
In this view you can observe the efficiency breakdown of each scenario, and the comparison blocks show the relationship between used and allocated resources. In this case, the used AUs match the allocated ones. If you dial the number of allocated resources up to the maximum, the following table depicts the results.
In this scenario, 21 AUs were allocated for the job, and while this might seem ideal, notice that the calculated efficiency of the job would be just 2%.
Understanding the Elements of a U-SQL Query
The general pattern of a U-SQL query can be summarized as: read, process and store. The following table shows how data is read from the different Azure sources and what happens to it once it is processed.
| Read | Process | Store |
| --- | --- | --- |
| Azure Storage Blobs (EXTRACT) | SELECT … FROM … WHERE … | Azure Storage Blobs (OUTPUT) |
| Azure SQL DB (SELECT) | | |
| Azure Data Lake (EXTRACT, SELECT) | | Azure Data Lake (OUTPUT, INSERT) |
U-SQL scripts can be written as a series of statements that indicate processing actions against the input data. According to Microsoft, U-SQL scripts are typically structured to retrieve data from source systems as rowsets, transform it as needed and then output the transformed data to a file or a U-SQL table for analysis.
The following example shows a simple U-SQL script taken directly from ADLA’s user interface:
Here you can see that this script is not written to transform data; it simply extracts data from a log file, applies a schema and stores the results in a TSV file, following this structure:
- 1. Rowset: essentially an intermediate table that U-SQL uses to transfer data between statements. The data types are the same as in C#, and the schema is imposed on read when the data is extracted from the source.
- 2. The input data is read from a sample directory in ADL
- 3. Custom function to extract data from the TSV file
- 4. The output of the job is stored in a file in ADL
- 5. This is a built-in function that wraps the output in a TSV format
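The structure described above corresponds to a script along these lines. This is a sketch reconstructed from the numbered description rather than the original screenshot; the path and schema follow ADLA’s familiar SearchLog sample and should be treated as assumptions:

```sql
// (1) Rowset @searchlog: the schema is imposed on read, using C# data types
@searchlog =
    EXTRACT UserId   int,
            Start    DateTime,
            Region   string,
            Query    string,
            Duration int?
    FROM "/Samples/Data/SearchLog.tsv"   // (2) input read from a sample directory in ADL
    USING Extractors.Tsv();              // (3) extractor that parses the TSV input

OUTPUT @searchlog
TO "/output/SearchLog_output.tsv"        // (4) output stored as a file in ADL
USING Outputters.Tsv();                  // (5) built-in function that writes TSV output
```

The numbered comments map back to the five elements listed above.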
Benefits of U-SQL
- Eliminates costly data movement, uses federated queries to get results from data where it lives
- Allows users to obtain a single view of the data regardless of where it is located
- Uses one single query language for all data structures and types
- Accesses JSON, Azure blobs, ADLS, text, CSV, TSV, images
- Provides the flexibility for users to write their own data extractors
ADLA Cost Model
One of the characteristics that makes ADLA among the most attractive cloud-based data analytics solutions is its no-upfront-cost, pay-per-use model. Users pay only for the resources allocated to each job, and when a job completes the resources are released immediately. At the time of writing, Microsoft bills AU usage based on the number of AUs multiplied by the execution duration of a job. Usage is calculated in terms of “AU seconds.”
There are two pricing tiers: pay-as-you-go and monthly commitment plans. Pay-as-you-go charges a fixed $2 per AU hour, so a job that consumes 75 AU hours costs $150. The monthly plans offer savings of up to 70% per AU.
Understanding AU Seconds
An AU Second is the unit used to measure the compute resources used to process a job. A simple way to see this would be the following:
- 1 AU for a job that takes a second to execute = 1 AU Second
- 1 AU for a job that takes a minute to execute = 60 AU Seconds
You will find that when doing the AU analysis in the Azure user interface, usage is reported in AU hours, which is simply the number of AU seconds divided by 3,600.
One of the great advantages of this pricing model is that users are billed only for jobs that succeed or that they cancel; any jobs that fail due to platform issues are excluded from the billing cycle.
Dremio is the Data Lake Engine for Azure
Dremio connects to data lakes like ADLS, Amazon S3, Apache HDFS and more — putting all of your data in one place while also providing valuable structure. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data cataloging and data lineage, all on any source, delivered as a self-service platform. In addition, Dremio allows companies to run their BI workloads from their existing data lake infrastructure, removing the need to build cubes or BI extracts.
Run SQL on any Data Source
With Dremio, you don’t need to worry about the schema and structure of the data that you put in your data lake. Dremio takes data from whatever kind of source (relational or NoSQL) and converts it into a SQL-friendly format without making extra copies. You can then curate, prepare, and transform your data using Dremio’s intuitive user interface, making it ready for analysis.
Integrated Data Curation
Dremio makes it easy for your data engineers to curate data for the specific needs of different teams and different jobs, without making unnecessary copies. By managing data curation in a virtual context, it’s easy and cost-effective to design customized virtual datasets that filter, transform, join, and aggregate data from different sources. And, because our virtual datasets are defined with standard SQL, they fit into the skills and tools already in use by your data engineering teams.
Optimization and Governance
In order to scale results across your enterprise, Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the data graph that indexes all of your metadata, allowing business users to make sense of the data in the data lake. Anything created by users — spaces, directories and virtual datasets — all make up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets and all your queries are also maintained in the data graph, creating a data lineage, allowing you to govern and maintain your data.
At its core, Dremio makes your data self-service, allowing any data consumer at your company to find the answers to your most important business questions in your data lake, whether you’re a business analyst who uses Tableau, Power BI or Qlik, or a data scientist working in R or Python. Through the user interface, Dremio also allows you to share and curate virtual datasets without any overhead, so you can optimize storage and improve collaboration across teams.
Accelerate Your Data
Dremio accelerates your BI tools and ad-hoc queries with data reflections, and integrates with your favorite BI and data science technology, allowing you to leverage the tools you already know how to use, all on your data lake.
Visit our tutorials and resources to learn more about how Dremio can help you gain insights from your data stored in ADLS.