
Azure Data Lake Analytics Explained by Dremio

What is Azure Data Lake Analytics?

Azure Data Lake Analytics (ADLA) is one of the three main components of Microsoft’s Azure Data Lake. It is an on-demand job service built on Apache YARN, offered by Microsoft to simplify big data analytics by eliminating the need to deploy, configure, and maintain hardware environments for heavy analytics workloads. Not only does this allow data consumers to focus on what matters, it also allows them to do so in the most cost-effective way thanks to ADLA’s pay-per-use pricing model.


ADLA Features

Limitless Scalability

Azure Data Lake Analytics has been designed to let users perform analytics on data up to petabytes in size. Users pay only for the processing power they use, and they can scale the resources allocated to their analytical jobs up or down as needed.

Work Across All Cloud Data


ADLA is a fundamental element of the Azure Data Lake stack, which means that users can seamlessly process their data regardless of where it is stored in the Azure cloud:

  • Azure SQL DW
  • Azure SQL DB
  • ADLS
  • Azure Storage Blobs
  • SQL in an Azure VM

Easy and Powerful Data Processing with U-SQL

One of the fundamental pillars of Azure Data Lake Analytics is its powerful data processing language, U-SQL.

What is U-SQL?

U-SQL combines C# data types and functions with SQL features such as SELECT, FROM, and WHERE. It is based heavily on Microsoft’s internal analytics language, SCOPE. U-SQL possesses a scalable distributed runtime that enables users to efficiently analyze data stored in SQL Server instances in Azure, Azure SQL Database, Azure Blob Storage, and Azure SQL Data Warehouse.
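
To make that combination concrete, here is a minimal sketch of a U-SQL script, not an official sample: the file paths, column names, and filter value are illustrative assumptions, but the pattern shows how SQL-style keywords (EXTRACT, SELECT, FROM, WHERE, OUTPUT) mix with C# types (string, int?) and C# expressions (== comparisons, .ToLower()).

    // Hypothetical input path and schema; the C# types (string, int?) are applied on read.
    @searchlog =
        EXTRACT UserId   int,
                Region   string,
                Query    string,
                Duration int?
        FROM "/Samples/Data/SearchLog.tsv"
        USING Extractors.Tsv();

    // SQL-style SELECT/FROM/WHERE combined with C# expressions: == comparison and .ToLower().
    @results =
        SELECT UserId,
               Query.ToLower() AS NormalizedQuery,
               Duration
        FROM @searchlog
        WHERE Region == "en-gb";

    OUTPUT @results
        TO "/output/normalized_searchlog.tsv"
        USING Outputters.Tsv();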


What is it useful for?

U-SQL allows users to process any type of data, from analyzing security vulnerability pattern logs to extracting metadata from media files for machine learning. It is a language designed to make users comfortable from the get-go while letting them process any data.

It allows users to schematize unstructured data for analysis and to share data with other users. It also provides a way for developers to integrate their own code seamlessly, especially in situations where proprietary logic might be difficult to express in a different language.

Last but not least, U-SQL opens a way to process big data from where it lives, eliminating the need to move data to central locations within the Azure ecosystem for further processing.

U-SQL query lifecycle

One of the most valuable facts about U-SQL is that it allows users to query data where it resides, without having to copy or move it to a centralized place. For external systems such as Azure SQL Database, Azure SQL Data Warehouse, and SQL Server running in an Azure VM, this is achieved through federated queries against the data sources. Using query push-downs, the query is executed at the data source and only the results are brought back to the user.
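
As a hedged sketch of what a federated query can look like, the following assumes an Azure SQL Database is being registered as an external source; the data source name MyAzureSqlDb, the catalog credential SqlDbCreds.OrdersCred, the dbo.Orders table, and the Quantity column are all hypothetical, and the exact options depend on how the credential was created in your ADLA catalog.

    // Hypothetical registration of an external Azure SQL Database.
    // Assumes a catalog credential (SqlDbCreds.OrdersCred) was created beforehand.
    CREATE DATA SOURCE IF NOT EXISTS MyAzureSqlDb
    FROM AZURESQLDB
    WITH
    (
        PROVIDER_STRING = "Database=SalesDb;Trusted_Connection=False;Encrypt=True",
        CREDENTIAL = SqlDbCreds.OrdersCred,
        REMOTABLE_TYPES = (bool, byte, short, int, long, decimal, float, double, string, DateTime)
    );

    // The query runs against the remote table; the filter can be pushed down to the
    // source so that only the matching rows travel back to ADLA.
    @orders =
        SELECT *
        FROM EXTERNAL MyAzureSqlDb
        LOCATION "dbo.Orders"
        WHERE Quantity > 100;

    OUTPUT @orders
        TO "/output/large_orders.tsv"
        USING Outputters.Tsv();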

Another great benefit of U-SQL is that you write queries in a language that looks very similar to SQL; the queries are then executed by a number of compute units specified for the job. These units are formally known as AUs, or Azure Data Lake Analytics Units; if you are familiar with YARN, you can think of an AU as roughly equivalent to a YARN container.

This is what happens to a U-SQL query once you write it and the role that AUs play in the process:

[Diagram: the stages a U-SQL job goes through and the role AUs play in the process]

Preparing Stage

Everything starts with the user authoring a U-SQL script, which, in essence, is the logical plan of how the user wants to transform their input data. When the user creates a U-SQL job, there are three things that need to be defined:

  • 1. The U-SQL script
  • 2. The input data that the script will use
  • 3. The number of AUs


When a user submits the script, it goes through the ‘Preparing’ phase, where the script is compiled. The compilation process translates the U-SQL script into C#, C++, and associated metadata items.

Once the script is compiled, a ‘Plan’ is created and optimized for execution.

Queued Stage

Once compiled and optimized, the script progresses to the ‘Queued’ stage. Several factors can cause a job to remain in this stage; the most relevant are the amount of resources allocated to the job and the maximum number of jobs that your Azure account can run simultaneously. You can also set a priority for the job; however, this doesn’t ensure that the job will be executed right away unless there are enough resources to process the workload.

Running Stage

Once the job leaves the queue, it moves into the ‘Running’ stage. In this stage the ‘plan’ is divided into tasks called ‘vertices’. The job manager allocates the vertices needed to execute the job to the assigned AUs and uses YARN to orchestrate this allocation process.

Finalizing Stage

When a vertex finishes, its AU is assigned another vertex. In this way, a job is completed one vertex at a time. Multiple vertices can run at the same time across multiple AUs, provided enough resources have been allocated. Once processing is done, all the AUs are released so the user can assign them to other jobs. At the end of this stage, the user can either consume the processed data locally or store it in ADLS.

Vertices and AUs

When a vertex runs, it uses the resources of one AU. The number of AUs users select for a script therefore determines how many vertices can run at any given moment.


It is important to consider the cost associated with the number of AUs selected; while we might be inclined to think that more AUs will get the job done faster, this is not always the case.


As an example, consider a job that, because of its size, is divided into many vertices, but the user assigns only 1 AU; in this case the job will have only one active vertex at any given point in time. On the other hand, if a job is divided into 10 vertices but we allocate the maximum of 32 AUs, Azure will notice that we have over-allocated resources to the job and throw a warning. Why? The job will use at most 10 vertices at any given moment, meaning the remaining AUs will sit idle; however, you would still have to pay for them.
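
To put rough, purely hypothetical numbers on the second scenario: if each of the 10 vertices needs about 2 minutes of compute, allocating 10 AUs lets all of them run in parallel and the job finishes in roughly 2 minutes. Allocating 32 AUs finishes no faster, because only 10 vertices ever exist, yet the job is billed for all 32 allocated AUs for those 2 minutes.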

Anatomy of a U-SQL Job

[Screenshot: job graph of a completed U-SQL job]

This is the graph of a completed U-SQL job. It contains the following elements:

  • 1. Name of the Stage
  • 2. Number of vertices used for the stage, as well as the average time per vertex to execute it.
  • 3. Number of rows written
  • 4. Data read
  • 5. Data written

In the ADLA environment you can also take a look at the AU Analysis, where you can see what other AU-allocation scenarios could be chosen to execute the same script.

[Screenshot: AU Analysis view in the ADLA interface]

There we can observe a breakdown of the efficiency of each scenario; the comparison blocks let us see the relationship between used and allocated resources. In this case we can see that the used AUs correspond to the allocated ones. Let’s dial the number of allocated resources up to the maximum and see what this scenario looks like.

[Screenshot: AU Analysis scenario with the maximum number of AUs allocated]

Here, 21 AUs were allocated to the job. While this might seem like a good idea, notice that the calculated efficiency of this allocation would be just 2%.

Understanding the elements of a U-SQL query

At this point we can summarize the general pattern of a U-SQL query process as: Read, Process, and Store. The following table explains the way that data is ‘read’ from the different Azure sources and what happens to it once it is processed.

Read                                 Process                        Store
Azure Storage Blobs (EXTRACT)                                       Azure Storage Blobs (OUTPUT)
Azure SQL DB (SELECT)                SELECT … FROM … WHERE …
Azure Data Lake (EXTRACT, SELECT)                                   Azure Data Lake (OUTPUT, INSERT)

U-SQL scripts can be written as a series of statements that indicate processing actions against the input data. According to Microsoft, U-SQL scripts are typically structured to retrieve data from source systems in a rowset format, transform it as needed, and then output the transformed data to a file or a U-SQL table for analysis.

The following example shows a simple U-SQL script taken directly from ADLA’s user interface.

[Screenshot: a simple U-SQL script in the ADLA user interface]

Here we can see that this script is not written to transform data; it simply extracts data from a log file, applies a schema, and stores the results in a TSV file, following this structure (a sketch of a similar script follows the list below):

  • 1. Rowset: essentially an intermediate table that U-SQL uses to transfer data between statements. The data types are the same as in C#, and the schema is imposed on read when the data is extracted from the source.
  • 2. The input data is read from a sample directory in ADL
  • 3. Custom function to extract data from the TSV file
  • 4. The output of the job is stored in a file in ADL
  • 5. This is a built-in function that wraps the output in a TSV format
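
Since the screenshot itself is not reproduced here, the following is only a hedged reconstruction of what such an extract-and-output script might look like: the paths and column names are assumptions, and where the original apparently used a custom extractor, this sketch uses the built-in Extractors.Tsv() for simplicity. The numbered comments map to the elements above.

    // (1) Rowset: an intermediate, table-like result; the schema (C# types) is imposed on read.
    // (2) The input data is read from a sample directory in ADL (hypothetical path).
    @log =
        EXTRACT UserId   int,
                Start    DateTime,
                Region   string,
                Query    string,
                Duration int?
        FROM "/Samples/Logs/SearchLog.tsv"
        USING Extractors.Tsv();           // (3) extractor that parses the TSV input

    // (4) The output of the job is stored in a file in ADL.
    OUTPUT @log
        TO "/output/SearchLog_output.tsv"
        USING Outputters.Tsv();           // (5) built-in function that writes the rowset as TSV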

Benefits of U-SQL

  • Eliminates costly data movement by using federated queries to get results from data where it lives
  • Allows users to obtain a single view of the data regardless of where it is located
  • One single query language for all data structures and types
  • Can access JSON, Azure Blobs, ADLS, Text, CSV, TSV, images
  • Provides the flexibility for users to write their own data extractors

ADLA Cost Model

One of the characteristics that makes ADLA one of the most attractive cloud-based data analytics solutions is its no-upfront-cost, pay-per-use model. Users pay only for the resources allocated to each job, and when a job completes, the resources are released immediately.

At the time of writing, Microsoft bills users based on AU usage, specifically the number of AUs multiplied by the execution duration of a job. Usage is calculated in terms of ‘AU Seconds’.


There are two pricing tiers: pay-as-you-go, which charges a fixed rate of $2 per AU Hour (so a job that consumes 75 AU Hours costs $150), and monthly commitment plans, which offer savings of up to 70% per AU.


Understanding AU Seconds

An AU Second is the unit used to measure the compute resources consumed by a job. A simple way to see this is the following:

  • 1 AU for a job that takes a second to execute = 1 AU Second
  • 1 AU for a job that takes a minute to execute = 60 AU Seconds

You will find that the AU analysis in the Azure user interface shows usage data in AU Hours; an AU Hour is simply the number of AU Seconds divided by 3,600.
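
For example, a job that runs with 10 AUs for 30 minutes consumes 10 × 1,800 = 18,000 AU Seconds, or 18,000 ÷ 3,600 = 5 AU Hours; at the pay-as-you-go rate of $2 per AU Hour mentioned above, that works out to $10.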


One of the great advantages of this pricing model is that it only bills users for jobs that succeed or that are canceled by the user; any jobs that fail due to platform issues won’t be included in the billing cycle.

Dremio and Azure Data Lake

Dremio connects to data lakes like ADLS, Amazon S3, HDFS, and more, putting all of your data in one place and giving it structure. We provide an integrated, self-service interface for data lakes, designed for BI users and data scientists. Dremio increases the productivity of these users by allowing them to easily search, curate, accelerate, and share datasets with other users. In addition, Dremio allows companies to run their BI workloads from their data lake infrastructure, removing the need to build cubes or BI extracts.

Here’s how Dremio helps you leverage your data lake:

Data Acquisition

With Dremio, you don’t need to worry about the schema and structure of the data that you put in your data lake. Dremio takes data from whatever kind of source (relational or NoSQL) and converts it into a SQL-friendly format without making extra copies. You can then curate, prepare, and transform your data using Dremio’s intuitive user interface, making it ready for analysis.

Data Curation

Dremio makes it easy for your data engineers to curate data for the specific needs of different teams and different jobs, without making copies of the data. By managing data curation in a virtual context, Dremio makes it fast, easy, and cost effective to design customized virtual datasets that filter, transform, join, and aggregate data from different sources. Virtual datasets are defined with standard SQL, so they fit into the skills and tools already in use by your data engineering teams.

Optimization and Governance

In order to scale these results across your enterprise, Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the Data Graph that indexes all of your metadata, allowing business users to easily make sense of the data in the data lake. Everything created by users (spaces, directories, and virtual datasets) makes up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets, and all your queries are also maintained in the Data Graph, creating a data lineage that allows you to govern and maintain your data.

Analytics Consumption

At its core, Dremio makes your data self-service, allowing any data consumer at your company to find the answers to your most important business questions in your data lake, whether you’re a business analyst who uses Tableau, Power BI, or Qlik, or a data scientist working in R or Python. Through the user interface, Dremio also allows you to share and curate virtual datasets without making extra copies, optimizing storage and supporting collaboration across teams. Lastly, Dremio accelerates your BI tools and ad-hoc queries with Data Reflections, and integrates with all your favorite BI and data science tools, allowing you to leverage the tools you already know how to use on your data lake.

Data-as-a-Service Platform for Azure

Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data catalog, and data lineage, all on any source, and delivered as a self-service platform.

Run SQL on any data source. Including optimized push downs and parallel connectivity to non-relational systems like Elasticsearch, S3 and HDFS.

Accelerate data. Using Data Reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.

Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.

Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.

Data Lineage. Full visibility into data lineage, from data sources, through transformations, joining with other data sources, and sharing with other users.

Visit our tutorials and resources to learn more about how you can gain insights from your data stored in ADLS, faster, using Dremio.