Azure Data Lake Storage (ADLS)
Microsoft Azure Data Lake Storage (ADLS) is a fully managed, elastic, scalable, and secure file system that supports HDFS semantics and works with the Apache Hadoop ecosystem. It provides industry-standard reliability, enterprise-grade security, and effectively unlimited storage suitable for a wide variety of data. It is built for running large-scale analytics systems that need substantial computing capacity to process and analyze big data. Data stored in ADLS can be analyzed with Hadoop frameworks such as MapReduce and Hive.
One of the many characteristics of big data is its variety. ADLS is suitable for storing all types of data coming from different sources like devices, applications, and much more. It also allows users to store relational and non-relational data. Additionally, it doesn’t require a schema to be defined before data is loaded into the store.
ADLS can store virtually any size of data, and any number of files. Each ADLS file is sliced into blocks and these blocks are distributed across multiple data nodes. There is no limitation on the number of blocks and data nodes.
Building a Cloud Data Lake on Azure with Dremio and ADLS
In ADLS the data itself is held within the “Catalog” folder of the data lake store, while the metadata is managed by Azure Data Lake Analytics. For many users, working with structured data in the data lake feels very similar to working with SQL databases.
Furthermore, ADLS allows users to store:
Unstructured data: data that does not have a pre-defined data model and is not organized into a particular format or standard (e.g., tweets, text messages).
Semi-structured data: data that does not conform to the formal structure of data models associated with relational databases or other data tables (e.g., JSON, XML).
Structured data: data that resides in fixed fields within a record or file; the most common examples are spreadsheets and data contained in a relational database.
Support for Heavy Analytic Workloads
ADLS was created to support analytic workloads that require high throughput in order to improve performance and reduce latency. To process datasets up to petabytes in size, ADLS distributes data across multiple nodes, where thousands of mappers and reducers process the data in parallel to deliver fast results.
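The map/shuffle/reduce pattern mentioned above can be sketched in miniature. This is a toy, single-process word count, not a distributed Hadoop job:

```python
from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    would between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big lake", "data lake"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 2, "data": 2, "lake": 2}
```

In a real deployment each mapper and reducer runs on a separate node against blocks of data served by ADLS; the dataflow is the same.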
High Availability and Reliability
Azure maintains three copies of each data object within a region to ensure availability in the unlikely event of a hardware failure. Read transactions can be directed to any of the three copies. As a best practice, Microsoft suggests always enforcing proper access policies for your data and creating copies of critical data as part of disaster-mitigation routines.
Security
When implementing a big data solution, security shouldn’t be optional. To conform with security standards and limit the visibility of sensitive information, data must be secured both in transit and at rest. ADLS provides rich security capabilities so users can have peace of mind when storing their assets in the ADLS infrastructure. Users can monitor performance, audit usage, and control access through the integrated Azure Active Directory (AAD).
ADLS creates audit logs for all operations performed in it. These logs can be analyzed with U-SQL scripts.
ADLS provides access control through POSIX-compliant access control lists (ACLs) on the files and folders stored in its infrastructure. It manages authentication through AAD integration, based on OAuth tokens from supported identity providers. Tokens carry the user’s security-group information, which is passed through all the ADLS microservices.
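The ACL model can be illustrated with a toy evaluator. This is a deliberate simplification: real ADLS ACL checks also involve permission masks, default ACLs, and AAD group resolution:

```python
# Toy POSIX-style ACL: each entry grants read/write/execute permissions
# to a named user, a named group, or "other". Names are illustrative.
acl = {
    ("user", "alice"): "rwx",
    ("group", "analysts"): "r-x",
    ("other", None): "---",
}

def is_allowed(acl, user, groups, perm):
    """Return True if `user` (a member of `groups`) holds permission `perm`."""
    entry = acl.get(("user", user))
    if entry is not None:          # a named-user entry wins outright
        return perm in entry
    matched = False
    for g in groups:
        entry = acl.get(("group", g))
        if entry is not None:
            matched = True
            if perm in entry:      # any matching group entry may grant
                return True
    if matched:                    # matched a group, but none granted perm
        return False
    return perm in acl.get(("other", None), "---")
```

For example, `is_allowed(acl, "bob", ["analysts"], "r")` is granted through the group entry, while `is_allowed(acl, "bob", ["analysts"], "w")` is denied.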
ADLS encrypts data in transit and at rest, providing server-side encryption with encryption keys, including customer-managed keys stored in Azure Key Vault.
Data Encryption Key Types
ADLS uses a Master Encryption Key (MEK), stored in Azure Key Vault, to encrypt and decrypt data. Users have the option to manage this key themselves, but if the key is lost the data can no longer be decrypted. ADLS also includes the following keys:
Block Encryption Key (BEK): these are keys generated for each block of data.
Data Encryption Key (DEK): these keys are encrypted by the MEK and are responsible for generating BEKs to encrypt data blocks.
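The key hierarchy can be sketched conceptually. The XOR “wrap” and HMAC derivation below are illustrative stand-ins, not the algorithms ADLS actually uses:

```python
import hashlib
import hmac
import secrets

# MEK: the root key; in ADLS it lives in Azure Key Vault.
mek = secrets.token_bytes(32)
# DEK: a random data key, stored only in wrapped (encrypted) form.
dek = secrets.token_bytes(32)

# Toy "wrap": in reality the DEK is encrypted under the MEK before storage.
wrapped_dek = bytes(d ^ m for d, m in zip(dek, mek))

def block_key(dek, block_id):
    """Derive a per-block BEK from the DEK and the block's identity.
    HMAC-SHA256 stands in for the real (non-public) derivation."""
    return hmac.new(dek, str(block_id).encode(), hashlib.sha256).digest()

bek0 = block_key(dek, 0)
bek1 = block_key(dek, 1)
# Every block gets its own key, and derivation is deterministic,
# so losing the MEK means the wrapped DEK (and all BEKs) are unrecoverable.
```

This is why losing a customer-managed MEK is unrecoverable: everything below it in the hierarchy depends on it.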
ADLS and Big Data Processing
ADLS can ingest data from anywhere into the data lake in its native format, without requiring any prior transformations, and it eliminates the need to define a schema before loading the data. Because it can store files of arbitrary size and format from on-premises legacy systems and existing cloud stores, ADLS handles unstructured, semi-structured, and structured data alike.
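This schema-on-read approach can be sketched as follows: records land in their native format, and structure is imposed only when the data is read (the field names here are illustrative):

```python
import json

# Raw records landed "as-is": no schema was declared before loading,
# and records need not even share the same fields.
raw_lines = [
    '{"device": "sensor-1", "temp": 21.5}',
    '{"device": "sensor-2", "temp": 19.0, "battery": 0.83}',
]

# Structure is imposed at read time: each consumer picks the fields it needs.
records = [json.loads(line) for line in raw_lines]
temps = {r["device"]: r["temp"] for r in records}
# temps == {"sensor-1": 21.5, "sensor-2": 19.0}
```

A different consumer could project out the `battery` field from the same raw data without any reload or migration.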
Data ingestion is only one of the key stages of big data processing; the other stages include:
- Processing and analyzing data
- Consuming or visualizing data
ADLS can be used alongside other Azure services to help users meet their big data requirements across these stages. The following table maps each one of these tools to the processing stages:
| Azure service | Ingest | Process | Consume |
| --- | --- | --- | --- |
| Azure Data Lake Analytics |  | ✓ |  |
| Azure Data Factory | ✓ | ✓ |  |
| Azure Stream Analytics |  | ✓ |  |
| Azure SQL Data Warehouse |  |  | ✓ |
Pricing
A characteristic that makes ADLS one of the most attractive cloud-based storage solutions is its pay-per-use model with no upfront cost. Users pay only for data at rest, billed per gigabyte stored, plus the number of read and write transactions over that data.
At the time of writing, pay-as-you-go prices can range from $0.039 per GB for the first 100 TB to $0.037 per GB up to 5,000 TB. There is also the option to select monthly commitment packages which provide users with more affordable prices based on storage needs.
- Transaction: incurred every time a user, application, or Azure service reads or writes data, with each operation billed in increments of between 128 KB and 4 MB.
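A sketch of this billing rule, assuming items are billed in 4 MB increments as described:

```python
import math

INCREMENT_MB = 4  # each read/write is billed in increments of up to 4 MB

def transactions_for(item_size_mb):
    """Number of billable transactions for a single item of the given size."""
    return math.ceil(item_size_mb / INCREMENT_MB)

# A 9 MB item is billed as 4 MB + 4 MB + 1 MB = 3 transactions.
# transactions_for(9) == 3, transactions_for(4) == 1
```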
If a user places a 9 MB item in ADLS, Azure breaks it down into three transactions: 4 MB + 4 MB + 1 MB. The monthly cost is calculated from the monthly transaction volume plus the storage used. The following is an example of the cost breakdown for a common use case:
If a user has an application that writes data into ADLS at a rate of 10 items per second, each item being 4 MB, and another service that runs for 4 hours a day and reads 1,000 items per second, then the monthly bill would be:
Note: For this example we will use the following time-period parameters:
- Month = 31 days.
- 1 month ≈ 730 hours (the approximation used for the always-on app).
- 1 hour = 3,600 seconds.
| Item | Transactions / storage | Rate | Monthly cost |
| --- | --- | --- | --- |
| Custom app (writes) | 10 items/second × 3,600 × 730 | $0.05 per 10,000 transactions | $131.40 |
| Reading job | 1,000 items/second × 3,600 × 4 × 31 | $0.004 per 10,000 transactions | $178.56 |
| Storage | 3.4 TB/day × 31 | $0.038 per GB | $4,005.20 |
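The figures above can be reproduced with a few lines of arithmetic, using the rates and volumes from the example (730 hours is the always-on monthly approximation used there, and 1 TB is taken as 1,000 GB):

```python
# Rates from the example above.
WRITE_RATE = 0.05 / 10_000    # $ per write transaction
READ_RATE = 0.004 / 10_000    # $ per read transaction
STORAGE_RATE = 0.038          # $ per GB

write_cost = 10 * 3_600 * 730 * WRITE_RATE      # 10 items/s, ~730 h/month
read_cost = 1_000 * 3_600 * 4 * 31 * READ_RATE  # 1,000 items/s, 4 h/day, 31 days
storage_cost = 3.4 * 1_000 * 31 * STORAGE_RATE  # 3.4 TB/day over 31 days

# Matches the $131.40, $178.56, and $4,005.20 figures in the table above.
```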
Azure Data Lake Storage Gen2
Microsoft recently announced ADLS Gen2, a superset of ADLS Gen1 that adds new analytics-dedicated capabilities built on top of Azure Blob storage.
Described by Microsoft as a “no-compromise data lake,” ADLS Gen2 extends Azure Blob storage capabilities and is optimized for analytics workloads. Users can store data once and access it through both the existing Blob storage interface and HDFS-compliant file system interfaces, with no programming changes or data copying.
ADLS Gen2 includes most of the features from both ADLS Gen1 and Azure Blob storage, including:
- Limitless storage capacity
- Azure Active Directory (AAD) integration
- Hierarchical File System (HFS)
- Read-access geo-redundant storage
- 5 TB file size limit
- Blob tiers (Hot, Cool, Archive)
Multi-Modal Storage Service
Up to this point, users had to choose between an object store (i.e., Azure Blob storage) and a file system (i.e., ADLS Gen1). ADLS Gen2 unifies object storage and file systems, providing simultaneous access to the same data through both models.
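One practical consequence of this unification: renaming a “directory” in a flat object store means rewriting every key under a prefix, while a hierarchical file system re-links a single node. A toy in-memory model (simplified stand-ins, not Azure’s implementation):

```python
# Flat object store: "directories" are only key prefixes.
flat = {"raw/2019/a.csv": b"...", "raw/2019/b.csv": b"..."}

def flat_rename(store, old_prefix, new_prefix):
    """Rename a 'directory' by rewriting every key under the prefix."""
    ops = 0
    for key in list(store):
        if key.startswith(old_prefix):
            store[new_prefix + key[len(old_prefix):]] = store.pop(key)
            ops += 1
    return ops  # one operation per object moved

# Hierarchical namespace: a directory is a single node that can be re-linked.
hier = {"raw": {"2019": {"a.csv": b"...", "b.csv": b"..."}}}

def hier_rename(tree, old_name, new_name):
    tree[new_name] = tree.pop(old_name)
    return 1  # one metadata operation, regardless of directory contents

flat_ops = flat_rename(flat, "raw/", "staged/")
hier_ops = hier_rename(hier, "raw", "staged")
# flat_ops == 2, hier_ops == 1
```

For analytics jobs that rename or move large directories of output files, this difference in operation count matters.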
Who Is ADLS Gen2 for?
Customers using ADLS Gen1, Azure Blob storage, or both can take advantage of ADLS Gen2. Because Gen2 delivers the best of both worlds, current ADLS Gen1 users gain little that is new beyond Blob-storage-related capabilities, so they can remain on Gen1 unless they need those features. The same applies to current Azure Blob storage users: they can remain in their current environment and save on transaction costs.
It is always a best practice to define the storage need before selecting a service. For example, if a user only needs to store images or backup files, the simplicity of Azure Blob storage might be all they need.
Storage costs for ADLS Gen2 are roughly 50% less than ADLS Gen1.
Dremio and ADLS
Dremio unlocks value in your existing ADLS data lake by enabling queries directly against it with a best-in-class data lake engine that’s ideal for business intelligence services and data science workloads, including Power BI.
Deploy Dremio on ADLS
Harness Dremio’s industry-leading query speed, cost-per-query efficiency and simplicity on AWS, Azure or on premises — in just a few clicks. Drive unprecedented time to insight and dramatic cost savings for your data analytics.