
Azure Data Lake Store Explained by Dremio

What is ADLS?

ADLS, short for Azure Data Lake Store, is a fully-managed, elastic, scalable, and secure file system that supports HDFS semantics and works with the Hadoop ecosystem. It provides industry-standard reliability and enterprise-grade security for all data. ADLS offers unlimited storage and is suitable for storing a wide variety of data. It is built for running large-scale analytics systems that need substantial computing capacity to process and analyze large amounts of data. Data stored in ADLS can easily be analyzed using Hadoop frameworks like MapReduce and Hive.
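Because ADLS exposes HDFS-style file semantics, it can be worked with through familiar file-system APIs. As a minimal sketch, here is how a Gen1 store might be accessed with the azure-datalake-store Python SDK; the account name, tenant ID, credentials, and paths below are placeholders, and the exact calls may differ by SDK version:

```python
# Minimal sketch: authenticate and browse an ADLS Gen1 account.
# Requires: pip install azure-datalake-store
from azure.datalake.store import core, lib

# Hypothetical service-principal credentials -- replace with your own.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")

# "mydatalake" is a placeholder store name.
adls = core.AzureDLFileSystem(token, store_name="mydatalake")

# Enumerate files and read one back, much like an HDFS client.
print(adls.ls("/"))
with adls.open("/raw/events/2019-01-01.json", "rb") as f:
    print(f.read(1024))
```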


ADLS Features

Limitless Storage

One of the many characteristics of big data is its variety. ADLS is suitable for storing a great variety of data coming from different sources like devices, applications, and much more. It allows users to store relational and non-relational data. Additionally, it doesn’t require a schema to be defined before any data is loaded into the store.

ADLS can store virtually any amount of data and any number of files. Each ADLS file is sliced into blocks, and these blocks are distributed across multiple data nodes; there is no limit on the number of blocks or data nodes.

In ADLS, the data itself is held within the “Catalog” folder of the data lake store, while the metadata is maintained by Azure Data Lake Analytics. For many users, working with structured data in the data lake is very similar to working with a SQL database.

Furthermore, ADLS allows users to store:

  • Unstructured data: data that does not have a pre-defined data model and is not organized into a particular format or standard, e.g., tweets, texts, etc.
  • Semi-structured data: self-describing structures that do not conform to the formal structure of data models associated with relational databases or other data tables, e.g., JSON, XML.
  • Structured data: data that resides in fixed fields within a record or file; the most common examples are spreadsheets and data contained in a relational database.

Support for Heavy Analytic Workloads

Azure Data Lake Store was created to support analytic workloads that require high throughput, improving performance and reducing latency. To process data up to petabytes in size, ADLS distributes it across multiple nodes, where thousands of mappers and reducers work in parallel to deliver fast results.

High Availability and Reliability

Azure maintains three copies of each data object to ensure availability in the unlikely event of a hardware failure. “Read” transactions can be directed toward any of the three copies of the data object. Microsoft suggests, as a best practice, always enforcing proper access policies on your data as well as creating copies of critical data as part of disaster-mitigation routines.

Security

When implementing a big data solution, security shouldn’t be optional. To conform with security standards and limit the visibility of sensitive information, data must be secured both in transit and at rest. ADLS provides rich security capabilities so users can have peace of mind when storing their assets in the ADLS infrastructure. Users can monitor performance, audit usage, and control access through the integrated Azure Active Directory (AAD).

Auditing

ADLS creates audit logs for all operations performed in it. These logs can be analyzed with U-SQL scripts.
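The logs can also be examined with general-purpose tools instead of U-SQL. Below is a hedged Python sketch that assumes the audit logs have been exported as JSON-lines records; the field names (time, operationName, identity) follow the standard Azure diagnostic-log schema and are assumptions here, so adjust them to your actual export:

```python
# Rough sketch: summarize ADLS audit log records exported as JSON lines.
# Field names follow the common Azure diagnostic-log schema and may differ.
import json
from collections import Counter

operations = Counter()
with open("adls_audit_log.json") as f:
    for line in f:
        record = json.loads(line)
        operations[(record.get("identity"), record.get("operationName"))] += 1

# Show the ten most frequent (identity, operation) pairs.
for (identity, operation), count in operations.most_common(10):
    print(f"{identity}\t{operation}\t{count}")
```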

Access Control

ADLS provides access control through POSIX-compliant ACLs on the files and folders stored in its infrastructure. It also manages authentication through its integration with Azure Active Directory (AAD), based on OAuth tokens from supported identity providers. These tokens carry the user’s security-group information, which is passed through all of the ADLS microservices.

Data Encryption

ADLS encrypts data in transit and at rest, providing server-side encryption of data with keys, including customer-managed keys stored in Azure Key Vault.

Data Encryption Key Types

Azure Data Lake Store uses a Master Encryption Key, stored in Azure Key Vault, to encrypt and decrypt data. Users have the option to manage this key themselves, but there is always the risk of not being able to decrypt the data if the key is lost. ADLS also includes the following keys (a conceptual sketch of the key hierarchy follows the list):

  • Block Encryption Key (BEK): these are keys generated for each block of data.
  • Data Encryption Key (DEK): these keys are encrypted by the Master Encryption Key and are responsible for generating BEKs to encrypt data blocks.
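To make the hierarchy concrete, here is a purely conceptual sketch of envelope encryption in Python using the cryptography library. This is not ADLS’s actual implementation; the algorithms, key handling, and the way BEKs are generated are illustrative assumptions only:

```python
# Conceptual illustration of the MEK -> DEK -> BEK hierarchy (not ADLS internals).
# Requires: pip install cryptography
from cryptography.fernet import Fernet

# Master Encryption Key (MEK): held in a key vault, used only to wrap the DEK.
mek = Fernet(Fernet.generate_key())

# Data Encryption Key (DEK): persisted only in wrapped (encrypted) form.
dek_plain = Fernet.generate_key()
dek_wrapped = mek.encrypt(dek_plain)

# Block Encryption Key (BEK): one per data block, protected by the DEK.
def encrypt_block(block):
    bek = Fernet.generate_key()                   # per-block key
    ciphertext = Fernet(bek).encrypt(block)
    wrapped_bek = Fernet(dek_plain).encrypt(bek)  # BEK wrapped by the DEK
    return wrapped_bek, ciphertext

wrapped_bek, ciphertext = encrypt_block(b"example data block")

# Decryption walks the hierarchy in reverse: MEK -> DEK -> BEK -> block.
dek = mek.decrypt(dek_wrapped)
bek = Fernet(dek).decrypt(wrapped_bek)
print(Fernet(bek).decrypt(ciphertext))
```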

ADLS and Big Data Processing

ADLS can ingest data from anywhere into the data lake in its native format, without requiring any prior transformation, and it eliminates the need for users to define a schema before loading the data. Being able to store files of arbitrary sizes and formats, and to ingest from on-premises legacy systems and existing cloud stores, makes it possible for ADLS to handle structured, semi-structured, and unstructured data.


Data ingestion is only one of the key stages of big data processing; other stages include:

  • Processing
  • Downloading
  • Consuming or visualizing data

ADLS can be used alongside other Azure services to help users meet their big data requirements across these stages. Tools that cover one or more of the ingest, process, download, and visualize stages include:

  • Azure Portal
  • Azure PowerShell
  • Azure Data Lake Analytics
  • Azure CLI
  • Azure Data Factory
  • AdlCopy
  • DistCp
  • Azure Stream Analytics
  • Apache Sqoop
  • Azure SQL Data Warehouse
  • Power BI
  • HDInsight Storm

Cost Model

One of the characteristics that makes ADLS one of the most attractive cloud-based storage solutions is its pay-per-use model with no upfront cost. Users pay only for the gigabytes of data stored (data at rest) and for the number of read and write transactions performed over that data.

At the time of writing, pay-as-you-go prices range from $0.039 per GB for the first 100 TB down to $0.037 per GB for usage up to 5,000 TB. There is also the option to select monthly commitment packages, which provide more affordable prices based on storage needs.

  • Transaction: counted every time a user, application, or Azure service reads or writes data; each transaction covers between 128 KB and 4 MB of data.

If a user places a 9 MB item in ADLS, Azure breaks it down into three transactions: 4 MB + 4 MB + 1 MB.

The monthly cost is calculated based on monthly usage volume (transactions) plus the storage used. An example of the cost breakdown for a common use case would be the following:

If a user has an application that writes data into ADLS at a rate of 10 items per second, each item being 4 MB, and another service that runs for 4 hours a day and reads 1000 items per second, then the monthly bill would be:

Note: For this example we will use the following time period parameters:

  • Month = 31 days.
  • 1 month ≈ 730 hours (used for the always-on custom application).
  • 1 hour = 3600 seconds.
Item          Usage (transactions)                 Price                           Cost
Custom app    10 items/second x 3,600 x 730        $0.05 per 10,000 transactions   $131.40
Reading job   1,000 items/second x 3,600 x 4 x 31  $0.004 per 10,000 transactions  $178.56
Storage       3.4 TB/day x 31 days                 $0.038 per GB                   $4,005.02
                                                   Total                           $4,314.98
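The arithmetic above is easy to script. Here is a small sketch that reproduces the example bill; the rates are taken from the example and may not reflect current Azure pricing:

```python
# Reproduce the example bill above; prices are illustrative, not current Azure rates.
import math

WRITE_PRICE_PER_10K = 0.05    # USD per 10,000 write transactions
READ_PRICE_PER_10K = 0.004    # USD per 10,000 read transactions
STORAGE_PRICE_PER_GB = 0.038  # USD per GB per month

def transactions_for(item_size_mb):
    """Items are billed in 4 MB increments, e.g. a 9 MB item = 3 transactions."""
    return math.ceil(item_size_mb / 4)

# Custom app: 10 items/second, 4 MB each, running the whole month (~730 hours).
write_tx = 10 * transactions_for(4) * 3600 * 730
# Reading job: 1,000 items/second for 4 hours a day, 31 days.
read_tx = 1000 * transactions_for(4) * 3600 * 4 * 31
# Storage: roughly 3.4 TB written per day, accumulated over 31 days.
storage_gb = 3.4 * 1000 * 31

write_cost = write_tx / 10_000 * WRITE_PRICE_PER_10K    # ~$131.40
read_cost = read_tx / 10_000 * READ_PRICE_PER_10K       # ~$178.56
storage_cost = storage_gb * STORAGE_PRICE_PER_GB        # ~$4,005

print(f"Total: ${write_cost + read_cost + storage_cost:,.2f}")
```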

Azure Data Lake Store Gen2

Microsoft recently announced Azure Data Lake Store Gen2, which can be seen as a superset of ADLS Gen1 that adds new analytics-focused capabilities built on top of Azure Blob Storage.

Described by Microsoft as a “no-compromise data lake,” ADLS Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads. Users can store data once and access it both through the existing Blob Storage interface and through HDFS-compliant file system interfaces, with no programming changes or data copying.
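As a rough illustration of the file-system-style access, here is how a Gen2 account might be used through the azure-storage-file-datalake Python SDK; the account name, credential, file system, and paths are placeholders, and exact client methods can vary by SDK version:

```python
# Minimal sketch: hierarchical (HDFS-style) access to an ADLS Gen2 account.
# Requires: pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and key -- substitute your own credentials.
service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key>",
)

# File systems map to blob containers; directories and files are hierarchical.
fs = service.get_file_system_client("analytics")
directory = fs.get_directory_client("raw/events")
directory.create_directory()

# Upload a small file, then list the directory contents.
file_client = directory.get_file_client("sample.json")
file_client.upload_data(b'{"event": "example"}', overwrite=True)

for path in fs.get_paths(path="raw/events"):
    print(path.name)
```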

ADLS Gen2 includes most of the features of both ADLS Gen1 and Azure Blob Storage, including:

  • Limitless storage capacity
  • Azure Active Directory (AAD) integration
  • Hierarchical File System (HFS)
  • Read-access geo-redundant storage
  • 5 TB file size limit
  • Blob tiers (Hot, Cool, Archive)

Multi-Modal Storage Service

Up to this point, users had to choose between an object store (i.e., Azure Blob Storage) and a file system (i.e., ADLS Gen1). ADLS Gen2 unifies object storage and file systems to provide simultaneous access over the same data.


Who is ADLS Gen2 for?

Customers who are using ADLS Gen1, Azure Blob Storage, or both. Since ADLS Gen2 delivers the best of both worlds, current ADLS Gen1 users won’t gain new features unless they need capabilities associated with Blob Storage, so they can remain on ADLS Gen1. The same applies to current Azure Blob Storage users: they can remain in their current environments and save on transaction costs.

It is always a best practice to define the storage need before selecting a service; for example, if a user only needs to store images or backup files, the simplicity of Azure Blob Storage might be all they need.

Storage costs for ADLS Gen2 are roughly 50% less than ADLS Gen1.

Dremio and ADLS

Dremio connects to data lakes like ADLS, Amazon S3, HDFS and more, putting all of your data in one place and providing it structure. We provide an integrated, self-service interface for data lakes, designed for BI users and data scientists. Dremio increases the productivity of these users by allowing them to easily search, curate, accelerate, and share datasets with other users. In addition, Dremio allows companies to run their BI workloads from their data lake infrastructure, removing the need to build cubes or BI extracts.

Here’s how Dremio helps you leverage your data lake:

Data Acquisition

With Dremio, you don’t need to worry about the schema and structure of the data that you put in your data lake. Dremio takes data from whatever kind of source (relational or NoSQL) and converts it into a SQL-friendly format without making extra copies. You can then curate, prepare, and transform your data using Dremio’s intuitive user interface, making it ready for analysis.

Data Curation

Dremio makes it easy for your data engineers to curate data for the specific needs of different teams and different jobs, without making copies of the data. By managing data curation in a virtual context, Dremio makes it fast, easy, and cost effective to design customized virtual datasets that filter, transform, join, and aggregate data from different sources. Virtual datasets are defined with standard SQL, so they fit into the skills and tools already in use by your data engineering teams.

Optimization and Governance

In order to scale these results across your enterprise, Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the Data Graph that indexes all of your metadata, allowing business users to easily make sense of the data in the data lake. Everything created by users (spaces, directories, and virtual datasets) makes up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets, and all your queries are also maintained in the Data Graph, creating a data lineage that allows you to govern and maintain your data.

Analytics Consumption

At its core, Dremio makes your data self-service, allowing any data consumer at your company to find the answers to your most important business questions in your data lake, whether you’re a business analyst who uses Tableau, Power BI, or Qlik, or a data scientist working in R or Python. Through the user interface, Dremio also allows you to share and curate virtual datasets without making extra copies, optimizing storage and supporting collaboration across teams. Lastly, Dremio accelerates your BI tools and ad-hoc queries with reflections, and integrates with all your favorite BI and data science tools, allowing you to leverage the tools you already know on your data lake.

Data-as-Service Platform for Azure

Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data catalog, and data lineage, all on any source, and delivered as a self-service platform.

Run SQL on any data source. Includes optimized push-downs and parallel connectivity to non-relational systems like Elasticsearch, S3, and HDFS.

Accelerate data. Using Data Reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.

Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.

Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL systems, S3, HDFS, and more.

Data Lineage. Full visibility into data lineage, from data sources, through transformations, joins with other data sources, and sharing with other users.

Visit our tutorials and resources to learn more about how you can gain insights from your data stored in ADLS, faster, using Dremio.