Microsoft Azure Storage Explained by Dremio
Azure Storage is a Microsoft-managed cloud service that provides storage that is highly available, secure, durable, scalable and redundant. Whether it is images, audio, video, logs, configuration files, or sensor data from an IoT array, data needs to be stored in a way that makes it easily accessible for analysis, and Azure Storage provides options for each of these use cases.
Within Azure there are two types of storage accounts, four types of storage, four levels of data redundancy and three tiers for storing files. We will explore each of these options in detail to help you understand which offering best fits your big data storage needs.
Azure Storage Account
An Azure storage account is the access point to all the elements that make up the Azure storage realm. Once the user creates the storage account, they can select the level of resilience needed and Azure will take care of the rest. A single storage account can store up to 500 TB of data and, like any other Azure service, it lets users take advantage of the pay-per-use pricing model.
There are two different storage account types. With the “Standard” storage account, users get access to Blob Storage, Table Storage, Queue Storage, and File Storage. The alternative, the “Premium” account, is the most recent storage option and provides users with data storage on SSD drives for better IO performance; this option supports only Page Blobs.
Azure Blob Storage
Blob storage is Microsoft Azure’s service for storing Binary Large Objects, or BLOBs, which are typically composed of unstructured data such as text, images, and videos, along with their metadata. Blobs are stored in directory-like structures called ‘Containers’.
The blob service includes:
- Blobs, which are the data objects of any type.
- Containers, which wrap multiple blobs together.
- Azure Storage Account, which contains all of your Azure storage data objects.
Although Blob storage can hold any large binary object, Azure offers three blob types, each optimized for a different storage scenario:
- Block blobs: intended to store discrete objects such as images, log files and more. A block blob can store up to ~5 TB of data, that is, 50,000 blocks of up to 100 MB each.
- Page blobs: optimized for random read and write operations; they can grow up to 8 TB in size. Within the page blob category, Azure offers two types of storage, standard and premium, the latter being ideal for VM storage disks (including the operating system disk).
- Append blobs: optimized for append scenarios such as log storage. Append blobs are composed of blocks of varying sizes, up to a maximum of 4 MB each. Each append blob can hold up to 50,000 blocks, which allows it to grow to roughly 200 GB.
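The size limits quoted above follow directly from the block counts and block sizes. The short sketch below reproduces that arithmetic; it assumes decimal units (1 MB = 10^6 bytes), and the helper names are illustrative rather than part of any Azure SDK.

```python
# Limits as quoted in the text; decimal units are an assumption for illustration.
MB = 10**6
GB = 10**9
TB = 10**12

MAX_BLOCKS = 50_000                 # maximum blocks per block blob or append blob
BLOCK_BLOB_BLOCK_SIZE = 100 * MB    # maximum size of one block in a block blob
APPEND_BLOB_BLOCK_SIZE = 4 * MB     # maximum size of one block in an append blob


def max_blob_size(block_size: int, max_blocks: int = MAX_BLOCKS) -> int:
    """Largest blob, in bytes, that max_blocks blocks of block_size can hold."""
    return block_size * max_blocks


def blocks_needed(payload_bytes: int, block_size: int) -> int:
    """Number of blocks required to store payload_bytes (ceiling division)."""
    return -(-payload_bytes // block_size)


block_blob_max = max_blob_size(BLOCK_BLOB_BLOCK_SIZE)
append_blob_max = max_blob_size(APPEND_BLOB_BLOCK_SIZE)

print(block_blob_max / TB)    # 5.0
print(append_blob_max / GB)   # 200.0
print(blocks_needed(250 * MB, BLOCK_BLOB_BLOCK_SIZE))  # 3
```

A 250 MB file, for example, would occupy three 100 MB blocks of a block blob, well under the 50,000-block ceiling.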
Blob storage accounts offer three access tiers, which are selected when the storage account is created.
- Hot Access Tier: Of the three options, the hot access tier is optimized for data that is accessed frequently. It offers the lowest access (read-write) cost, but the highest storage cost.
- Cool Access Tier: This option is better suited for use cases where data will remain stored for at least 30 days and is not accessed frequently. Compared to the Hot Access Tier, this tier offers lower storage costs and higher access costs.
- Archive Access Tier: Archive storage is designed for data that doesn’t need to be accessed immediately. This tier offers higher data retrieval costs, and also higher data access latency. It is designed for use cases where data will be stored for more than 180 days and is rarely accessed.
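As a rough illustration of how the three tiers trade storage cost against access cost, the sketch below maps an expected retention period and access pattern to a tier. The 30-day and 180-day thresholds come from the descriptions above; the function name, its parameters, and the decision rule itself are simplifying assumptions, not an Azure API or an official recommendation.

```python
def suggest_access_tier(
    retention_days: int,
    accessed_frequently: bool,
    tolerates_slow_retrieval: bool = False,
) -> str:
    """Suggest an access tier using the thresholds described in the text.

    This is a simplified, hypothetical rule of thumb: real decisions should
    also weigh current pricing and retrieval requirements.
    """
    if accessed_frequently:
        return "hot"        # lowest access cost, highest storage cost
    if retention_days > 180 and tolerates_slow_retrieval:
        return "archive"    # rarely accessed, higher latency is acceptable
    if retention_days >= 30:
        return "cool"       # infrequent access, stored for at least 30 days
    return "hot"            # short-lived data stays in the default tier


print(suggest_access_tier(7, accessed_frequently=True))    # hot
print(suggest_access_tier(90, accessed_frequently=False))  # cool
print(suggest_access_tier(365, accessed_frequently=False,
                          tolerates_slow_retrieval=True))  # archive
```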
Why Use Blob Storage?
Much of what data consumers do with storage is focused on dealing with unstructured data such as logs, files, images, videos, etc. Using Azure’s blob storage is a way to overcome the challenge of having to deploy different database systems for different types of data. Blob storage provides users with strong data consistency, storage and access flexibility that adapts to the user’s needs, and it also provides high availability by implementing geo-replication.
Azure Table Storage
Azure Table Storage is a scalable, NoSQL, key-value data storage system that can be used to store large amounts of data in the cloud. This storage offering has a schemaless design, and each table has rows that are composed of key-value pairs. Microsoft describes it as an ideal solution for storing structured, non-relational data, covering use cases that range from storing terabytes of structured data that serves web applications, to storing datasets that do not require complex joins or foreign keys, to accessing data using .NET libraries.
Azure Table Storage Components
Table Storage includes:
- A Storage Account, which contains all your tables.
- Tables, which are composed of collections of “entities”.
- Entities, which are sets of properties, similar to database rows. An entity can grow up to 1 MB in size.
- Properties: the most granular element in the list. Properties are composed of name-value pairs. An entity can hold up to 252 properties to store data, plus three system properties that define its partition key, row key and timestamp.
Because Azure Storage Tables are presented in a tabular format, they can easily be confused with RDBMS tables. However, Azure Tables have no notion of columns, constraints, or 1:1 or 1:* relationships or any of their variations.
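To make the entity limits concrete, here is a minimal sketch that checks a candidate entity against the constraints listed above: the partition and row keys must be present, at most 252 custom properties are allowed, and the total size must stay within 1 MB. The `validate_entity` helper is hypothetical, and the size check uses the JSON-encoded length as a rough stand-in for the size the service would actually compute.

```python
import json

MAX_CUSTOM_PROPERTIES = 252            # per the text; excludes system properties
MAX_ENTITY_BYTES = 1024 * 1024         # 1 MB, assumed here to mean 2**20 bytes
SYSTEM_PROPERTIES = ("PartitionKey", "RowKey", "Timestamp")


def validate_entity(entity: dict) -> list[str]:
    """Return a list of problems found; an empty list means the entity fits."""
    problems = []

    # Timestamp is maintained by the service, so only the two keys are required.
    for key in ("PartitionKey", "RowKey"):
        if key not in entity:
            problems.append(f"missing system property: {key}")

    custom = [name for name in entity if name not in SYSTEM_PROPERTIES]
    if len(custom) > MAX_CUSTOM_PROPERTIES:
        problems.append(
            f"too many custom properties: {len(custom)} (limit {MAX_CUSTOM_PROPERTIES})"
        )

    # Approximation only: JSON length is not the service's own size accounting.
    approx_size = len(json.dumps(entity).encode("utf-8"))
    if approx_size > MAX_ENTITY_BYTES:
        problems.append(f"entity too large: ~{approx_size} bytes")

    return problems


reading = {"PartitionKey": "sensors", "RowKey": "0001", "temperature": 21.5}
print(validate_entity(reading))  # []
```

Because the design is schemaless, two entities in the same table may carry entirely different custom properties; only the system properties are common to all of them.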
Azure Table Storage Vs. Azure SQL Database
These two technologies, while superficially similar, are designed to tackle very different use cases. One of the main differences between the two is their capacity. Azure Tables can have rows of up to 1 MB in size with no more than 255 properties, including the three system properties: partition key, row key and timestamp. This means that the combined size of all 255 properties can’t exceed 1 MB.
On the other hand, Azure SQL databases can have rows up to 2 GB in size. Naturally, this might make the user think that Azure SQL databases are a no-brainer when it comes to storing large amounts of data. However, Azure SQL databases can scale up to 150 GB only, while the maximum data size for Azure tables is 200 TB per table.
Why Use Azure Table Storage?
Azure Table Storage enables users to build cloud applications easily without worrying about schema lock-downs. Developers should consider using Azure Table Storage when they want to store data in the range of multiple terabytes while keeping storage costs down, provided the data stored does not depend on complex server-side joins or other logic. Additional use cases include disaster recovery scenarios and storing data up to 500 TB without the need to implement sharding logic.
Azure Storage Queues
Queues have been around for a long time. Their simple FIFO (first in, first out) architecture makes them a versatile solution for storing messages that do not need to be in a certain order. In simple terms, Azure Queue Storage is a service that allows users to enqueue high volumes of messages, process them asynchronously and consume them when needed, while keeping costs down by leveraging a pay-per-use pricing model.
Azure Storage Queues Components
Azure Storage Queues are composed of the following elements:
- Storage Account, which contains all your storage services.
- Queue, composed of a set of messages.
- Message: A message can include any kind of information. For example, a message could be a text message that is supposed to trigger an event on an app, or information about an event that has happened on a website. A message, in any format, can only be up to 64 KB in size and the maximum time that a message can remain in a queue is 7 days. However, a single queue can hold up to 200 TB worth of messages. Messages can be text strings or arrays of bytes containing any kind of information in formats such as XML, CSV, etc.
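The message constraints above (at most 64 KB per message, at most 7 days in the queue) can be modelled with a small in-memory sketch. The `SketchQueue` class below is purely illustrative and is not the Azure Queue Storage client; it assumes 64 KB means 64 * 1024 bytes and takes the current time as an explicit argument so the behaviour is easy to follow.

```python
from collections import deque
from dataclasses import dataclass

MAX_MESSAGE_BYTES = 64 * 1024          # 64 KB per message, per the text
MAX_TTL_SECONDS = 7 * 24 * 60 * 60     # 7 days, per the text


@dataclass
class Message:
    body: bytes
    enqueued_at: float  # seconds on an arbitrary clock


class SketchQueue:
    """Hypothetical in-memory stand-in that enforces the two limits above."""

    def __init__(self) -> None:
        self._messages: deque = deque()

    def put(self, body, now: float) -> None:
        # Messages may be text strings or raw bytes.
        data = body.encode("utf-8") if isinstance(body, str) else body
        if len(data) > MAX_MESSAGE_BYTES:
            raise ValueError(f"message is {len(data)} bytes; limit is {MAX_MESSAGE_BYTES}")
        self._messages.append(Message(data, now))

    def get(self, now: float):
        # Skip over messages that have outlived the 7-day limit.
        while self._messages:
            message = self._messages.popleft()
            if now - message.enqueued_at <= MAX_TTL_SECONDS:
                return message.body
        return None


q = SketchQueue()
q.put("<order id='1'/>", now=0.0)
print(q.get(now=60.0))  # b"<order id='1'/>"
```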
Why Use Azure Storage Queues?
Queues reduce the possibility of data loss due to timeouts on the data store or long running processes; a good example of this scenario is a shopping cart or a forum where a user can place an “order” in the shape of a purchase or a message on a message board. A reader will then take care of ingesting or “de-queuing” the message while giving the user control back so they can continue navigating the site.
Queues allow users to accept all information that comes in and then deal with it at the pace of the application. Going back to the shopping cart scenario, imagine a situation where a user places over 50 items in a shopping cart and is ready to check out. Once the user checks out, if a queue is not in place, the order information would have to be processed and stored in the database immediately, and as you can imagine, during peak times this could create a bottleneck and bring down the entire system. Queues provide a fault tolerance mechanism where all these orders can be stored for a limited amount of time and then processed and executed as the system has bandwidth to do so. This way, each element in the queue is guaranteed to receive attention.
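The decoupling described in the shopping cart scenario can be sketched with Python's standard `queue` and `threading` modules. This is a local, in-process analogy rather than Azure Queue Storage itself: the front end enqueues 50 items and returns immediately, while a single background worker drains the queue at its own pace. The function and variable names are made up for the example.

```python
import queue
import threading


def process_orders(orders: queue.Queue, processed: list) -> None:
    """Worker: take orders off the queue one at a time until told to stop."""
    while True:
        order = orders.get()
        if order is None:          # sentinel value signals "no more orders"
            orders.task_done()
            break
        processed.append(f"stored:{order}")   # stand-in for a database write
        orders.task_done()


orders: queue.Queue = queue.Queue()
processed: list = []

worker = threading.Thread(target=process_orders, args=(orders, processed))
worker.start()

# The "checkout" only enqueues; it does not wait for each item to be stored.
for item_number in range(50):
    orders.put(f"item-{item_number}")
orders.put(None)

worker.join()
print(len(processed))  # 50
```

With a single worker, items are stored in the order they were enqueued, and none is dropped even if the worker is slower than the producer.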
Azure Files is a shared network file storage service that gives administrators a way to access native SMB file shares in the cloud. These shares, like the rest of the Azure storage offerings, can be set up as part of the Azure storage account. The Azure File service provides a way for applications running on cloud virtual machines to share files among themselves by using standard file system APIs such as WriteFile or ReadFile.
Why Use Azure File System?
There are many different scenarios in which you might want to use the Azure File system (AFS):
- If you have an on-premise environment that requires a file share and need to lift and shift it to the cloud, AFS provides an easy way to set up shared files among cloud VMs. The Azure File system allows users to set up a shared drive without the need to create a dedicated VM to handle the file share workload.
- Azure File system can also be used to simplify cloud development: it can be set up as a shared resource so that developers and sysadmins have a central share when installing tools and applications.
- It can serve as the central location for config files and monitoring logs.
Benefits of Azure Files
- Easy to manage: To deploy a file share, all users need to do is navigate to their storage account and create a new file share. Within minutes the user will have a fully functional file share up and running.
- Secure storage: Azure File storage encrypts data at rest and in transit, using Server Message Block 3 (SMB 3) and HTTPS.
- Cross-platform support: Azure File uses the SMB protocol, which is natively supported by many OS APIs, libraries and tools.
- Highly scalable: Users can store up to 5 TB of data, or up to 100 TB if they configure the share in the premium tier.
- Hybrid access: Azure File Sync allows users to access data anywhere through SMB and REST protocols. This service provides a way to extend file shares to on-premise deployments by creating a local cache of the files, providing local access through protocols such as NFS, SMB, FTPS and more. This type of synchronization gives users highly available access to their files and also the opportunity to implement enterprise-grade security controls such as ACLs.
Dremio and Azure Storage
Dremio connects to data lakes like ADLS, Amazon S3, HDFS and more, putting all of your data in one place and providing it structure. We provide an integrated, self-service interface for data lakes, designed for BI users and data scientists. Dremio increases the productivity of these users by allowing them to easily search, curate, accelerate, and share datasets with other users. In addition, Dremio allows companies to run their BI workloads from their data lake infrastructure, removing the need to build cubes or BI extracts.
Here’s how Dremio helps you leverage your data lake:
With Dremio, you don’t need to worry about the schema and structure of the data that you put in your data lake. Dremio takes data from whatever kind of source (relational or NoSQL) and converts it into a SQL-friendly format without making extra copies. You can then curate, prepare, and transform your data using Dremio’s intuitive user interface, making it ready for analysis.
Dremio makes it easy for your data engineers to curate data for the specific needs of different teams and different jobs, without making copies of the data. By managing data curation in a virtual context, Dremio makes it fast, easy, and cost effective to design customized virtual datasets that filter, transform, join, and aggregate data from different sources. Virtual datasets are defined with standard SQL, so they fit into the skills and tools already in use by your data engineering teams.
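Conceptually, a virtual dataset defined in SQL behaves much like a database view: it stores a query, not a copy of the rows. The sketch below illustrates that idea with Python's built-in `sqlite3` module as a stand-in. It does not use Dremio or its API, and the table names, view name, and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two hypothetical source tables.
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 75.0), (3, 11, 40.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');

    -- The view filters, joins, and aggregates without copying any rows.
    CREATE VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE o.amount >= 30.0
    GROUP BY c.region;
    """
)

rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('APAC', 40.0), ('EMEA', 75.0)]
```

Consumers query `revenue_by_region` like any other table, while the underlying data stays in one place.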
Optimization and Governance
In order to scale these results across your enterprise, Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the Data Graph that indexes all of your metadata, allowing business users to easily make sense of the data in the data lake. Anything created by users (spaces, directories, and virtual datasets) makes up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets, and all your queries are also maintained in the Data Graph, creating a data lineage that allows you to govern and maintain your data.
At its core, Dremio makes your data self-service, allowing any data consumer at your company to find the answers to your most important business questions in your data lake, whether you’re a business analyst who uses Tableau, Power BI, or Qlik, or a data scientist working in R or Python. Through the user interface, Dremio also allows you to share and curate virtual datasets without making extra copies, optimizing storage and supporting collaboration across teams. Lastly, Dremio accelerates your BI tools and ad-hoc queries with reflections, and integrates with all your favorite BI and data science tools, allowing you to leverage the tools you already know how to use on your data lake.
Data-as-Service Platform for Azure
Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, Dremio incorporates capabilities for data acceleration, data curation, data catalog, and data lineage, all on any source, and delivered as a self-service platform.
- Run SQL on any data source, including optimized push downs and parallel connectivity to non-relational systems like Elasticsearch, S3 and HDFS.
- Accelerate data. Data Reflections are a highly optimized representation of source data, managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing and as Apache Parquet for persistence.
- Integrated data curation. Easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.
- Cross-data source joins. Execute high-performance joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS, and more.
- Data lineage. Full visibility into data lineage, from data sources, through transformations, joining with other data sources, and sharing with other users.
Visit our tutorials and resources to learn more about how you can gain insights from your data, faster, using Dremio.