Cloud Data Lakes

A cloud data lake is a cloud-hosted centralized repository that allows you to store all your structured and unstructured data at any scale, typically using an object store such as Amazon S3 or Microsoft Azure Data Lake Storage (ADLS). Its placement in the cloud means it can be interacted with as needed, whether it’s for processing, analytics or reporting. Cloud data lakes can be used to store an organization’s data, including data generated from internal and external actions and interactions.

The term data lake is traditionally associated with Apache Hadoop-oriented storage. In such a scenario, an organization’s data is loaded into the Hadoop platform and then analyzed where it resides, on the commodity computers that make up Hadoop’s cluster nodes. While traditional data lakes have been built on on-premises HDFS clusters, the current trend is to move data lakes to the cloud and run them on infrastructure-as-a-service.

A data lake can include structured data from relational databases (rows and columns), semi-structured data such as CSV, JSON and more, unstructured data (documents, etc.) and binary data such as images or video. The primary utility of this shared data storage is in providing a unified source for all data in an organization. Each of these data types can then be collectively transformed, analyzed and more.

How Do Cloud Data Lakes Work?

The data journey is designed to take advantage of the separation of compute and storage, so that each individual element can scale when necessary, without slowing down the other. Auto-scaling is the key benefit of putting a data lake in the cloud. Additionally, because of its centralized location, cloud data lake infrastructure provides self-service access to users and developers, compared to on-premises solutions, which silo information.

The Data Journey

To understand the structure and logic of cloud data lakes, it helps to follow the path that data takes, from ingestion through to analytics and reporting.

1. Ingestion: The first step in the data journey, ingestion involves the uptake of structured and unstructured data. Data is collected and collated from multiple sources and transferred into the data lake in its original format. A major benefit of data lakes is the fact that scaling can occur without the need to reconsider schemas, transformations or data structures (as you would need to do with a traditional data warehouse). Despite the ease of transfer and storage, organizations usually maintain multiple, separate data lakes to avoid any issues with data privacy or internal access privileges.

2. Storage: The second step in the data journey, storage is the controlled repository for all ingested data prior to any transformations — all data can maintain its original state, whether it’s structured or unstructured. This simplified storage system allows businesses to collect and consider endless amounts of data, using the major cloud object stores (ADLS, S3, Google Cloud Storage), and provides high availability, auto-scaling, affordability and security.

3. Processing: In the third step in the data journey, data is converted from its raw state into a form that can be combined across data types, which allows for combinatory analysis through aggregation, joins and more. Once the data has been processed, it’s returned to the data lake, where it can be analyzed.

4. Analytics: During the final step in the data journey, the stored and processed data is made available for self-service analysis by data scientists, BI users, and more, which is ultimately the end goal for any organization.
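
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local directory stands in for a cloud object store such as S3 or ADLS, and the file names, the `raw` and `processed` folders and the sample records are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest(lake: Path, name: str, payload: str) -> Path:
    """Ingestion and storage: land the payload unchanged in a raw zone."""
    target = lake / "raw" / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def process(lake: Path) -> Path:
    """Processing: join raw CSV and JSON data, then write the result back."""
    with (lake / "raw" / "orders.csv").open(newline="", encoding="utf-8") as f:
        orders = list(csv.DictReader(f))
    customers = json.loads(
        (lake / "raw" / "customers.json").read_text(encoding="utf-8")
    )
    names = {c["id"]: c["name"] for c in customers}

    # Aggregate order amounts per customer name.
    totals = {}
    for order in orders:
        customer = names[int(order["customer_id"])]
        totals[customer] = totals.get(customer, 0.0) + float(order["amount"])

    target = lake / "processed" / "revenue_by_customer.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(totals, sort_keys=True), encoding="utf-8")
    return target


def analyze(lake: Path) -> str:
    """Analytics: answer a question from the processed data set."""
    path = lake / "processed" / "revenue_by_customer.json"
    totals = json.loads(path.read_text(encoding="utf-8"))
    return max(totals, key=totals.get)


def run_journey() -> dict:
    """Run ingestion, storage, processing and analytics end to end."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)  # hypothetical stand-in for an object store bucket
        ingest(lake, "orders.csv", "customer_id,amount\n1,10.5\n2,4.0\n1,2.5\n")
        ingest(
            lake,
            "customers.json",
            '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bo"}]',
        )
        process(lake)
        return {
            "top_customer": analyze(lake),
            "raw_files": sorted(p.name for p in (lake / "raw").iterdir()),
        }
```

Note that the raw files are never modified: processing writes its output to a separate location, so the original data remains available for other transformations later.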

Cloud Data Lake Platforms

The most popular cloud providers, Microsoft, Amazon and Google, all offer cloud data lake solutions.

Microsoft Azure Cloud

Microsoft’s data lake offering, Azure Data Lake Storage (ADLS), is a hyper-scale repository for cloud storage. Compatible with the Hadoop Distributed File System (HDFS), ADLS is capable of managing trillions of files, and can even store petabyte-sized files. With high availability, ADLS was built with the express purpose of running and maintaining large-scale data analytics in the cloud.

Amazon Web Services

Amazon Web Services (AWS) offers a number of data lake solutions, the most popular of which is Amazon Simple Storage Service (Amazon S3). S3 is a highly scalable, industry-standard object store, capable of storing all data types. It is designed to be both secure and durable, and its standardized APIs allow for the use of external analytics tools.

Google Cloud Services

While its offerings aren’t as established as Microsoft’s or Amazon’s, Google does provide its own cloud data lake offering. Google Cloud Storage (GCS) is a lower-cost cloud data lake, which provides user access to Google’s own suite of ingestion, processing and analytics tools.

Cloud Data Lake Comparison Table

Cloud Service | Ingestion | Storage | Processing | Analytics
Microsoft Azure | Azure Data Factory, Azure Stream Analytics, Apache Sqoop, Azure PowerShell, Azure Portal, AdlCopy, DistCp | ADLS, Blob storage, ADLS Gen2 | HDInsight, Azure SQL Data Warehouse | HDInsight Storm, Data Lake Analytics
Amazon Web Services | Amazon Kinesis, Amazon Snowball, Amazon Storage Gateway | Amazon S3 | AWS Glue, Amazon Glacier | Amazon Athena, Amazon EMR, Amazon Redshift
Google Cloud Platform | Cloud Pub/Sub, Cloud Dataflow, Storage Transfer Service | Google Cloud Storage | Cloud Datalab, Cloud Dataprep | BigQuery, Cloud Dataproc, Cloud Bigtable

Types of Data Lakes

Data lakes can be built either in the cloud or on premises, with the trend currently pointing to placing them in the cloud because of the power and capacity that can be leveraged.

For organizations that already maintain an on-premises data lake but are considering a transition to a cloud-based solution, the migration process can be daunting. They need to determine how they can transfer vast quantities of data and adapt customized technology for a universal cloud provider. The first step is determining what cloud data lake architecture works best for their needs.

  • On-Premises: Data lakes maintained on premises are different from their cloud counterparts. They require the combined management of both hardware and software. This double duty requires greater engineering resources and expertise, and it also locks companies into a static scaling solution, where they have to maintain capacity overhead in order to avoid any downtime as they expand storage.

  • Hybrid Data Lake: Maintaining both on-premises and cloud data lakes concurrently introduces its own benefits and challenges. Managing an on-premises operation requires additional engineering expertise, as does constantly migrating data between on premises and the cloud. On the other hand, this two-pronged approach does allow companies to maintain less relevant data on premises, while placing more important data in the cloud, thereby benefiting from the speed of cloud services.

  • Cloud Data Lake: The major benefits of maintaining a standard cloud data lake are availability, speed and lower engineering and IT costs. This option allows businesses to operate swiftly, without having to measure every decision against in-house expertise. The downside is that cloud services are paid for on a subscription model, which over time can cost more than the “buy once” model of local storage.

  • Multi-Cloud Data Lake: The final type of data lake is where multiple cloud offerings are combined, for example, where businesses use both AWS and Azure to manage and maintain the data lakes. Maintaining multiple data lakes means benefiting from the advantages of each platform, but it also requires greater expertise to enable disparate platforms to communicate with one another.

The Benefits of Building Data Lakes in the Cloud

Moving data storage to the cloud has become feasible for companies of all sizes. The centralized functionality and ability to scale allow for greater operational simplicity, more immediate data-driven insights and more.

Benefits include:

  • Capacity: With cloud storage, you can start with a few small files and grow your data lake to exabytes in size, without the worries associated with expanding storage and data maintenance internally. This gives your engineers the freedom to focus on more important things.

  • Cost efficiency: Cloud storage providers allow for multiple storage classes and pricing options. This helps companies to pay for exactly as much storage as they need, instead of planning for an assumed cost and capacity as is needed when building a data lake locally.

  • Central repository: A centralized location for all object stores and data access means the setup is the same for every team in an organization. This reduces operational complexity and frees up time for engineers to focus on more pressing matters.

  • Data Security: All organizations have a responsibility to protect their data. With data lakes designed to store all types of data, including sensitive information like financial records or customer details, security becomes even more important. Under the shared responsibility model, cloud providers secure the underlying infrastructure, while the organization remains responsible for securing its own data and access controls.

  • Auto-scaling: Modern cloud services are designed to provide immediate scaling functionality, so businesses don’t have to worry about expanding capacity when necessary, or paying for hardware that they don’t need.

The Challenges of Data Lakes in the Cloud

The migration of data and infrastructure to the cloud has been a long time coming, and reduces many operational costs for businesses. However, that doesn’t mean that it’s a perfect solution:

  • Migration: The biggest challenge for cloud data lakes is actually getting data into the cloud — the migration process can be incredibly daunting. It’s not only complex, but can also be expensive, especially when it occurs repeatedly.

  • Data management: One of the benefits of a data lake can also be a challenge — data management. Because data lakes are capable of supporting all types of data (structured, unstructured, etc.), the management and cleanliness of data lakes can be an intensive process. When things get out of hand, data swamps can occur. A data swamp, full of poorly formed data, holds very little value to a business, and requires a lot of effort to fix.

  • Storage costs: While on-premises storage can carry a high up-front cost, cloud providers bill for storage on a recurring basis, for as long as the data is stored. This means that costs accumulate over time, and businesses are forced to weigh existing engineering and IT costs against the “rental” of cloud services.

  • Self-service analytics: The main benefit of setting up a data lake in the first place is analytics. The ability to combine, transform and organize disparate data sources together is a huge benefit, but it requires an equally robust analytics solution. While most cloud providers offer analytics solutions, the ability to effectively utilize and hook into these analytics platforms isn’t always easy.
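
The storage cost trade-off in the list above comes down to simple arithmetic: a recurring fee versus an up-front purchase plus upkeep. The sketch below finds the month in which cumulative cloud spending first exceeds the on-premises total. All figures and function names are hypothetical, and the model deliberately ignores real-world factors such as storage classes, data growth, transfer fees and staffing.

```python
import math
from typing import Optional


def cumulative_cloud_cost(monthly_fee: float, months: int) -> float:
    """Total paid to a cloud provider after the given number of months."""
    return monthly_fee * months


def cumulative_on_prem_cost(upfront: float, monthly_upkeep: float, months: int) -> float:
    """Up-front hardware purchase plus recurring upkeep."""
    return upfront + monthly_upkeep * months


def break_even_month(
    monthly_fee: float, upfront: float, monthly_upkeep: float
) -> Optional[int]:
    """First whole month in which cloud spending exceeds on-premises spending.

    Returns None when the cloud fee never overtakes the on-premises upkeep.
    """
    if monthly_fee <= monthly_upkeep:
        return None
    # cloud > on-prem  <=>  months > upfront / (monthly_fee - monthly_upkeep)
    return math.floor(upfront / (monthly_fee - monthly_upkeep)) + 1


# Hypothetical figures: 2,000 per month in the cloud versus
# 60,000 up front plus 500 per month of upkeep on premises.
month = break_even_month(monthly_fee=2000, upfront=60000, monthly_upkeep=500)
```

With these made-up numbers the two totals are equal at month 40, so `month` is 41. Changing any input shifts the break-even point, which is why the comparison has to be made per organization rather than in general.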

Why Dremio?

Dremio provides an integrated, self-service interface for data lakes. Designed for BI users and data scientists, the data lake query engine incorporates capabilities for data acceleration, data curation and data lineage — all on any source and delivered as a self-service platform. Standout features include:

  • SQL on any data source including optimized pushdowns and parallel connectivity to non-relational systems like S3 and HDFS.

  • Accelerated data queries using data reflections, a highly optimized representation of source data that is managed as columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence.

  • Integrated data curation that is easy for business users, yet sufficiently powerful for data engineers, and fully integrated into Dremio.

  • Cross-data source joins across multiple disparate systems and technologies, between relational and NoSQL, S3, HDFS and more.

  • Full visibility into data lineage from data sources through transformations, joins with other data sources and sharing with other users.
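
To make the idea of a cross-data source join concrete, here is a minimal sketch in Python. It is not Dremio’s API and does not show how Dremio executes queries: it simply copies a CSV source and a JSON source into an in-memory SQLite database so that one SQL statement can join them. The data, table names and function name are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources in different formats.
READINGS_CSV = "device_id,temperature\n7,21.5\n8,19.0\n7,22.5\n"
DEVICES_JSON = '[{"id": 7, "site": "Lab"}, {"id": 8, "site": "Roof"}]'


def average_temperature_by_site():
    """Join CSV readings with JSON device records using plain SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE readings (device_id INTEGER, temperature REAL)")
        conn.execute("CREATE TABLE devices (id INTEGER, site TEXT)")

        # Load the CSV source.
        rows = csv.DictReader(io.StringIO(READINGS_CSV))
        conn.executemany(
            "INSERT INTO readings VALUES (?, ?)",
            [(int(r["device_id"]), float(r["temperature"])) for r in rows],
        )

        # Load the JSON source using named placeholders.
        conn.executemany(
            "INSERT INTO devices VALUES (:id, :site)", json.loads(DEVICES_JSON)
        )

        # One query spans both sources.
        return conn.execute(
            "SELECT d.site, AVG(r.temperature) "
            "FROM readings r JOIN devices d ON d.id = r.device_id "
            "GROUP BY d.site ORDER BY d.site"
        ).fetchall()
    finally:
        conn.close()
```

The sketch copies data into one engine purely to stay self-contained. A data lake query engine aims to give the same single-query experience without that manual loading step.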
