Fundamental Considerations of Moving to the Cloud Data Lake
Cloud World: A New Reality in IT
A cloud data lake is a cloud-hosted, centralized repository that lets you store all your structured and unstructured data at any scale, typically in an object store such as Amazon S3 or Azure Data Lake Store. Because it lives in the cloud, the data can be accessed on demand for processing, analytics, or reporting. A cloud data lake can hold all of an organization’s data, including data generated by internal and external actions and interactions.
The term data lake is traditionally associated with Hadoop-oriented storage. In that scenario, an organization’s data is loaded into the Hadoop platform and analyzed where it resides, on the commodity machines that make up the Hadoop cluster. While traditional data lakes were built on on-premises HDFS clusters, the current trend is to move data lakes to the cloud and run them on infrastructure consumed as a service.
A data lake can include structured data from relational databases (rows and columns), semi-structured data such as CSV and JSON, unstructured data such as documents, and binary data such as images or video. The primary value of this shared storage is that it provides a single, unified source for all of a company’s data, where each of these data types can be transformed and analyzed together.
Advantages of Cloud Data Lakes Compared to On-Prem Data Lakes
Moving data storage to the cloud has become feasible for companies of all sizes. Elastic scaling and centralized management simplify operations and shorten the path to data-driven insights. The main advantages are:
Capacity: With cloud storage, you can start with a few small files and grow your data lake to exabytes in size, without the worries that come with expanding and maintaining storage in-house. This frees your engineers to focus on more important work.
Cost efficiency: Cloud storage providers offer multiple storage classes and pricing options. This lets companies pay only for what they use, instead of committing up front to an assumed cost and capacity, as is required when building a data lake on-premises.
Central repository: A centralized location for all object stores and data access means the setup is the same for every team in an organization. This reduces operational complexity and frees up time for engineers to focus on more pressing matters.
Data security: All companies have a responsibility to protect their data. Because data lakes are designed to store all types of data, including sensitive information such as financial records or customer details, security becomes even more important. Under the shared responsibility model, the cloud provider secures the underlying infrastructure, while the customer remains responsible for its own data and access controls.
Auto-scaling: Modern cloud services are designed to scale on demand, so businesses don’t have to worry about expanding capacity when necessary or paying for hardware that they don’t need.
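To make the cost-efficiency point concrete, here is a minimal sketch that estimates a monthly storage bill when data is spread across storage classes. The class names and per-GB prices are hypothetical placeholders, not any provider’s actual rates.

```python
# Hypothetical per-GB monthly prices; real rates vary by provider and region.
PRICE_PER_GB_MONTH = {
    "hot": 0.023,
    "cool": 0.0125,
    "archive": 0.004,
}


def monthly_storage_cost(gb_by_class):
    """Return the estimated monthly cost for data spread across storage classes."""
    return sum(PRICE_PER_GB_MONTH[cls] * gb for cls, gb in gb_by_class.items())


# Example: 1 TB of frequently used data, 10 TB rarely read, 50 TB archived,
# compared with keeping all 61 TB in the most expensive class.
tiered = monthly_storage_cost({"hot": 1_000, "cool": 10_000, "archive": 50_000})
all_hot = monthly_storage_cost({"hot": 61_000})
print(f"tiered: ${tiered:,.2f}  all hot: ${all_hot:,.2f}")
```

With these assumed prices, placing rarely accessed data in cheaper classes lowers the bill considerably. Retrieval and request fees are left out of this sketch.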
Cloud Data Lake Platforms
The main cloud providers, Microsoft, Amazon, and Google, all offer cloud data lake solutions:
Amazon Web Services - Amazon Web Services (AWS) offers a number of data lake solutions, the main one being Amazon Simple Storage Service (Amazon S3). S3 is a highly scalable, widely adopted object store capable of storing all data types. It is designed to be both secure and durable, and its standardized APIs allow the use of external analytics tools.
Microsoft Azure Cloud - Microsoft’s data lake offering, Azure Data Lake Store (ADLS), is a hyper-scale repository for cloud storage. Built to be compatible with the Hadoop Distributed File System (HDFS), ADLS can manage trillions of files and can store and maintain individual files that are petabytes in size. With high availability, ADLS was built for the express purpose of running and maintaining large-scale data analytics in the cloud.
Google Cloud Services - Google provides its own lower-cost cloud data lake offering, which gives users access to Google’s own suite of ingestion, processing, and analytics tools.
Migration to the Cloud Data Lake
Migration remains one of the biggest challenges of moving to the cloud. It is substantially easier to create a cloud data lake from scratch than to migrate an existing on-premise solution, and there are many factors to consider. Migration is also an iterative process: with Dremio, you can take small steps and confirm the success of each one, because Dremio’s data lake engine lets you join several data sources. This helps minimize any downtime that users are exposed to.
Identify your use case
After exploring the advantages and offerings of the different cloud providers, take a closer look at your own situation and needs. While most cloud providers have much in common, each may have a feature or two that fits your analytics needs better. Make sure you understand the limitations of your current data lake. Then identify the changes the cloud data lake will bring; this will help you put into context the benefits you will get when the migration is complete.
A migration process of this magnitude requires the cooperation of several teams across the organization, so it’s important to consider both the stakeholders and user dependency.
Identify strengths and weaknesses
Performing a SWOT analysis will help you identify which resources in your organization will make the migration simpler, for example, data architects or engineers who have gone through a similar exercise. It will also give you an opportunity to identify possible roadblocks and address them before the migration begins.
This analysis is a fundamental step in preparing for the migration. This is the moment for the team involved to ask themselves, “How should we prepare for the migration? How can we mitigate the potential impact of the threats, and how can we make our weak spots more robust? How can we use the opportunities to amplify their impact?” These and other questions should be answered before moving further.
Analyze user impact
At this step, you should determine which applications, systems, departments, and employees are using your local data lake. Then you should think about how they are using it and what changes will occur when you move the data lake to the cloud. Remember to migrate one step at a time and not all your sources of data at once. You can take advantage of the fact that Dremio allows you to join multiple data sources; you can connect to your sources on the cloud data lake while keeping connections with data from the legacy system without impacting users.
Think carefully about the implications of transitioning your data lake to the cloud. You need to understand the potential impact of the migration to other business aspects and estimate all explicit and hidden costs related to this process.
When developing your cloud data lake architecture, keep in mind that cloud solutions require less maintenance, which makes this step significantly easier than designing an on-premise data lake architecture. Define the physical storage, the core storage layer of a cloud data lake that holds the primary datasets; this is where all your raw data will land. Cloud solutions provide high levels of elasticity and scalability, so you can change the amount of available resources as needed.
Also, it is possible that after moving to the cloud, you will need to include some other cloud services into your system architecture. Modern cloud providers offer a wide range of different products for data processing and analytics which may be relevant for your use case; consider these elements when designing your architecture landscape.
Cloud provider selection
Compare offerings from different cloud providers and select the one that is the best fit for your company. Although we list this step as a separate milestone, it can be done jointly with the previous step, because architectural decisions often depend on the particular offerings available on the chosen cloud platform, as well as on its associated pricing.
The points discussed in previous steps and the variety of offers that cloud vendors provide will help you decide whether you need to implement a cloud data lake or a hybrid (cloud + on-premise) solution. The decision depends on your organization’s specific factors and your current use case.
Overview of the Data Lake Migration Process
Preparing for the migration
Create a migration plan that not only includes the list of subtasks needed to execute the migration to the cloud, but also a timetable for each of the steps. Metrics which show the success or failure of the main tasks should also be developed and specified.
The following are examples of tasks and factors to keep in mind when preparing for the migration:
- When is it better to perform a migration?
- What is the estimated disruption time?
- How will we protect ourselves from unexpected damage such as data loss?
- How will we do the backup? How will we perform a possible restore?
- What post-migration procedures should we execute?
- What potential problems might emerge?
Migration preparation is mainly a technical milestone. Nevertheless, all stakeholders should be engaged in the process of plan preparation.
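A migration plan with subtasks, a timetable, and success metrics can be kept as simple structured data. The sketch below is one possible shape, not a prescribed format; the `MigrationTask` class, the `overdue` helper, and the sample tasks and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MigrationTask:
    name: str
    start: date
    end: date
    success_metric: str  # how the team decides the task succeeded
    done: bool = False


def overdue(tasks, today):
    """Return names of unfinished tasks whose end date has passed."""
    return [t.name for t in tasks if not t.done and t.end < today]


# Hypothetical plan entries, for illustration only.
plan = [
    MigrationTask("Back up on-premise data", date(2024, 3, 1), date(2024, 3, 5),
                  "restore test passes", done=True),
    MigrationTask("Transfer pilot dataset", date(2024, 3, 6), date(2024, 3, 12),
                  "row counts and checksums match"),
    MigrationTask("Post-migration validation", date(2024, 3, 13), date(2024, 3, 20),
                  "historical reports match legacy results"),
]

print(overdue(plan, today=date(2024, 3, 15)))
```

Keeping the success metric next to each task makes it easier for non-technical stakeholders to review the plan alongside the engineers.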
Perform the migration
Once everything is in place, the migration process can be executed in four stages:
- Backup. This is the most important step, and it will prevent you from losing data or configuration files and settings.
- Deploy resources. Create and set up your cloud account, subscribe to the needed resources, choose the pricing plan, and establish all needed connections.
- Transfer data. When the cloud environment is ready, you should start transferring the data from your local storage into the cloud. This is the pivotal point of your migration journey. It is exactly in this moment that your on-premise data lake goes to the cloud.
- Test. After successfully migrating to the cloud, conduct an exhaustive quality assurance exercise to check on data integrity, available connections, and proper interaction between all the systems in the new environment.
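The transfer and test stages above can be sketched as a copy followed by a checksum comparison. This is a minimal illustration that uses local temporary directories as stand-ins for on-premise and cloud storage; in a real migration the copy would be replaced by your provider’s upload tooling. The function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_and_verify(source_dir, target_dir):
    """Copy every file and return relative paths whose checksums do not match."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    mismatches = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dst = target_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)  # stand-in for an upload call
        if sha256_of(src) != sha256_of(dst):
            mismatches.append(str(rel))
    return mismatches


# Demo with temporary directories standing in for on-premise and cloud storage.
with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as cloud:
    (Path(local) / "sales").mkdir()
    (Path(local) / "sales" / "2023.csv").write_text("id,amount\n1,10\n")
    failed = transfer_and_verify(local, cloud)
    copied = (Path(cloud) / "sales" / "2023.csv").exists()

print("mismatched files:", failed)
```

An empty list of mismatches is only one part of the test stage; connections and downstream systems still need their own checks.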
Treat the first step of the migration as a pilot. Only a very small part of the large on-premise data lake is transferred to the cloud. This step paves the way for migrating the entire data lake. When the pilot project is completed, the organization can move on to the next steps.
Moving data one step at a time will put you in a hybrid state, in which a substantial part of the data lake is in the cloud while another substantial part remains on-premise. Depending on your architectural needs, this might be your final state.
Finally, you reach a full-cloud or multi-cloud state. At this point, the company’s entire data lake has been transferred to the cloud. In a multi-cloud setup, an organization can maintain several data lakes, each specialized in a particular area of the business, and different cloud providers may be more beneficial or convenient for different tasks.
To ensure proper adoption and success of the new cloud platform, be sure to educate those who need it on how to work with the new cloud data lake. Set up monitoring of key points in the system. Fine-tune the cloud deployment and fix any bugs that might have shown up in the process.
Share your progress with the stakeholders and data consumers who use the platform on a daily basis so they can validate the usability of the new cloud platform. Select a detail-oriented test team to verify access and data integrity. This will help you resolve any discrepancies before you roll out the new deployment to a larger audience.
Test a sufficiently large sample of data from different time ranges. Check that historical values from the legacy system match the results obtained from the new platform.
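One way to check that historical values match is to compare aggregates per time range on both platforms. The sketch below assumes rows have already been exported from each system as `(iso_date, amount)` pairs; the function names and sample data are hypothetical.

```python
from collections import defaultdict


def totals_by_year(rows):
    """Sum the amount per year; rows are (iso_date, amount) pairs."""
    totals = defaultdict(float)
    for iso_date, amount in rows:
        totals[iso_date[:4]] += amount
    return dict(totals)


def compare_totals(legacy_rows, cloud_rows, tolerance=1e-9):
    """Return the years whose totals differ between the two platforms."""
    legacy, cloud = totals_by_year(legacy_rows), totals_by_year(cloud_rows)
    return sorted(
        year for year in legacy.keys() | cloud.keys()
        if abs(legacy.get(year, 0.0) - cloud.get(year, 0.0)) > tolerance
    )


# Hypothetical samples: one 2022 row is missing from the cloud copy.
legacy_sample = [("2021-05-01", 100.0), ("2022-01-15", 40.0), ("2022-07-30", 60.0)]
cloud_sample = [("2021-05-01", 100.0), ("2022-01-15", 40.0)]

print(compare_totals(legacy_sample, cloud_sample))
```

Any year returned by `compare_totals` points the test team at a time range to investigate in more detail.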
Transitioning to the cloud changes many business processes, not to mention technical processes. The organization should be ready for these changes. Employees should be prepared for the migration and for using the cloud data lake instead of on-premise. Of course, most of the training efforts should be aimed at extending the skills of technical personnel. But it is also important to educate non-technical employees about the migration process, the reasons for moving to a cloud data lake, etc.
Planning the Migration: What Do You Need?
In the previous section, we learned about the key steps in the cloud migration process. Now let’s go deeper into the important aspects of migrating to the cloud data lake.
Select a migration model
There are several models for migrating a data lake from local infrastructure to the cloud. Probably the best known is migrating to a cloud-based storage solution (for example, Amazon S3 or ADLS) as the home for your data lake. This is the simplest case.
Another approach is known as ForkLift. In this case, you provision hardware in the cloud equivalent to what you use locally for the current data lake. In other words, you provision the same number of compute and storage instances and then deploy your data lake on them. This approach is relatively easy because you don’t have a lot of variables or changes to think about. You simply create a cluster from instances and then do the same things in the cloud that you would do with your own local servers.
Finally, the third approach is to use a Hadoop cluster managed by the cloud provider. Modern data lakes are often based on the Hadoop ecosystem, in particular on HDFS. Popular cloud providers have offerings of managed Hadoop clusters in the cloud. In this case, the cloud provider is in charge of different low-level tasks as well as maintenance duties.
So, prior to thinking about more specific aspects of cloud migration, start by considering what migration model you would like to implement.
In addition to choosing a migration model, there are a number of points that also need to be taken into account. Below we will explain the most common aspects of migrating to the cloud data lake.
Evaluate compliance requirements
Think about the regulatory and compliance requirements when planning your transition from on-premise to the cloud. The key point is that during the migration, your data is transferred to a different location. Because of this, you need to take into account the regulations and laws of not only your current country, but also where the data will be stored. Fortunately, cloud providers often have the answer to regulatory and compliance questions.
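A data residency rule of this kind can be checked before any data moves. The sketch below is a simplified illustration: the data categories, region names, and the `residency_violations` helper are hypothetical, and real compliance requirements should come from your legal and compliance teams.

```python
# Hypothetical policy: which storage regions each data category may use.
ALLOWED_REGIONS = {
    "customer_pii": {"eu-west", "eu-central"},
    "telemetry": {"eu-west", "us-east", "ap-south"},
}


def residency_violations(datasets):
    """Return names of datasets planned for a region their category does not allow."""
    return [
        name
        for name, (category, region) in datasets.items()
        if region not in ALLOWED_REGIONS.get(category, set())
    ]


# Hypothetical migration targets: dataset name -> (category, planned region).
planned = {
    "crm_contacts": ("customer_pii", "us-east"),
    "clickstream": ("telemetry", "us-east"),
}

print(residency_violations(planned))
```

Categories that are missing from the policy are treated as violations, so unclassified data is flagged for review by default.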
Security is critically important today, as attacks have become more frequent around the world. If you use the cloud data lake as a service, your company has very little to do with regard to low-level security. Under the shared responsibility model, you are responsible mainly for the data you send to the cloud and for who can access it; the cloud provider is responsible for securing the underlying infrastructure. If you instead choose Hadoop-as-a-Service or, especially, an IaaS model, your responsibility increases. In these cases, the security of your low-level infrastructure is your job; cloud providers will not cover your losses related to low-level security failures.
Identify a level of failure tolerance
The level of failure tolerance can be customized according to your needs and budget. If you want your data lake to be replicated in several locations around the globe, you will have to pay more for this service. However, even with the cheapest pricing plan and the minimum replication factor, cloud providers aim for very high durability and fault tolerance. This is a big advantage of cloud technologies.
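A rough way to see why replication improves durability is to assume that each replica fails independently with the same probability. Under that simplifying assumption, which ignores repair processes and correlated failures, the chance of losing every copy shrinks geometrically with the number of replicas. The failure probability used here is purely illustrative.

```python
def loss_probability(p_single, replicas):
    """Probability that every replica is lost, assuming independent failures."""
    return p_single ** replicas


# Illustrative only: a 1% chance of losing any single replica in a given period.
for r in (1, 2, 3):
    print(f"{r} replica(s): {loss_probability(0.01, r):.0e}")
```

Providers publish their own durability figures, and those should be used for any real planning.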
Evaluate and select a pricing plan
You should consider whether you need everything all at once, or whether you can start small and then gradually increase the available resources. Since cloud technologies provide a high level of flexibility and scalability, it is often the smart decision to migrate to the cloud in batches. Pay-as-you-go is the perfect option for most use cases. But remember, price often depends not only on the amount of resources used, but also on other factors such as geographical replication, access speed, etc.
One of the great features of cloud data lakes is that they separate storage from compute. As we have seen earlier, you use compute resources only when you actually need to process raw data from the data lake (schema-on-read), so you do not need to hold compute resources all the time. The cloud also offers so-called transient compute resources, often sold as spot or preemptible instances. You can use these compute resources when you need them, but the cloud provider can reclaim them at any time depending on its own capacity needs. The benefit is the price: they are much cheaper than permanent compute resources.
For cloud data lakes, where you don’t need to perform heavy calculations on a permanent basis, the transient compute resources could be a really good choice. Typically, you will order a small portion of the permanent compute resources (say, 10-20%), and the other larger part will be the transient compute resources (80-90%). This approach could be really cost-effective.
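The arithmetic behind this split is straightforward. The sketch below compares an all-permanent cluster with a 20/80 mix; the node count and hourly prices are hypothetical, and the `blended_hourly_cost` helper is only an illustration.

```python
def blended_hourly_cost(nodes, permanent_share, permanent_price, transient_price):
    """Hourly cost of a cluster split between permanent and transient nodes."""
    permanent_nodes = round(nodes * permanent_share)
    transient_nodes = nodes - permanent_nodes
    return permanent_nodes * permanent_price + transient_nodes * transient_price


# Hypothetical prices: transient capacity at 30% of the permanent hourly rate.
all_permanent = blended_hourly_cost(100, 1.0, 1.00, 0.30)
mixed = blended_hourly_cost(100, 0.2, 1.00, 0.30)
print(f"all permanent: {all_permanent:.2f}/h  20/80 mix: {mixed:.2f}/h")
```

The saving comes with the risk of interruption, so jobs running on transient nodes should be restartable.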
How will the new cloud data lake communicate with other components of the system? You should support at least the level of interconnectivity which was in your local data lake before migration. In addition, you can connect other services to your cloud data lake. This could be both cloud and non-cloud services. The matter is very use case-specific, but you should definitely think about maintaining the optimal interconnectivity architecture. The issue of data flow also needs to be considered. You need to have a good understanding of how you will get your data into the cloud data lake, and how you will manage it from there.
Designing a Proper Cloud Data Lake Architecture
We have seen earlier that there are several cloud approaches: full-cloud, hybrid, or multi-cloud. This is one aspect of the design. But there is also another aspect: how will you organize a single cloud data lake? In other words, what is the optimal structure of your cloud data lake? This is a complex question, and the answer can be derived only after a detailed analysis of your particular use case. Only specialists with relevant experience can develop the architecture that best fits the needs of a particular business. Here is one example of how a cloud data lake can be structured, to give you an idea of the kinds of problems that should be solved during the design phase.
A typical cloud data lake can be split into several zones. In most cases, there are just two zones: raw zone and trusted zone. Depending on the architectural decisions, there also may be a landing zone and a sandbox.
The landing zone
This zone is a temporary place where data from sources is stored before it moves to the raw zone. The landing zone is sometimes called a transient zone. It is important for companies that operate in highly regulated industries such as the financial sector: before data is stored permanently in the raw zone, it should pass through some initial compliance procedures.
Data is dumped into the transient zone, special measures are applied to it, and once everything is in order, the data moves on. Medical organizations that store sensitive client data are another example of companies that may require a landing zone. An organization should think carefully about whether it needs this zone. If it does not, it is better to leave it out and keep the data lake architecture simpler.
The raw zone
This is the largest zone and is used to store data in its raw form. If the data lake has a landing zone, data is transferred from it to the raw zone. If there is no landing zone, the raw zone is the first place where data lands. Basic data quality checks must already have been applied to the data stored here. The main users of data from the raw zone are ETL developers. If needed, the data is tokenized in this zone, and various metadata is created as well. This is the most important zone to manage carefully if you want to keep the data lake from turning into a data swamp.
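Tokenization can be implemented in several ways. The sketch below uses keyed hashing (HMAC-SHA256) to replace a sensitive field with a deterministic, non-reversible token; vault-based tokenization, which keeps a reversible mapping, is a common alternative. The record, the key, and the function names are hypothetical.

```python
import hashlib
import hmac


def tokenize(value, secret_key):
    """Replace a sensitive value with a deterministic, non-reversible token."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize_record(record, sensitive_fields, secret_key):
    """Return a copy of the record with the sensitive fields tokenized."""
    return {
        field: tokenize(str(value), secret_key) if field in sensitive_fields else value
        for field, value in record.items()
    }


# Hypothetical record and key; a real key would come from a secrets manager.
key = b"example-key-not-for-production"
raw = {"customer_id": "C-1001", "email": "ana@example.com", "amount": 42.5}
safe = tokenize_record(raw, {"email"}, key)
print(safe)
```

Because the same input always produces the same token, tokenized columns can still be used for joins and counts without exposing the original values.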
The trusted zone
The trusted zone is where the data from the raw zone is prepared for further usage. Only the data that will definitely be used is stored in the trusted zone. The initial data preprocessing techniques could be applied to data in the trusted zone. The main users of this zone are data scientists, business analysts, and other corporate employees who need access to data in the data lake.
In some cases, an organization can also create an additional zone called the sandbox. This is the place for experiments and testing. Many new approaches, insights, and solutions can be explored here. The data can arrive here from any source: directly from the original data source, or from the landing, raw, or trusted zone.
The architecture described above is only one of the possible options. It is popular enough, but there could be other options as well, which could be more suitable for the specific use case. The key concept that we wanted to highlight is that the architecture of the data lake should be carefully considered beforehand.
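In object storage, zones like these are often represented as key prefixes. The sketch below shows one hypothetical layout in which the first path segment names the zone and data is promoted from landing to raw to trusted; the prefixes and helper functions are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical prefixes within a single bucket or container.
ZONES = ("landing", "raw", "trusted", "sandbox")
PROMOTION_ORDER = {"landing": "raw", "raw": "trusted"}


def zone_of(key):
    """Return the zone an object key belongs to (its first path segment)."""
    zone = PurePosixPath(key).parts[0]
    if zone not in ZONES:
        raise ValueError(f"unknown zone in key: {key}")
    return zone


def promoted_key(key):
    """Return the key the object would have in the next zone."""
    current = zone_of(key)
    if current not in PROMOTION_ORDER:
        raise ValueError(f"objects in '{current}' are not promoted further")
    rest = PurePosixPath(key).parts[1:]
    return str(PurePosixPath(PROMOTION_ORDER[current], *rest))


print(promoted_key("landing/payments/2024/01/batch-001.json"))
```

Keeping the rest of the path unchanged across zones makes it easy to trace a trusted dataset back to its raw and landing copies.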
How Dremio Can Help
Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer operating directly against your data lake storage. There’s no need to move data to proprietary data warehouses or create cubes, aggregation tables, or BI extracts. Dremio offers flexibility and control for Data Architects, and self-service for Data Consumers.
Dremio provides an integrated, self-service interface for data lakes, designed for BI users and data scientists. It increases the productivity of these users by allowing them to easily search, curate, accelerate, and share datasets with other users. In addition, Dremio allows companies to run their BI workloads from their data lake infrastructure, removing the need to build cubes or BI extracts.
Dremio and cloud data lakes
Here’s how Dremio helps you leverage your data lake:
With Dremio, you don’t need to worry about the schema and structure of the data that you put in your data lake. Dremio takes data from any source (relational or NoSQL) and converts it into a SQL-friendly format without needing to make extra copies. You can then curate, prepare, and transform your data using Dremio’s intuitive user interface, making it ready for analysis.
Dremio makes it easy for your data engineers to curate data for the specific needs of different teams and different jobs without needing to make copies of the data. By managing data curation in a virtual context, Dremio makes it fast, easy, and cost effective to design customized virtual datasets that filter, transform, join, and aggregate data from different sources. Virtual datasets are defined with standard SQL, so they fit into the skills and tools already in use by your data engineering teams.
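The idea of a dataset defined by a query rather than by a copy of the data can be illustrated with an ordinary SQL view. The sketch below uses Python’s built-in SQLite module purely as an analogy; it is not Dremio’s API or syntax, and the tables and view name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 20, 75.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (20, 'APAC');

    -- The view stores only the query, not a copy of the rows.
    CREATE VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region;
""")

rows = conn.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)
```

The view joins and aggregates two sources on every read, so consumers always see current data without an extra copy being maintained.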
Optimization and governance
To scale these results across your enterprise, Dremio provides a self-service semantic layer and governance for your data. Dremio’s semantic layer is an integrated, searchable catalog in the Data Graph that indexes all of your metadata, allowing business users to easily make sense of the data in the data lake. Everything created by users (spaces, directories, and virtual datasets) makes up the semantic layer, all of which is indexed and searchable. The relationships between your data sources, virtual datasets, and all your queries are also maintained in the Data Graph, creating a data lineage that allows you to govern and maintain your data.
At its core, Dremio makes your data self-service, allowing any data consumer in your company to find the answers to your most important business questions in your data lake. It does not matter whether you’re a business analyst who uses Tableau, Power BI, or Qlik, or a data scientist working in R or Python. Through the user interface, Dremio also allows you to share and curate virtual datasets without needing to make extra copies, optimize storage, and support collaboration across teams. Lastly, Dremio accelerates your BI tools and ad-hoc queries with reflections, and integrates with all of your favorite BI and data science tools, allowing you to leverage the tools you already know how to use on your data lake.
In this article, we discussed all of the keystone elements that you need to keep in mind to successfully approach a migration to the cloud data lake. We discussed a wide variety of best practices and checklists to go through to ensure that all possible angles are covered at the time of migration. We also provided valuable ideas to accelerate the time to insight and help with the big move. With Dremio’s Data Lake Engine, you can finally leverage all the potential of your data lake without any boundaries, and accelerate the time to insight from days to minutes.