5 minute read · October 16, 2020

Your Path to the Cloud Data Lake – Navigating the Thorny Path of Migration

Kevin Petrie · Eckerson Group

Let’s say you weighed the pros and cons, and decided to journey to the cloud data lake. Furthermore, you opted for a Cloud Data Lake Engine (CDLE) to run your Business Intelligence (BI) workloads once you get there. Designed and built well, the CDLE delivers performance and efficiency breakthroughs by applying interactive SQL query methods and a consolidated semantic layer to cloud-native object storage.

So far so good. Now how do you move your data?

Cloud migrations present a circuitous and thorny path in the best of circumstances. This blog introduces guidelines for architects and data engineers to plan and execute successful migrations. Given the complexity of the topic, one should treat this as a primer rather than a comprehensive guide. The common themes: think ahead, execute in phases, and adapt your plan based on lessons learned. Above all, remain vigilant on the journey to your data lake in the cloud. Wolves lurk.

Pack Your Bag

Architects and data engineers should start their migration planning process by creating an inventory of all their BI use cases, then profiling the workloads, data sets and business requirements for each. They should identify the use cases that have modest latency and throughput needs, and rely on relatively small volumes of data. Ideally these are departmental use cases that do not directly impact revenue. Plan to migrate those workloads and datasets first.

By starting with a lighter pack, you improve the odds you will get there. Data teams can learn the basic dos/don’ts of migration, and test target performance, without posing significant risks to the business. Higher complexity workloads and datasets can wait until after the first migration proves successful.
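The inventory-and-profile step can be sketched in a few lines of Python. This is only an illustration: the `UseCase` fields, the threshold defaults, and the function names are hypothetical, and real cutoffs depend on your own workloads and business requirements.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    data_volume_gb: float      # size of the datasets the workload reads
    max_latency_s: float       # slowest acceptable query response time
    peak_queries_per_min: int  # throughput the workload needs at peak
    revenue_impacting: bool    # does an outage directly affect revenue?
    departmental: bool         # scoped to a single department?


def is_first_wave(uc, max_volume_gb=500.0, min_latency_s=5.0, max_qpm=30):
    """True for modest, low-risk use cases. Thresholds are assumed values."""
    return (
        uc.data_volume_gb <= max_volume_gb
        and uc.max_latency_s >= min_latency_s      # tolerant of slower queries
        and uc.peak_queries_per_min <= max_qpm
        and not uc.revenue_impacting
    )


def plan_waves(inventory):
    """Split the inventory into a first wave and everything that can wait."""
    first = sorted(
        (u for u in inventory if is_first_wave(u)),
        # Prefer departmental use cases, then the smallest datasets.
        key=lambda u: (not u.departmental, u.data_volume_gb),
    )
    later = [u for u in inventory if not is_first_wave(u)]
    return first, later
```

Calling `plan_waves` on the full inventory returns the lighter pack first and defers the higher-complexity workloads until the first migration has proven itself.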

Map Your Course

Data teams also need to define their migration approach. They might be able to simply “re-host,” or “lift and shift,” their existing application or workload to the cloud service provider's Infrastructure as a Service (IaaS) offering, such as Amazon S3 or Azure Data Lake Storage (ADLS). In that case they can replicate their data, schema, metadata, and if necessary their ETL scripts, from source to target without significant changes.

Look for help from your CDLE tool as well. Dremio, for example, can simplify things by transferring the semantic layer – that abstracted business view of all the interdependent tables, columns and schemas – with no changes needed. Some migrations grow more complicated. For example, to integrate with specific Platform as a Service (PaaS) offerings on the cloud target, you might need to rewrite ETL scripts or the BI application itself. You might also need to change the interfaces to your data to support new Software as a Service (SaaS) applications on the cloud.

As a final planning consideration, evaluate and select your migration tool. Options include homegrown ELT or ETL scripts, or change data capture (CDC) tools such as Fivetran or Qlik. CDC reduces the WAN bandwidth required for ongoing updates from source to target by eliminating the need for repeated batch copies.
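To see why CDC reduces WAN bandwidth, a back-of-the-envelope comparison helps. The sketch below assumes a simple model in which each refresh either re-copies the whole dataset (batch) or ships only the changed fraction (CDC); the function names and the `change_rate` parameter are hypothetical, and real tools add their own overhead.

```python
def batch_transfer_gb(dataset_gb, refreshes):
    """Initial load plus a full re-copy of the dataset on every refresh."""
    return dataset_gb * (1 + refreshes)


def cdc_transfer_gb(dataset_gb, refreshes, change_rate):
    """Initial load plus only the changed data on every refresh.

    change_rate is the assumed fraction of the dataset that changes
    between two refreshes (for example 0.02 for 2 percent).
    """
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    return dataset_gb * (1 + change_rate * refreshes)


def wan_savings_gb(dataset_gb, refreshes, change_rate):
    """WAN traffic avoided by shipping changes instead of full copies."""
    return batch_transfer_gb(dataset_gb, refreshes) - cdc_transfer_gb(
        dataset_gb, refreshes, change_rate
    )
```

Under this model, a 1,000 GB dataset refreshed 30 times with 2 percent of the data changing each time moves roughly 1,600 GB with CDC versus 31,000 GB with repeated batch copies.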

Start Moving

Well-planned migrations should not pose major risks to operations or analytics activities, for two primary reasons. First, data teams should schedule the migration during slow business hours, when higher latency or lower throughput will not disrupt BI analysts or business managers. By allotting ample time, they can monitor and remediate migration issues with less risk of intruding on working hours.

Second, tools now support zero-downtime migrations in many cases. For example, the Dremio semantic layer can abstract the virtual datasets being queried from the underlying physical datasets and storage. Even as the location of the physical data changes during migrations, queries continue against the same virtual dataset without disruption. In addition, CDC tools maintain uptime by capturing the incremental updates that reach the source during the initial load transfer and replicating them to the target. Once the load transfer and updates are complete, you can re-point your query application to the fully synchronized target – with no downtime.
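The re-pointing idea can be illustrated with a toy model. This is not Dremio's API or any CDC tool's API: `VirtualCatalog`, `run_query`, the in-memory `storage` dictionary, and the URIs are all hypothetical stand-ins that show how a query addressed to a virtual name keeps working while the physical location changes underneath it.

```python
class VirtualCatalog:
    """Maps a stable virtual dataset name to its current physical location."""

    def __init__(self):
        self._locations = {}

    def register(self, virtual_name, physical_uri):
        self._locations[virtual_name] = physical_uri

    def resolve(self, virtual_name):
        return self._locations[virtual_name]

    def repoint(self, virtual_name, new_uri):
        if virtual_name not in self._locations:
            raise KeyError(virtual_name)
        self._locations[virtual_name] = new_uri  # single-step swap


def run_query(catalog, virtual_name, reader):
    """Queries name only the virtual dataset, never the physical URI."""
    return reader(catalog.resolve(virtual_name))


# Toy physical storage: URI -> rows (hypothetical locations).
source, target = "onprem://warehouse/sales", "s3://example-bucket/sales"
storage = {source: [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]}
catalog = VirtualCatalog()
catalog.register("sales", source)

storage[target] = list(storage[source])      # initial load transfer
late_change = {"id": 3, "amount": 30}
storage[source].append(late_change)          # update arrives during the load
storage[target].append(late_change)          # CDC replays it on the target
before = run_query(catalog, "sales", lambda uri: storage[uri])
catalog.repoint("sales", target)             # cut over once fully in sync
after = run_query(catalog, "sales", lambda uri: storage[uri])
```

The query code never changes across the cutover; only the catalog entry does, which is the property that makes the switch invisible to BI users.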

Blaze the Next Trail

Once data teams complete that first departmental migration, they can assess what they learned. Perhaps WAN transfer throughput fell short of expectations, or latency proved lower than expected on the target S3 platform. Data teams can adjust their future migration schedules or SLAs accordingly. With the post-mortem complete, you can plan your migration of higher-volume, more complex, more mission-critical BI workloads. Be sure to work closely with the BI analysts who consume those workloads to meet their needs without disruption. Then it is time to embark on your next journey to the data lake in the cloud.
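One way to turn a lesson about WAN throughput into a revised schedule is to recompute how many off-hours windows the next migration needs. The sketch below is a rough estimate under stated assumptions: sustained throughput measured in megabits per second, decimal gigabytes, and no allowance for retries or validation time. The function names are hypothetical.

```python
import math


def transfer_hours(volume_gb, throughput_mbps):
    """Hours to move volume_gb at a sustained throughput_mbps (megabits/s)."""
    if throughput_mbps <= 0:
        raise ValueError("throughput_mbps must be positive")
    megabits = volume_gb * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / throughput_mbps / 3600


def windows_needed(volume_gb, measured_mbps, window_hours):
    """Number of off-hours windows required at the measured throughput."""
    return math.ceil(transfer_hours(volume_gb, measured_mbps) / window_hours)
```

For example, if the plan assumed 1,000 Mbps but the first migration sustained only 400 Mbps, a 2,000 GB dataset no longer fits in a single six-hour window and needs two.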

About the author

Kevin Petrie

VP of Research at Eckerson Group

Kevin’s passion is to decipher what technology means to business leaders and practitioners. He has invested 25 years in technology, as an industry analyst, writer, instructor, product marketer, and services…
