May 3, 2024

A Tale of Two Migrations: How to Modernize Your Data Stack, the Right Way

It’s a tale as old as time: a data migration that was supposed to take months turns into years turns into something that no longer has an end date—all while going over budget and increasing in complexity every day. In this session, Gleb is going deep on the methods, tooling, and hard lessons learned during a years-long migration at Lyft. Specifically, he’ll share how you can leverage data quality testing methodologies like cross-database diffing to accelerate a data migration without sacrificing data quality.

Attendees will walk away with practices that will allow your data team to plan, move, and audit database objects with speed and confidence during a migration.

Sign up to watch all Subsurface 2024 sessions


Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Gleb Mezhanskiy:

Hi, everyone. My name is Gleb. I am CEO and co-founder of Datafold. Datafold is a data quality platform. We automate testing for data engineering teams. And prior to starting Datafold, I was a data engineer for pretty much my entire career. I built data platforms at Autodesk and was one of the first data hires at Lyft. And it's actually at Lyft where I got to lead and execute a really large-scale migration, and I failed. So today I'm going to talk about the hard lessons that I learned executing a data migration that hopefully you can take away and implement in your practices to do it more successfully than I did. And it's called the Tale of Two Migrations because the second migration I actually helped enable as a data vendor. We built software at Datafold that helps data teams accelerate their migrations. So I'll be covering the success story of our customer and contrasting it with the things that I did wrong as a data leader at Lyft.

Data Migration

Okay, so first things first, let's start with definitions. What do we mean by data migration? Typically there are two types of data migrations that we're doing. It could be moving our workloads and storage and compute from one system to another. So for example, maybe you're doing analytics on Oracle and now you're moving those transformational workloads to a Dremio-powered data lake. Another example is we keep the same storage and compute layer, but we're changing our transformation framework, for example, from a more legacy platform like MicroStrategy to a newer open source framework like dbt. Sometimes we would be doing both together at the same time, and that gets really complex.

So why talk about migrations? In my experience, when you start a migration, it's because things are not really working well. Either your current data platform is too slow or not capable enough, and you've likely already accumulated a lot of technical debt and investment on the older platform. And that means migrations are complex projects involving multiple teams and stakeholders. And they're not fun. This is not something that data engineers are usually excited to work on. It doesn't feel like you're creating something new. They almost always run over budget and they never complete on time. And at the same time, migrations are really necessary, because if your data platform is not performing up to expectations, that's a non-starter for building anything downstream. So we're doing migrations to really level up our infrastructure. So how can we do migrations better? The two tales that I'll be talking about are the migration I did at Lyft, where we had to migrate over 500 transformation tables that were built using Airflow, over 100 dashboards, and over 10,000 ad hoc reports. So a lot of the complexity was especially on the consumption side. We were moving this, at the time, to a Hive-powered data lake. Hive is not really a modern system today, but the same principles apply to any migration to a data lake, for example, one powered by Dremio. And then Faire, on their end, were migrating from Amazon Redshift to Snowflake, with similar size and complexity in terms of migration.

Tech Choices

So what are the five things we need to think about to execute a migration successfully? It's tech choice. We have to make sure that the platform we are migrating to is actually good and fulfills our business expectations. It's architecture. How are we thinking about moving the workloads, and are we changing them or not in the process? Who is performing the migration: are we outsourcing this or doing it in-house? How do we sequence the project itself? And then finally, what are the steps in the migration work that we can actually automate? In terms of tech choices, I won't spend a lot of time here, but just to highlight the mistakes I made migrating from Redshift to a Hive-powered data lake: we really got a lot of these wrong. Hive at the time was probably a mature technology, but it had already reached its peak and was starting to go down into the legacy world. The user experience was really bad; to illustrate, sometimes it would take about three minutes just to compile SQL and tell you that you have a trailing comma. So really, really bad productivity, and the cost was actually really high relative to what we had before, because Hive is not necessarily the most efficient system. Since then, Lyft has actually moved on to Trino and Spark, so they improved the performance, but at the time we started the project, it was really hard.


Architecture. So this one is really interesting. When you start a migration, you have really two choices for how you go about moving all the workloads that you have. You can either lift and shift, which is you take all your stored procedures, all your code, and you move it as is. You're just converting it to the new dialect to make sure it runs, but you're not trying to refactor anything. You're not changing the schema, you're not optimizing the queries. And the second way, the proper way, is you improve as you go, right? Because when you do a migration, you've already accumulated so much technical debt and some things just don't make sense and there are piles and piles of spaghetti code, so why not re-architect in the process? So question for the audience: who thinks we should go for lift and shift? Just type it in the chat. Or who thinks we should go the proper way? Just type proper. No one really has a strong opinion? Lift and shift, proper, okay. Okay. Looks like more folks want to do things the proper way. Okay, well, that's awesome.

So in my experience, I tried to do it the proper way at Lyft, and that was absolutely the wrong thing to do. And the reason is that migration itself is hard enough: you're moving a lot of workloads, you're under time pressure, you have to maintain two systems. If you're trying to change things in the process, if you're trying to re-architect and improve things, that really is a recipe for blowing up the project scope massively. And my mistake was that at the time we had pretty bad data models, and what I tried to do was actually improve them, create a canonical star schema, align with the stakeholders, but that exploded our scope and slowed us down by over a year. So I became a strong believer that lift and shift is the way to go. You can always rebuild things once you land on the new system, but don't try to do both at once, because you're putting the entire project at risk.


The other thing is outsourcing, right? So we talked about migration not being really fun for your internal team. So maybe it's a good idea to outsource it to a consultancy or a system integrator who have seen a lot of projects like that and can probably do things faster. And you also don't need to commit your entire team to doing that and say no to other projects. So that's basically about opportunity cost: what your team could be doing if they're not focused on the migration. The recommendation here is to really think about outsourcing, especially if you're doing lift and shift, but you should never task consultants with re-architecting or improving a data model, because again, that is a recipe for disaster. And as you may have guessed, that's exactly what I tried to do in my project. We had consultants, but I thought that they were so experienced that they would come in and help me build the ideal star schema. That was a huge mistake, because if it's hard to do internally, it's even harder for someone who doesn't have the entire business context and relationships, and again, it led to the project ballooning.


Next, let’s talk about sequencing. So you have, let’s say, thousands of tables and lots of BI use cases. Where do you start? So let’s look at this picture. We have a legacy data source that has some raw data, let’s say events or a replica of your production systems, and then you create derived data using stored procedures or an orchestrator. And then finally you have some business apps like BI tools, data applications, and machine learning that are consuming that data. And we’re trying to move that to the new system, right? The best way to sequence your project is not to try first to move all of your transformations to the new system, but rather to just copy everything you have that is currently consumed by the business, the derived data, right? Move it to the new system. Basically, you can continuously replicate it with tools like, you know, Upsolver or Fivetran. And then switch over your business consumption to the new system. It’s actually very easy to do with Dremio, given the virtualization engine, right? You actually don’t have to move the data. You can just point your BI dashboard to Dremio and have it query the legacy system, which is a really good advantage of Dremio over other platforms. What that allows you to do is essentially completely abstract the user away from where the data is created, right? So now your business users and data users don’t have a dependency on the legacy system. What that allows you to do in turn is to work on migrating all these transformations and how the derived data is built without, you know, worrying about breaking the business, because your business safely consumes exactly what they were consuming before, but now from the new system.
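The switch-over step above can be sketched as a stable view layer that hides where the data physically lives. Here is a minimal illustration using SQLite as a stand-in for both systems (all table and column names are made up); in practice, Dremio's virtual datasets play the role of the view:

```python
import sqlite3

# SQLite stand-in for the idea of abstracting consumers from the physical
# source; in a real migration, Dremio virtual datasets play this role.
con = sqlite3.connect(":memory:")

# The derived table, continuously replicated over from the legacy system.
con.execute("CREATE TABLE legacy_fct_rides (id INTEGER, fare REAL)")
con.executemany("INSERT INTO legacy_fct_rides VALUES (?, ?)", [(1, 10.5), (2, 7.0)])

# BI tools query the stable name `fct_rides`. Today it is a view over the
# replicated legacy table; later it can point at the rebuilt transformation
# output without touching a single dashboard.
con.execute("CREATE VIEW fct_rides AS SELECT id, fare FROM legacy_fct_rides")

print(con.execute("SELECT COUNT(*), SUM(fare) FROM fct_rides").fetchone())  # (2, 17.5)
```

When the transformations are rebuilt on the new system, only the view definition changes; consumers keep querying the same name.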

Another advantage of doing that is that usually by the time you’re doing a migration, your legacy system is already overloaded. So for example, for us at Lyft, Redshift was already so overworked that we had a huge query queue and latency was really, really high. And so moving the business consumption to the new data lake, and serving the data that was produced by Redshift from there, really helped us alleviate a lot of the pressure and pain points for the business.


And finally, let’s talk about automation. So migration is a highly repetitive and iterative process. How do we do this, right? We identify what are the pieces of code, what are the transformations we need to move over, and then the corresponding BI use cases, right? Then for each transformation, we convert the code, run it in the new system, and then we have to validate that it’s correct. And we do this over and over again. If you have, let’s say, a thousand tables in your legacy data store, you probably need to do this a thousand times, and each asset to be migrated requires lots of iterations. So thinking about what you can automate in this process and how you can have an efficient process for converting your workloads can really massively accelerate your project. In the planning phase, one of the things that you can do is invest in a column-level lineage solution. What that allows you to do is understand the dependencies in your legacy system: how does the data flow from the raw data to transformations all the way to BI stakeholders, data users, and machine learning apps? And that really helps you understand what the sequence should be for moving over those workloads. Which ones are more critical? Which ones are not even used today, so you can completely, you know, abandon them or deprecate them and not even bother migrating them? So that just gives you a lot of visibility and allows you to plan your migration in a smart way.
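The lineage-driven planning described above boils down to a topological sort over a dependency graph. This is a toy table-level sketch (real column-level lineage is finer-grained, and all table names here are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical lineage graph: table -> the upstream tables it reads from.
deps = {
    "raw_events": set(),
    "raw_users": set(),
    "stg_events": {"raw_events"},
    "stg_users": {"raw_users"},
    "fct_rides": {"stg_events", "stg_users"},
    "dashboard_kpis": {"fct_rides"},
    "old_experiment": {"raw_events"},  # nothing downstream reads this one
}

def migration_order(deps):
    """Upstream-first order in which to migrate tables."""
    return list(TopologicalSorter(deps).static_order())

def deprecation_candidates(deps, consumed_by_bi):
    """Tables that feed no other table and no BI use case: skip migrating them."""
    referenced = set().union(*deps.values()) | consumed_by_bi
    return sorted(set(deps) - referenced)

print(migration_order(deps))
print(deprecation_candidates(deps, {"dashboard_kpis"}))  # ['old_experiment']
```

The same traversal, run over real lineage metadata, tells you both what to migrate first and what you can safely drop.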

The second thing you can automate is SQL translation. So these days you can use a tool like Datafold. You can use LLMs; you know, ChatGPT and Claude 3 are actually not bad at translating SQL. They’re not great. They make a lot of mistakes, and it takes quite a bit of prompt engineering in order to translate SQL. But it’s fairly doable. You no longer need to have humans actually remap everything. And finally, validation. What’s really important is that a migration is never successful until the business is comfortable using the new data. For the business to be comfortable using the new data, you have to prove to your stakeholders that whatever you produce with the new system, let’s say the Dremio-powered data lake that you just created, is fully on par metric-wise: it’s the same data, you know, same KPIs, same reports, same dashboards as what you had in legacy. If the business senses there’s any discrepancy between those, that means you are not successful, right? They will never accept that. Especially if you are working in a regulated environment or if you’re a public company, you simply cannot have your data change while you’re doing a migration. So the validation piece is where I personally, as a data engineer, spent most of my time. And it’s been an incredibly painful process to understand what the discrepancies are, where they’re coming from, and whether I’m improving as I’m, you know, tweaking the code and trying to bring my new data platform on par.
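As a rough illustration of what dialect translation involves, here is a toy rule-based Redshift-to-Snowflake converter covering just two syntax differences; a real project would use a parser-based tool (such as sqlglot), a commercial translator, or an LLM with careful review rather than regexes:

```python
import re

# Toy rules for two real Redshift -> Snowflake differences; not exhaustive.
RULES = [
    (re.compile(r"\bGETDATE\(\)", re.I), "CURRENT_TIMESTAMP()"),  # Redshift fn
    (re.compile(r"\bDISTKEY\s*\([^)]*\)", re.I), ""),  # Snowflake has no distkeys
    (re.compile(r"\bSORTKEY\s*\([^)]*\)", re.I), ""),  # ...or sortkeys
]

def translate(sql: str) -> str:
    """Apply each rewrite rule, then normalize whitespace."""
    for pattern, replacement in RULES:
        sql = pattern.sub(replacement, sql)
    return " ".join(sql.split())

print(translate("CREATE TABLE t DISTKEY(id) AS SELECT id, GETDATE() AS ts FROM s"))
# CREATE TABLE t AS SELECT id, CURRENT_TIMESTAMP() AS ts FROM s
```

Regexes break down quickly on real SQL, which is exactly why parser-based tools and LLMs dominate this step; the sketch only shows how mechanical the bulk of the remapping is.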

Data Diff

So how do you solve this? You solve this with a tool called Datadiff. Datadiff is essentially just like git diff, but for data. We built this tool at Datafold because we believe that being able to compare data within a database or across databases is a really fundamental capability for a data engineering workflow. When you apply it in a migration context, you can do really cool things such as comparing data in your legacy source, let’s say in this example Postgres, and your data lake powered by Dremio. So what Datadiff gives you is, for every primary key of the table, whether the values match between the two systems, in this case Postgres versus Dremio, giving you a full picture of whether your data matches or not. That doesn’t necessarily have to be a value-level diff, although that’s very powerful. Datadiff also provides you statistical-level aggregates, so you can see whether your metrics are matching, whether distributions are changing, and otherwise have a lot of insight into how the data is different.
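A minimal sketch of the value-level diffing idea (not Datafold's actual implementation, which hashes segments of tables in-database for efficiency): fingerprint each row by primary key on both sides, then compare the key sets and fingerprints. The row contents here are made up:

```python
import hashlib

def row_fingerprints(rows, key="id"):
    """Map primary key -> hash of the remaining column values."""
    fps = {}
    for row in rows:
        payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != key)
        fps[row[key]] = hashlib.sha256(payload.encode()).hexdigest()
    return fps

def data_diff(source_rows, target_rows, key="id"):
    """Classify every primary key: missing, extra, or values changed."""
    src = row_fingerprints(source_rows, key)
    tgt = row_fingerprints(target_rows, key)
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target": sorted(tgt.keys() - src.keys()),
        "values_differ": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }

# Toy rows standing in for Postgres (legacy) and Dremio (new).
postgres_rows = [{"id": 1, "fare": 10.5}, {"id": 2, "fare": 7.0}, {"id": 3, "fare": 3.2}]
dremio_rows = [{"id": 1, "fare": 10.5}, {"id": 2, "fare": 7.5}]

print(data_diff(postgres_rows, dremio_rows))
# {'missing_in_target': [3], 'extra_in_target': [], 'values_differ': [2]}
```

A production tool pushes the hashing into each database and compares checksums of key ranges first, so it never has to pull full tables over the network.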

And what that in turn allows you to do is make sure that when you are moving the data over, and, let’s say, you’ve moved over the derived data and you’re serving it to the business apps, and now you’re working through recreating all the transformations so you can completely turn off the legacy system, you can run Datadiff to make sure that what’s built from scratch on the new system fully matches the old. And that way you accelerate your iteration for every asset that you’re migrating and accelerate the overall project. So this is a case study from Faire. Like I said, this is a success story, unlike what I experienced in my days of data engineering. They were actually migrating 5,000 tables, and they were able to accelerate the timeline of the project by about six months by automating testing with Datadiff. And the cool thing that they did is they scheduled diffs to run between the legacy and the new system every day for every dataset. So every day they had a really good idea: what is the parity between the old and new system? What are the things they need to fix? What are the things they need to prioritize, where the discrepancies were large? And very importantly, they provided their business stakeholders full visibility into how the project was faring.
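The daily parity tracking described above can be sketched as a simple report over per-table diff counts; the table names and numbers below are invented for illustration:

```python
# Per-table results from a nightly diff run between legacy and new systems
# (all values are made up for this sketch).
diffs = {
    "fct_orders": {"rows": 1_000_000, "mismatches": 12_000},
    "dim_sellers": {"rows": 50_000, "mismatches": 0},
    "fct_payments": {"rows": 400_000, "mismatches": 40_000},
}

def parity(d):
    """Fraction of rows that match between the two systems."""
    return 1 - d["mismatches"] / d["rows"]

# Rank tables worst-parity-first so the team knows what to fix next.
report = sorted(diffs, key=lambda t: parity(diffs[t]))
for table in report:
    print(f"{table}: {parity(diffs[table]):.1%} parity")
```

A report like this, regenerated daily, is also what gives business stakeholders visibility into how close the migration is to done.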

So that’s how you run successful migrations. If you want to talk to me, you can find me on LinkedIn or email me at [email protected] and I’ll be happy to show you how you can use Datafold for accelerating your migration projects.