March 2, 2023

10:10 am - 10:40 am PST

Data Lakehouse – Why, What & How: Accelerate and De-risk Adoption

During this session, our focus will be on the evolution of data management needs, which led to the development of data lakehouses from data warehouses. We will explore various use cases of how organizations have integrated data lakehouses and the benefits they have gained from doing so. Additionally, we will discuss how enterprises can quickly adopt data lakehouses, considering the different architectural options available. Finally, we will highlight how DBShift™ can facilitate and expedite the adoption of data lakehouses while ensuring a risk-free process.

Topics Covered

Lakehouse Architecture



Note: This transcript was created using speech recognition software. It may contain errors.

Ramki Ranganathan:

Hello everyone. My name is Ramki Ranganathan, and I'm thrilled to be speaking with you today about data lakehouse adoption. With over 15 years of experience at Systech Solutions working in data management and analytics, I've seen firsthand the tremendous value that a properly implemented data lakehouse can bring to organizations of all sizes and industries. This experience has given me a unique perspective on the challenges and opportunities involved in adopting a data lakehouse approach, and I'm really excited to share my insights with you all today. Thank you for joining me today, and I look forward to our discussion.

In today's session, the first part will discuss the case for data lakehouses. Next, we'll explore data lakehouse adoption strategies, including best practices for planning and implementing a data lakehouse, and tips for ensuring a successful transition. After that, we will look at the data lakehouse migration scope, including things to consider and how to manage the migration process to minimize disruptions. We'll then understand how DBShift, Systech's unique approach, can help organizations de-risk and accelerate their data lakehouse migration. In the case study section, we'll take a closer look at a real-world example of a successful modernization done through DBShift. Finally, we'll wrap up with a Q&A session, giving you the opportunity to ask any questions you may have.

A brief introduction about Systech Solutions: we are a modern analytics consulting firm, and we specialize in data management and analytics products and services. We were founded in 1993, and we have grown to over 350 global consultants and 400-plus customers. Our mission always has been to enable business transformation through the intelligent application of data. Our key focus areas include data management, analytics, and insights through core and advanced analytics, including machine learning, cloud modernization, and building intelligent applications. Overall, we have been a trusted partner for several organizations looking to harness the power of data to drive business success, and our team of experienced consultants can help you transform your business operations and unlock new opportunities for growth.

So let's dive right into the session today. We are all familiar with the data warehouse. A data warehouse follows what we call a top-down approach because of its model-first design. In a top-down approach, the process begins by understanding the business requirements, the business problems that we are trying to solve, and the analytics, dashboards, and reports required to accomplish the business goals. This involves identifying the business dimensions, key performance indicators, and metrics that are critical to providing the insights necessary to meet the business goals. Based on this understanding, the data warehouse schema and model are created. Once the schema and model are created, the focus then shifts to the ETL process: the data is extracted from the source systems, transformed into the desired format, and then loaded into the data warehouse. Once the data is available in the data warehouse, reports and dashboards are built based on the specifications and delivered to the business users. While this approach works great, there are some disadvantages to it, which will be evident once we understand the bottom-up approach.

A data lake, in contrast, allows organizations to store large volumes of structured, semi-structured, and unstructured data in a centralized location without the need for a predefined schema. The data is ingested into the data lake in its raw form and is transformed later as needed for analysis and reporting. In complete contrast to a top-down approach where the schema is defined first, the bottom-up approach allows organizations to store and process large volumes of raw data in a flexible and scalable manner. Now, this bottom-up approach provides the ability to ask more of your data, since the data lake has all the information needed without any constraints. However, it also requires more advanced data management and processing techniques to ensure that the data is properly curated and transformed when needed. To summarize, the bottom-up approach differs from the top-down approach by not limiting the data elements available for analysis based on a business requirement. This is partially made possible because the cost of storing and processing raw data in a data lake is far cheaper compared to a data warehouse.
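To make the schema-on-read idea behind the bottom-up approach concrete, here is a minimal Python sketch. The event fields and the helper function are invented for illustration: raw records land in the lake untouched, and each consumer applies its own schema only at query time.

```python
import json

# Raw events land in the lake as-is: no upfront schema, fields vary per record.
raw_events = [
    '{"user": "a1", "action": "click", "ts": 1}',
    '{"user": "b2", "action": "view", "ts": 2, "device": "mobile"}',
    '{"user": "a1", "ts": 3}',  # partial records are kept, not rejected at load
]

def read_with_schema(lines, columns):
    """Schema-on-read: the projection is chosen at query time, not at load time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({col: record.get(col) for col in columns})
    return rows

# Two consumers apply two different schemas to the same raw data.
clicks = read_with_schema(raw_events, ["user", "action"])
devices = read_with_schema(raw_events, ["user", "device"])
```

The same raw store serves both consumers; nothing was discarded up front, which is exactly the "ask more of your data" property described above.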

While augmenting a data warehouse with a data lake satisfies an organization's information needs, there are tons of benefits when these two environments are combined into what we call a data lakehouse. It can reduce costs by avoiding data silos and data transfer processes between the data warehouse and the data lake. It improves reliability by eliminating data transfers, thereby reducing the risk of quality issues and ensuring data consistency. Data duplication is avoided by combining the data warehouse and the data lake. The data lakehouse supports all different kinds of data personas, be it people consuming dashboards, power users, business analysts, data scientists, and whatnot. It also provides a common metadata layer that can be used to apply fine-grained data governance policies. It also helps to treat data as a product that can be grouped into domains to support a data mesh kind of architecture. Finally, the data lakehouse helps to keep data organized and free from the chaos of data swamps, which are unstructured data lakes that can become difficult to manage over time.

Now that we have seen the benefits of a data lakehouse, how are enterprises adopting it? I would like to walk you through the results of research conducted by Ventana Research in 2022 on this topic. The Data Lakes Dynamic Insights research aimed to understand the current state and future trends in data lakehouse adoption. The research involved a survey of 500 professionals across a range of industries, roles, and geographies. Here are the key findings of the research. The key stats to focus on are: number one, the adoption of data lakehouses is increasing rapidly, with 91% of organizations already using data lakes for analytics. More importantly, 23% of the companies have migrated their data warehouse to a data lakehouse. Now, while data lakehouses offer several benefits, organizations do face challenges in adopting them. One way to mitigate the risk is to follow a migration methodology.

The methodology starts with a discovery phase that analyzes the current data landscape to come up with a specific multi-phased adoption roadmap. Then comes the analysis and design phase, which takes a complete inventory of the objects that need to be migrated and performs structural mappings between source and target systems. Then comes the migration and re-engineering phase. In this phase, the majority of the tasks required for data lakehouse adoption happen. This typically involves data structure migration, data migration, ETL and SQL migration, and dashboard and visualization migration, for example Tableau or Power BI, as well as Python notebook migration. And in some of the dashboards, custom SQL is used, and that needs to be migrated as well. The final phase is the deploy phase, in which we completely test and validate the objects migrated to the data lakehouse, and then we finally deploy the data lakehouse to production and monitor its usage. We'll look at each of these individual phases in detail and see how our DBShift solution can de-risk this adoption through automation and acceleration.

Now, the discovery phase is a process that helps organizations plan and implement a successful adoption of a data lakehouse. The overall goal of this phase is to come up with a strategy for data lakehouse adoption, and it is built around four pillars, namely data, technology, policy, and skill. Under the data pillar, the organization studies its data inventory, data touchpoints, and the frequency and mechanisms of data usage. This helps them understand the data assets they have and how they can use them to derive business value. Under the technology pillar, the organization performs a technology assessment and comes up with a future-state architecture for the data lakehouse. This ensures that the organization has the right technology infrastructure in place to support the data lakehouse adoption. Under the policy pillar, the organization studies its organizational culture, methodology, and standards to ensure that the data lakehouse adoption aligns with its policies and processes. The organization also identifies success metrics to measure the impact of the lakehouse adoption. Finally, under the skill pillar, the organization performs a data literacy assessment to identify training needs and ensure that its people have the necessary skills to derive business value from the data lakehouse adoption. The key outcome of this discovery phase is a multi-phased, milestone-based implementation roadmap, along with technology platforms and a data literacy plan to equip the different data personas with the required skills.

Then comes the migration phase. Now, irrespective of the data lakehouse adoption strategy, which is either to replace the data warehouse or to augment the data warehouse with the data lakehouse, you might have to deal with one, two, or all of the work streams listed here. At the core are the actual data structure and data migrations: from the data warehouse to the data lakehouse platform, data structures and historical data have to be migrated. Now, there are touchpoints on either side of the data warehouse which need to be migrated as well. On the left-hand side, we have what we call enterprise ETL, or extract, transform, and load processes, or SQL processes that bring data into the data warehouse. On the right-hand side, we have the consumption of the data in the data warehouse through analytics, visualization, or dashboard platforms, either using out-of-the-box features or using custom SQL to extract data from the data warehouse. These are the three work streams that need to be handled as part of this data lakehouse adoption. The suggested approach is to have a multi-phased migration plan, focus on data migration during the initial phase, and deal with the touchpoint platforms in future phases. In either adoption strategy, whether to augment or to replace the data warehouse, we have to deal with the migration of these work streams and perform detailed validation as well. Now let's take a closer look at each of these work streams.

Now, the data migration work stream involves the migration of database objects, including tables, views, stored procedures, functions, et cetera, from a traditional data warehouse platform like Netezza to a modern data lakehouse platform like Dremio. In addition to the data structure migration, historical data has to be loaded into the data lakehouse and validated. Based on the complexity of the data warehouse and the volume of data, we might have to do repeated migrations or phased migrations, and at every point we have to validate the migrated data. Now, in addition to the automated migration following the forklift approach, we look at specific areas of the data lakehouse that can be re-engineered to take full advantage of the features that a data lakehouse platform like Dremio has to offer. This is called the re-engineering phase. There are many data validation strategies that can be applied to different areas of the data warehouse: starting from a simple row-count-based validation, we can go all the way up to a complete 100% validation of every single row and every single column. Now, migrating data structures and historical data from a data warehouse to a data lakehouse can be a complex process that poses several challenges, and here are some of the key challenges.
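The tiered validation idea, from cheap row counts up to a full row-and-column comparison, can be sketched in Python roughly as follows. The toy tables and function names here are mine for illustration, not part of any product:

```python
import hashlib

# Toy source (warehouse) and target (lakehouse) tables as lists of row dicts.
source = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 20.5}]
target = [{"id": 1, "amt": 10.0}, {"id": 2, "amt": 20.5}]

def row_count_check(src, tgt):
    """Cheapest tier: only compare cardinality."""
    return len(src) == len(tgt)

def checksum_check(src, tgt):
    """Strictest tier: order-independent hash over every row and column."""
    def table_hash(rows):
        digests = sorted(
            hashlib.sha256(repr(sorted(r.items())).encode()).hexdigest()
            for r in rows
        )
        return hashlib.sha256("".join(digests).encode()).hexdigest()
    return table_hash(src) == table_hash(tgt)
```

A single corrupted value passes the row-count tier but fails the checksum tier, which is why repeated, progressively stricter validations are worth the cost on critical tables.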

Data warehouses and data lakehouses have different data structures and storage formats. Data warehouse data is usually structured, while data lakehouse data is often unstructured. This means that the data must be transformed and restructured to fit the new environment. Data warehouses often store a large amount of data, and transferring this volume of data to a data lakehouse can be a time-consuming and resource-intensive process. It's important to have a plan in place to manage the transfer of large volumes of data. Ensuring the security of sensitive data during the migration process is also critical. Repeated data validations have to be performed to ensure the quality of the data migrated to the data lakehouse. Just to give you a sense of the complexity involved in the data migration, I would like to take a scenario where, for example, you are migrating from a Teradata data warehouse to a new data lakehouse platform. There are well over 1,600 built-in methods and operators in Teradata which need to be migrated to a platform like Dremio, of which more than 50% are not directly mapped to the platform. There are about 22 data types in Teradata compared to close to 10 data types in Dremio.
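To illustrate the shape of such a mapping, here is a tiny, hypothetical fragment of a source-to-target type map in Python. The pairings are illustrative placeholders, not an authoritative Teradata or Dremio reference:

```python
# Illustrative (not authoritative) source-to-target type map.
TYPE_MAP = {
    "INTEGER": "INT",
    "DECIMAL": "DECIMAL",
    "VARCHAR": "VARCHAR",
    "BYTEINT": "INT",   # hypothetical widening where no exact equivalent exists
    "GRAPHIC": None,    # no direct mapping: needs manual re-engineering
}

def map_type(src_type):
    """Return the target type, or flag the column for manual review."""
    tgt = TYPE_MAP.get(src_type.upper())
    return tgt if tgt is not None else f"REVIEW:{src_type}"
```

The important point is the `None`/`REVIEW` branch: with half the operators unmapped, an automated migration tool has to surface the gaps explicitly rather than silently converting.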

So let me take a slide and show you the data type mappings between the two platforms. This is a Teradata-to-Dremio mapping of data types. As you can see, there are certain data types which are not readily available, for example, GRAPHIC. Same thing with the built-in functions: there are well over 300-plus functions, based on the version of Teradata that you're using, that need to be mapped or converted to Dremio. Some of these functions are ANSI standard, so they can be used as-is, and Dremio also does a fabulous job in making sure it's compatible. But there are Teradata-specific functions which need to be migrated as well. Very quickly, I'll just show you the interface of

Ben Hudson:

Ramki, I think your slides might be a little stuck. Do you want to refresh your screen share? We're stuck on the data migration slide that talks about data validation types and data structure migration.

Ramki Ranganathan:

Sure. Thanks for letting me know. Thanks. Let me stop sharing and re-share again.

Ben Hudson:

Yep, no worries. Now we have it. Now we have DBShift, and we can see step one, step two, step three.

Ramki Ranganathan:

[inaudible]

Ben Hudson:

So we can see your full screen now.

Ramki Ranganathan:

[inaudible]

Ben Hudson:

Oh, we couldn't see the data mappings from Teradata to Dremio either.

Ramki Ranganathan:

Sorry about that. Let me try it one more time.

Ben Hudson:

There you go. Yep. We can see your editor now.

Ramki Ranganathan:

Yep. So this was what I was talking about, the data type mappings between Teradata and Dremio. And here are the functions, about 330 functions, that need to be migrated to Dremio, right? And I was about to go into our DBShift platform. So very quickly I'm going to show you: let's say I, as an analyst, have a bunch of SQL statements in Teradata that I need to migrate to Dremio. So I use the platform, and I have a bunch of SQL files that I can upload,

And it gives you complete flexibility in terms of doing the migration of all these SQL statements in one shot, or you can pick and choose what you want to migrate. Let me take a brief moment to show you some of the SQL and the differences between Teradata and Dremio highlighted here. You know, you can directly do data manipulations in Teradata which need to be converted to Dremio-compatible SQL. More importantly, I want to show you the stored procedures. Stored procedures are not readily available in Dremio, and even a simple procedure like this one in Teradata matters, because companies who have been using Teradata invested heavily in stored procedures. So even things like that, you can actually use the platform to convert to Dremio, and in this case, it has to be converted to a user-defined function using a combination of Python, right?
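To give a flavor of that conversion, here is a hypothetical sketch: imagine a simple stored procedure that trims and title-cases a customer name and assigns a status based on account balance. Its procedural IF/ELSE logic can be rewritten as a plain Python function that the target platform could expose as a user-defined function. The business rule and all names here are invented for illustration:

```python
def normalize_customer(name, balance):
    """Hypothetical rewrite of a simple stored procedure as a Python function.

    The procedural logic (string cleanup plus an IF/ELSE on balance) moves
    out of warehouse-specific SQL and into ordinary, portable Python.
    """
    clean_name = " ".join(name.strip().split()).title()  # trim + collapse spaces
    status = "PREMIUM" if balance >= 10_000 else "STANDARD"
    return clean_name, status
```

The real work in a migration is finding and porting hundreds of such procedures, which is where automated conversion earns its keep.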

Now, in addition to converting these SQL statements and procedures, there's the data validation I was talking about: all the converted queries can be validated in one shot in a unified interface here, and you can choose what kind of data validations you want to perform.

So I just wanted to quickly show you the interface as well. Let me jump right back into the presentation. So that's about the data migration.

So in the ETL or SQL migration phase, we migrate all of the touchpoints that feed data into the data warehouse and rewire them to the data lakehouse. There are two different approaches that are possible. The first approach is where the ETL platform stays as-is, in which case the ETL processes in the platform have to be rewired and changed to point to the data lakehouse. In the second approach, a modernization of the ETL platform is involved: for example, you might decide to move from SQL Server Integration Services to AWS Glue or Matillion, in which case a complete re-engineering is required. Also, as stated in the previous data migration phase, repeated data validations have to be performed on these ETL processes as well. So individual ETL processes have to be run in similar environments on the data warehouse as well as on the data lakehouse, and the results have to be compared.
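The first, rewire-only approach essentially amounts to a bulk update of job configurations. A minimal Python sketch, with an invented config shape and invented connection names:

```python
def rewire_jobs(jobs, old_conn, new_conn):
    """Point every ETL job that writes to the old warehouse connection
    at the new lakehouse connection, leaving all other jobs untouched."""
    rewired = []
    for job in jobs:
        job = dict(job)  # copy, so the original job inventory is preserved
        if job.get("target_connection") == old_conn:
            job["target_connection"] = new_conn
        rewired.append(job)
    return rewired

# Hypothetical job inventory: one warehouse loader, one unrelated job.
jobs = [
    {"name": "load_orders", "target_connection": "teradata_prod"},
    {"name": "load_logs", "target_connection": "s3_raw"},
]
updated = rewire_jobs(jobs, "teradata_prod", "lakehouse_prod")
```

In practice each ETL platform exposes its own API or repository format for this, but the pattern is the same: enumerate jobs, swap the target connection, then re-run and validate each one against the old output.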

And the same is the case with analytics migration as well. So you might have your reports and dashboards in Power BI and decide to stick with Power BI, in which case all those dashboards and reports need to be rewired to the data lakehouse and validated. Or you can perform a modernization: you can decide to move from, for example, MicroStrategy to the Power BI platform. Now, even the simple task of rewiring can be daunting because of the thousands and thousands of reports: in one of our migrations we had about 30,000-plus dashboards and reports in MicroStrategy, right? So rewiring each of these dashboards and validating them can be a daunting process, right? So that's where DBShift comes in. Now, as you can see, all three migration work streams, the data migration, the ETL or SQL migration, and the analytics migration, involve a lot of effort and a lot of skills, and each has its own risks and challenges.

Now, Systech's migration accelerator solution, DBShift, is a framework which can reduce these risks through automation and acceleration. DBShift can automatically map the data structures, perform the data migration, and also automate and accelerate the migration of the other two work streams as well. Now, just to pause here, I will show you what this looks like if we have to migrate from a platform like Informatica to, for example, Glue. So I have a sample set of Informatica processes here, and of course these processes are specific to the ETL platforms that you have. In Informatica, you have the concept of mappings, sessions, sources and targets, and workflows. In the case of rewiring your ETL processes in Informatica to a data lakehouse, you can just pick and choose your source and target connections to point to the data lakehouse and validate them. In the case of complete re-engineering, you can actually choose the processes that you want and the target platform that you want to migrate to, right? As you can see, some of these processes are really complicated, you know, when you're transforming data and loading it into the data warehouse.

So jumping right back to the presentation. Not only can DBShift automate the data structure, data, ETL, and report migrations, it can also automate the data validation processes as well. Right, so that's the power of DBShift.

Overall, we have seen that we can achieve about 70% automation through the DBShift platform, and I will illustrate that with a case study here. Before that, very quickly, the benefits, the business value, of using DBShift. Cost savings: DBShift can help reduce modernization costs by up to 70%. Speed of modernization: it can reduce the effort by at least 50% through automation. And it's completely flexible: it can be easily customized to handle special migration requirements, allowing for more flexibility in the modernization process. It has monitoring dashboards and reports that you can use to clearly track the progress of the migration, which can improve project management and reduce uncertainty. Finally, reliability: DBShift is based on our proven migration methodology and our skills, which completely de-risks and accelerates your modernization journey.

Now, with respect to a case study, we used DBShift to migrate and modernize one of the major insurance providers in North America. This multiline insurer faced two major challenges: one was the modernization of their end-of-life data warehouse, and the second was modernizing their data wrangling scripts from SAS to [inaudible]. They used Systech's DBShift to accelerate the migration. Just to give you a sense of the migration effort, we had about 3,000 processes in Informatica that we migrated in about four weeks. This gave them a lot of opportunity to cut down their migration costs and also to focus more on how to use the insights derived from a modern platform. So that's all, folks. Thank you for attending this session on data lakehouse adoption. I hope you found the information and insights shared to be helpful in your journey towards modernizing your data management practices. If you have any further questions or would like to discuss how DBShift can help you achieve your data management goals, please feel free to reach out to us using the contact info on this slide. Thank you once again for your time and attention.