Subsurface LIVE Winter 2021
Flexible Data Lake Architectures for Seamless Real-time Data and Machine Learning Integrations
This talk was born from some of our greatest victories won and worst losses suffered while designing and implementing data lakes, with a focus on real-time processing and machine learning pipeline integration. We will go through the various design problems spawned from the specific integrations and solutions we have used—from caching to avert the Slowly Changing Dimension problem through operational and analytical cluster separation to the fully-fledged MLOps process. We will showcase, using real examples, how those use cases are reflected in the data lake architecture, both when building from scratch and evolving an existing solution.
For the data architect, this session will provide a greater understanding of available design patterns. To a data scientist, it will provide a better understanding of the soon-to-be working environment.
Kamil Owczarek, Head of Big Data, GFT
Kamil Owczarek is the Head of Big Data at GFT Poland and a data engineer at heart. Kamil specializes in projects connected to stream processing and machine learning. His personal motto: "Data is always more important than you recognize it to be".
Piotr Kosmowski, Solution Architect, GFT
Piotr Kosmowski is a solution architect at GFT. He is experienced with the design and delivery of multi-tier solutions for the financial sector including microservices, public APIs and data lake in both on-premises and cloud environments. Working closely with the client, business and development team, Piotr helps improve organization delivery processes with an agile mindset.
Welcome everyone. Thank you for joining us for this session. A reminder: we will have a live Q and A after the presentation, and we recommend activating your microphone and camera for the Q and A portion of the session. Please join me in welcoming our speakers, Piotr and Kamil. Let’s get started.
Okay, so let me share my screen. [00:00:30] Can you see it?
Okay. So hello, everyone. Welcome to our session, in which we will tackle our daily experience in building data lakes with real-time stream processing and machine learning integrations, with a focus on how this impacts the architecture of said [00:01:00] data lakes. My name is Piotr Kosmowski and I’m a solution architect at GFT Poland. For this presentation I’m joined by Kamil Owczarek, our head of big data practice. We are going to take you on a journey from data lakes to integrating batch with real-time processing and known architectures. Then we will talk about ML integration, MLOps, and the distinction between operational and analytical environments.
[00:01:30] So let’s start the presentation by telling you something you already know: how and why data lakes are segmented. Obviously, data lakes are massive aggregates of data from hundreds of thousands of sources. This data often comes in messy, with no schema or not really fulfilling its theoretical schema, and contains a low amount of useful [00:02:00] and required information for the final consumer. Sometimes it contains information it shouldn’t, like personal or financial data of your customers, or, from the regulation perspective, the physical location of the data is constrained to strict regions, zones or even data centers. This obviously means that for each data set coming in, you will get a number of transformations, starting with data sampling and [00:02:30] refinement and going on to potentially a number of transformations that combine it with other datasets, each having their own refinement process. Of course you could, in theory, do it all in one transformation.
However, that very quickly leads to problems with maintenance, security and traceability. So naturally segmented data lakes evolved, where we usually have around three standard zones. The landing zone, also known as the raw [00:03:00] or bronze zone, is where the data lands as it was obtained from a source system, usually just in a different storage format. Then comes the processing zone, also referred to as the staging or silver zone, where the data is refined and improved in quality, its schema is enforced, et cetera. And then the data mart, also known as the processed or gold zone, is where we store datasets resulting from taking those [00:03:30] from the previous zone, combining and enhancing them to serve data for a well-defined use case. Going further, this also nicely covers the security part of the requirements, since usually in systems like those the computation jobs creating the second and third zones cannot access external data sources, but can only get data from the previous zone. Additionally, we would typically limit sensitive data to the [00:04:00] landing zone only, which can have extra security placed on it.
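As an illustration of the zoning rule described above, here is a minimal Python sketch. The names `Zone`, `allowed_inputs` and `check_job` are hypothetical and not taken from the talk; the sketch only assumes that a job writing a zone may read from the zone directly before it, and that only landing-zone ingestion may read external sources.

```python
from enum import IntEnum


class Zone(IntEnum):
    LANDING = 1      # raw / bronze: data as obtained from the source
    PROCESSING = 2   # staging / silver: refined, schema enforced
    DATA_MART = 3    # gold: combined datasets for a defined use case


def allowed_inputs(target):
    """Return the only inputs a job writing `target` may read."""
    if target is Zone.LANDING:
        # Only ingestion jobs may touch external source systems.
        return {"external"}
    # Every later zone reads exclusively from the zone before it.
    return {Zone(target - 1).name}


def check_job(target, inputs):
    """Raise if a job declares an input its target zone may not read."""
    forbidden = set(inputs) - allowed_inputs(target)
    if forbidden:
        raise PermissionError(
            f"job for {target.name} may not read {sorted(forbidden)}"
        )
```

A check like this could run when a job is registered, so that a gold-zone job declaring an external source is rejected before it ever executes.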
Now that we are on the same page about standard data lakes, let’s talk about the evolution of this concept. Our clients usually have some data lake with batch processing in place, which is a great data source for upstream systems like analytics, external APIs, et cetera. But at some point the business demands new real-time processing capabilities, [00:04:30] very often in combination with ML, such as real-time recommendations, proactive servicing, automated user assistance, and many, many more. Machine learning has gained huge popularity over the past couple of years. However, most of the literature is about preparing models and algorithms, not about how to properly prepare and use real-time data in production or how to promote ML models to production in corporate environments.
[00:05:00] Let’s talk first about the challenges of integrating the batch and real-time worlds. Applying the same schema from batch to streaming data sets is very often not possible. As for the code base, this is no longer relevant, since modern frameworks support both paradigms with great success, yet there are still some challenges we need to remember when it comes to [00:05:30] designing architecture and pipelines. When it comes to real-time processing, low latency comes before high throughput, as we want to see the results as soon as possible. Data is processed in way smaller volumes, constrained by a window size. Batch processing, on the other hand, must process big payloads, but the whole payload is known before the processing starts. This gives [00:06:00] batch processing additional opportunities for optimization, like clustering, data movement, local caching, and dynamic optimization, which are mostly unavailable for stream processing.
Next is joining data between two data sets which change at very different frequencies, and I will demonstrate it with an example. In [00:06:30] batch processing, all reference data is present before processing starts, so joining two data sets is similar to joining two materialized views or tables. In streaming, joining facts with dimensions which change way slower than the processing window size is not that trivial. In the first row, we can see the transaction facts and the corresponding user dimension. As joining data [00:07:00] is possible only within a window, sometimes transactions won’t be able to find the last change for the user, like in the three middle windows. A different window size can be used, of course, but as a downside the amount of transactions to be processed increases at the same time. As a solution, some caching mechanism must be used for slowly changing data, like compacted Kafka topics or NoSQL databases with fast data access. But [00:07:30] at this point, the whole architecture becomes complex from the operational perspective. So let’s discuss some well-known big data architectures which address those challenges, and talk about some of their strong and weak points.
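The windowed-join problem and the caching workaround described above can be sketched in plain Python. This is an illustration only: `UserChange`, `Transaction`, `window_join` and `DimensionCache` are hypothetical names, and the in-memory dictionary merely stands in for a compacted topic or a key-value store.

```python
from dataclasses import dataclass


@dataclass
class UserChange:
    user_id: str
    tier: str
    ts: int  # event time in seconds


@dataclass
class Transaction:
    user_id: str
    amount: float
    ts: int


def window_join(txs, changes, window_start, window_end):
    """Naive join: only sees dimension changes inside the same window."""
    in_window = {c.user_id: c.tier for c in changes
                 if window_start <= c.ts < window_end}
    return [(t.user_id, t.amount, in_window.get(t.user_id))
            for t in txs if window_start <= t.ts < window_end]


class DimensionCache:
    """Keeps the latest value per key, regardless of window boundaries."""

    def __init__(self):
        self._latest = {}

    def apply(self, change):
        # Assumes changes arrive in event-time order per user.
        self._latest[change.user_id] = change.tier

    def enrich(self, tx):
        return (tx.user_id, tx.amount, self._latest.get(tx.user_id))
```

With a user change at second 5 and a transaction at second 65, a 60-second window join finds no matching dimension row for the transaction, while the cache still returns the user's last known tier.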
The Lambda architecture is about maintaining two distinct paths for batch and streaming independently. It’s natural for existing data lakes which require modernization and streaming features. [00:08:00] The batch layer is responsible for producing reliable data from ingested feeds in a batch manner, and it leverages cheap storage for storing immutable data in different data formats. The speed layer aims to handle the data that is not yet delivered in the batch view due to the latency of the batch layer. It only deals with recent data, in order to provide a complete view [00:08:30] of the data to the user by creating real-time views. The serving layer consolidates outputs from the batch and speed layers and allows external clients to perform queries over the most recent data with a great level of consistency. Unfortunately, this architecture comes with a trade-off in terms of maintenance and a complex technology stack.
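The batch, speed and serving layers described above can be sketched as three small functions. This is a toy illustration under stated assumptions: events are `(key, amount, timestamp)` tuples, `batch_cutoff` is a hypothetical marker for the last completed batch run, and the "views" are plain dictionaries rather than real storage.

```python
def batch_view(events, batch_cutoff):
    """Batch layer: totals per key over everything up to the last batch run."""
    totals = {}
    for key, amount, ts in events:
        if ts <= batch_cutoff:
            totals[key] = totals.get(key, 0) + amount
    return totals


def speed_view(events, batch_cutoff):
    """Speed layer: totals only for recent events the batch run has not seen."""
    totals = {}
    for key, amount, ts in events:
        if ts > batch_cutoff:
            totals[key] = totals.get(key, 0) + amount
    return totals


def serve(key, batch, speed):
    """Serving layer: merge both views to answer a query over all data."""
    return batch.get(key, 0) + speed.get(key, 0)
```

Even in this toy form the aggregation logic is written twice, once per layer, which hints at the maintenance trade-off mentioned above.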
The main premise behind [00:09:00] the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. All data is handled in an event-driven manner, and there is no separate technology to handle batch processing. It is based on a streaming architecture in which an incoming series of data is first stored in a messaging engine like Apache Kafka. From there, a streaming pipeline handles end-to-end [00:09:30] processing of the data. Processing historical data, or reprocessing data, for example due to a code change, can be done by reprocessing all the events from a certain point in time. At this point, some challenges may occur with messaging system scalability or with storage associated with the retention policy and SLAs.
The Delta architecture, on the other hand, relies on [00:10:00] some data engine which adds ACID on top of the underlying storage. Apache Iceberg, Apache Hudi, or Delta Lake are good representatives here. At some point, data can be modified by leveraging event sourcing under the hood, which makes accessing consistent data much easier in comparison to Lambda for both batch and [00:10:30] speed layers. The data can be seamlessly handled and processed at the same time in a batch and real-time fashion, which eliminates many logical challenges at the very low level. Naturally, it comes with additional features related to data quality and security, but this topic exceeds the scope of this presentation.
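The idea of reprocessing by replaying events from a certain point in time, as in the single-stack streaming approach described above, can be sketched as follows. `EventLog` and `run_pipeline` are hypothetical names; the list-backed log only stands in for a retained message topic and is not how a real messaging engine is implemented.

```python
class EventLog:
    """Append-only log standing in for a retained message topic."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read_from(self, offset=0):
        return iter(self._events[offset:])


def run_pipeline(log, transform, from_offset=0):
    """The same streaming code serves live processing and reprocessing."""
    return [transform(e) for e in log.read_from(from_offset)]
```

After a code change, the new `transform` is simply run again from an earlier offset, which is also why retention and log storage become the limiting factors.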
All right. So now let’s talk about machine learning integration, MLOps, and good practices for separating [00:11:00] analytical and operational environments. Kamil, passing this over to you.
Well, thank you for the floor. Yeah. Let’s talk about what we essentially managed to find in terms of how your data lake sort of grows certain useful appendages when you’re trying to integrate machine learning. In doing that, I want to take you on a small journey where I’m first going to start by complaining about the problems that we have and listing them. Then I’m going to go on, [00:11:30] little by little, building up the solution until we get to our final conclusions. So the first problem that we really hit in trying to integrate our data engineering world, so basically the data lakes that we were building, with machine learning, for reasons like, say, wanting to use machine learning algorithms to automatically clean your data from, say, your bronze layer, or wanting to do some predictive analytics on the already improved data from silver and [00:12:00] put that in gold for your external clients, is that data scientists, the people who develop those machine learning algorithms that we then want to apply, kind of live in a very different world than data engineers. That starts from very basic stuff, like version control systems and how you version what you’re doing. In the data engineering world, you would usually have Git for that.
Then you would work in the Git flow model, multi-branch basically, whereas for data science, when you’re experimenting, it’s more of a linear process [00:12:30] in that you’re taking your one model and trying different combinations of features and different inputs. Then you just want to go through your efficiency metrics and try to choose the best one. That’s one example. But then we’re also going through things like, for instance, in data science it doesn’t really pay off that much to have multiple instances of your environment. Generally, as a data engineer, you’re used to having a dev environment and then QA, at least those two, and then PROD, whereas [00:13:00] in data science, it really pays off to have one environment where you can experiment on sample data from PROD and are potentially also able to reach some external data sources ad hoc, right?
And then of course the core tech stack is different. Again, in writing your compute jobs for your data lakes, you would usually use massive processing frameworks like, say, Spark or Flink, or possibly something for streaming as well. But then when you’re doing data [00:13:30] science and constructing the algorithm, you’re usually not working on terabytes of data; you just have carefully chosen sample data available to you. So you can easily use all those popular libraries like scikit-learn and pandas, which are basically single-threaded Python libraries, right? And the thing is, one is the environment in which the algorithm is developed, and the other is where it then has to be applied. So we have to think carefully about the touch points to sort of bridge the disconnect [00:14:00] between the two worlds. The other big problem we were frequently running into is different approaches to data security, because again, in the world of data lakes, you’re building the whole segmented data lake, like you mentioned, to keep your security at a maximum.
If you’re building computation jobs which are building the silver zone, then you’re only going to enable them to reach the bronze zone for input. And all the external data sources, you want to have them carefully onboarded into your data lake. [00:14:30] You’re probably also going to have some things like trying to eliminate confidential data and personally identifiable data early. Whereas when you’re developing the ML model to be applied later, you want to be working on data that comes from PROD, so that the model is then able to predict, for instance in predictive analytics, on your actual PROD data. And you also want to be able to reach out ad hoc to external data sources. So those two approaches to data security don’t [00:15:00] mix nicely. One of the things that we started with on our journey towards the final solution to that disconnect was MLOps, which is defined by Wikipedia as a practice for collaboration and communication between data scientists and operations professionals, or data engineers for that matter, to help manage production ML.
And it encompasses a lot of techniques, from [00:15:30] model registration and promotion, to how you expose your model to the external world, to monitoring with possible retraining once it’s running in production. But what I’m going to focus on is mainly the registration and promotion part, because this is how we managed to go towards bridging that disconnect, which was our core problem. If you look at how the model is usually developed nowadays, you have your data from your actual system, [00:16:00] so let’s say our data lake, and you usually have some data from external systems, and the data scientist is running the experiment, trying to fit the model for maximum accuracy on the data they have. That process goes on and on, and each round of it produces a certain instance of the model, better or worse, which is registered in your model registry.
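The experiment loop described above, where each round registers a new model instance, can be sketched with a tiny in-memory registry. This is not the API of MLflow or of any other product; `ModelVersion`, `ModelRegistry` and the stage names are hypothetical, and selecting by a single accuracy number is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    params: dict
    accuracy: float
    stage: str = "experiment"


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)

    def register(self, params, accuracy):
        """Record one experiment round as a new model version."""
        mv = ModelVersion(len(self.versions) + 1, params, accuracy)
        self.versions.append(mv)
        return mv

    def best(self):
        """Pick the version with the highest recorded accuracy."""
        return max(self.versions, key=lambda mv: mv.accuracy)

    def promote(self, mv, stage):
        """Mark a version as promoted to the given stage."""
        mv.stage = stage
```

The registry keeps every round, better or worse, so the choice of which version to promote stays reproducible.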
And once you’ve done enough, you’re able to choose, according to, let’s say, accuracy metrics, the best [00:16:30] model that you produced. This one then gets promoted, essentially, to be applied in practice. So it has to be packaged and put behind some external API. I guess nowadays HTTP is fairly popular with ML frameworks like, say, MLflow or the IBM one. Where we went with that is that the next step is definitely to also test your model on a wider scope of data, because it was developed and it worked [00:17:00] on a small subsample of data, but on a wider sample it might not be as accurate, right? So for that, you want to first put it together with your data pipelines, which are presumably your compute jobs in the data lake. You have it exposed.
So, say, via HTTP. So you can easily imagine how this can be integrated into data processing pipelines written in, say, Spark or Flink. You put that together in some sort of testing environment, feed [00:17:30] it your test data with prepared expected results, but, let’s say, on a more massive scale than before, and run the model and verify your accuracy metrics. If it passes the assumed threshold, then great, and we can start going towards production. But if it doesn’t, then it’s probably back to experimentation, and you need to develop another model. So this is, I guess, the first step: we want to bridge this disconnect by basically passing the model from the data [00:18:00] science world to your data lake for the testing stage, right? But then there was a second problem I complained about earlier, which is data security.
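The testing step described above, running the model against a larger labelled data set and comparing the result with a threshold, can be sketched like this. `accuracy` and `qa_gate` are hypothetical names, `predict` stands in for a call to the exposed model, and exact-match accuracy is an assumed metric.

```python
def accuracy(predict, labelled_rows):
    """Share of rows where the prediction equals the expected result."""
    hits = sum(1 for features, expected in labelled_rows
               if predict(features) == expected)
    return hits / len(labelled_rows)


def qa_gate(predict, labelled_rows, threshold):
    """Decide whether the model moves on or goes back to experimentation."""
    score = accuracy(predict, labelled_rows)
    decision = "promote" if score >= threshold else "back_to_experiment"
    return decision, score
```

In a CI pipeline the returned decision would drive the next step: promotion towards production, or a return to the sandbox for another round.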
And the way we approached solving that was to apply this nice technique called tokenization, which you’re probably familiar with. Essentially what it says is this: if you have your data and it’s overall sensitive, it’s probably not that every single field in the record is sensitive. It’s just that a few fields, or a combination of those, are sensitive. You can have things [00:18:30] like social security numbers and so on. So you just want to take those fields or combinations of fields which are sensitive and, instead of the actual value, put in a reference token for that particular value. And you can store the combination of value and token to later be able to go back if you need your data detokenized. It’s a basic and simple technique, but the good thing is that unless you have very specific data requirements or very [00:19:00] narrow data which is really filled with those highly sensitive fields, it’s usually able to solve your problem without much complication.
And so how you apply it to that particular case is this: you need data from PROD to run your experiments effectively, to develop a model which will work in PROD. So you can take your data, and you can take it from multiple zones in your data lake, from bronze, from silver, from gold, because for experimentation that’s what you may need. [00:19:30] You can run it through a tokenizer, which should probably still work and store its own data inside the secured PROD environment, possibly inside your data lake or some secure storage nearby. Once the data is tokenized and desensitized, I guess, you can then export it to the sandbox environment for data science experimentation. You’re not actually going to run a huge risk of exposing data which should never [00:20:00] be exposed, for obvious reasons. And then you’re able to develop a good model.
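The tokenization step described above can be sketched in a few lines of Python. The `Tokenizer` class, the `tok_` prefix and the in-memory mapping are assumptions made for illustration; in practice the value-to-token mapping would live in secure storage inside the production environment.

```python
import secrets


class Tokenizer:
    """Replaces sensitive values with random tokens and keeps the mapping."""

    def __init__(self, sensitive_fields):
        self.sensitive_fields = set(sensitive_fields)
        self._token_by_value = {}   # must stay inside the secure environment
        self._value_by_token = {}

    def _token_for(self, value):
        # The same value always maps to the same token, so joins still work.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def tokenize(self, record):
        return {k: self._token_for(v) if k in self.sensitive_fields else v
                for k, v in record.items()}

    def detokenize(self, record):
        return {k: self._value_by_token.get(v, v)
                if k in self.sensitive_fields else v
                for k, v in record.items()}
```

Only the output of `tokenize` would be exported to the sandbox; `detokenize` stays on the production side for the cases where going back to the original values is needed.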
There is some slim chance the model might actually depend on the very contents of those secure fields, in which case that solution does not apply. But again, our experience suggests that’s not a big portion of the situations you’re going to run into, maybe 5%. So I think we’re starting to build a picture here in terms of bridging the gap between the data science and the data lake world. [00:20:30] So I want to show you the final solution that we came up with by basically treading those steps. And that is: we will have two sorts of environments in our final architecture. The first is, let’s say, the operational environment, or you can say operational environments, in that you will have multiple instances like DEV, QA, possibly some SIT, and PROD. This is basically where our data lake will be.
So here we will adhere [00:21:00] to all the rules concerning data ingestion into the lake and concerning how compute jobs can only take data from their respective zones. So all the strict zoning rules, and rules about, let’s say, secure data not really going out of the landing zone. But we will also take our developed model and integrate it here. The other kind of environment that we want to have is the sort of appendage we grew by considering how to integrate machine learning. That’s [00:21:30] basically the analytical environment, or the sort of sandbox for data science people that we want to add to our data lake to have the models developed. It will have the extracts from production, but strictly passed through tokenization so as not to breach data security rules. It will allow ad hoc access to external data sources, although those may also undergo tokenization, actually. And it will have just one single instance, right?
It’s not going to be like DEV and QA, because [00:22:00] you’re not going to develop separate machine learning models for different environments. That doesn’t really make sense from a practical standpoint. And this is what it is. This is the sort of architecture we came up with. You can see your operational environments in their proper succession, with data pipelines being promoted. You can see the requirements for proper ingestion from external systems. But then you get the crucial appendage, which is the analytical environment. So from production, from all sorts [00:22:30] of zones that you may have in your data lake, you can have the data exported, being tokenized on the way. You can then allow the experiments to also grab data from other systems. That leads you to develop the model, which can be integrated back into the operational environment, both in QA for testing and in production for actual application.
And what we found there is that this architecture actually has a lot of benefits. For starters, it bridges [00:23:00] the gap between the data science world and the data engineering world, but it does it in a way that keeps them fairly separated. They just have well-defined touchpoints, which means that the data science people and data engineering people are still able to work within their own tech stack. We didn’t, say, force data science people to work in Spark or in Flink, right? That generally means they’re more familiar with what they’re doing and more productive that way. The other thing is that we of course managed [00:23:30] to preserve the strict rules of data security in our data lakes, and therefore there is no risk on that side. And last but not least, those two processes are clearly defined and they are interfaced in a way that they don’t really clash all that much.
That means that both can run smoothly on their own course, continuously. And you’re also able to automate it and make sure that you, for instance, have proper [00:24:00] continuous deployment and continuous integration, separately for model development and separately for improvements of the data lake and the development of your computational jobs. So those are the sorts of benefits that we managed to come across. I hope that this architecture and the solutions we presented were interesting to you, and with that, I think we shall go to Q and A.
Emily: Great, thanks so [00:24:30] much, guys. Yeah, we’ve got a lot of questions, so let me just jump right in. Our first question, it looks like, comes from Kuhn. Let’s see that coming... let’s try one more, Eugene. So we’ve got a question in the chat while we’re waiting for that, from Cassie [00:25:00] Kent: what is your approach to maintaining data history for both batch and streaming? Are SCDs required?
Piotr: So actually, it depends on what your model is, and it depends on the dependencies between the pipelines.
But I would say my [00:25:30] solution to that would be, generally speaking, that you need your streaming pipelines just to apply some things in real time, right? So the idea is to have the batch data and the streaming data usually archived in the batch region, in your standard data lake. That way, if you need it for a historical reason, it’s going to be there. But on the streaming side, you’d usually apply some sort of TTL, because this is essentially just [00:26:00] for fast computations, right? And if you need to go back to this afterwards, then you can do that on the batch side, in the big archive. Hope that answers the question.
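The split described in this answer, short-lived state on the streaming side and full history archived on the batch side, can be sketched as follows. `TtlState` and `ingest` are hypothetical names, the clock is passed in explicitly to keep the sketch deterministic, and a plain list stands in for the batch archive.

```python
class TtlState:
    """Streaming-side state that forgets entries older than the TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, written_at)

    def put(self, key, value, now):
        self._data[key] = (value, now)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if now - written_at > self.ttl:
            del self._data[key]  # expired: drop from streaming state
            return None
        return value


def ingest(key, value, now, state, archive):
    """Write every event to the archive and to the short-lived state."""
    archive.append((key, value, now))  # full history for the batch side
    state.put(key, value, now)
```

Once an entry has expired from the streaming state, the archive is the only place left to look it up.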
If you have a full log, like in Kafka, you can run the whole streaming processing from the beginning or from a certain point in time. So this is the Kappa architecture.
Got it. [00:26:30] Okay. We have another question. He says: I understand that the data lake is being built in Databricks. Do you suggest using the same tech for the streaming speed layer, or do you suggest something like Azure Event Hubs?
So generally speaking, Event Hubs is great. We have good experience with Event Hubs in terms of, well, I guess it being fast and pretty low priced. But if you want to do [00:27:00] Delta, I think that sort of implies you want to be using Delta Lake. I get the point that it’s not as fast as if you stream via Event Hubs, but for a proper Delta-like architecture, I don’t think it would be easy for you to base that on Event Hubs per se. Piotr, do you want to add to that? Okay.
So I think I don’t have anything to add.
Nothing more to add. Okay. Next question. [00:27:30] So, how can the approach facilitate offline development and post-development reconciliation?
Well, I mean, which of the two approaches are we talking about specifically?
Should we go to the next question?
Or maybe just, yeah. Maybe we can ask Eric [00:28:00] to specify which part he meant, and then we can go back to this one.
Okay. Let’s do that. Eric, if you want to chime in on the chat, then we’ll come back. PRS: can you explain how you manage promoting your model from the experiment model registry to the PROD registry? Do you retain all the production data, not tokenized?
Interesting. From one registry to the PROD registry? So the basic idea is you’re able to, [00:28:30] say, for instance in MLflow, and sorry to tie the talk to a particular product, but that’s what I’m mostly familiar with, transition your model between different stages, right, and integrate that into your CI pipeline. So if the model passes the integration tests in your QA environment, then, and only in that case, you can transition it to your PROD stage and expose [00:29:00] it under a different URL. And that URL can then be used just by your PROD pipelines, right?
Piotr: Anyway, promoting anything from non-production to production should be a part of a deployment process and of preparing a package, or embedding this into some kind of pipeline promotion.
[00:29:30] Shall we go to the... I think we might’ve lost Emily here. Are we still online? Because I can probably go on answering further questions, especially as the next one is fairly important here.
Yeah. So maybe I’m going to read this question.
[00:30:00] Well, yeah. Someone asked if we can elaborate on the data pipeline going to the tokenizer and then the experimentation path, right? So the basic idea here is that your data pipelines running in PROD would produce the zones of your data lake, right? And then we can basically pull the data from those zones, pass it through the tokenizer, and then load it [00:30:30] to the sandbox environment only in the tokenized form. So then it’s open for experimentation without compromising the sensitive data, since that data has specifically been tokenized. I hope, John and Raul, that answers your questions.
The next one would be, how do you suggest managing the features in your analytical layer?
[00:31:00] The features. So I probably need to get that explained a little bit. Okay, they’re saying we’re still online. And yeah, I would probably need some clarification on what you mean by features.
Again, I would say the analytical layer would be handled somehow by the developer. It’s [00:31:30] not like a development environment. It’s a completely separate environment dedicated specifically to the analysts, for experimentation. So you are only building a model and exercising it.
I’m not a hundred percent sure what is meant by feature store. But in the projects I was involved in, we did [00:32:00] not really record features per se, since, for instance, MLflow does not highlight the exact combination of features you are using. We generally just used the functionalities that were straight up there in MLflow; you can use tagging, right? But yeah, it does not list features one by one. To some extent you can probably also [00:32:30] archive the scripts from particular runs, if, for instance, you’re using Spark notebooks or Jupyter. And maybe that could sort of work as a feature store. But I guess it’s a good point, and I will need to research that one, whether it’s possible to basically have your model tagged with the exact list of features you are using, right? But I don’t think there is a straight-up solution for that in things like, say, MLflow.
Yeah. Then we have a continuation, [00:33:00] about our experience with open-source feature stores in this area. So I think we need to research this topic more.
And then there was a, yeah, Paul: is there a suggestion for an open-source implementation of a Delta architecture data lake or Event Hubs?
Actually, the engine [00:33:30] itself, as implemented, comes from Delta Lake. However, there are Apache Iceberg, of course, and Apache Hudi, which could mimic or at least cover some of those features. But it is not like they are implementing the same standard. They cover the core.
For Delta, you can use Hudi, and I’m pretty sure Hudi is open source, right? So maybe that one. [crosstalk 00:34:02] [00:34:00] I haven’t done it with my own hands in this part.
The next question: do you have any suggestions for building and maintaining transformations to create the tables in the bronze or the gold layers of the data lake? Do you use a transformation DSL or GUI, or do you code this part?
So actually, [00:34:30] that’s a question I love to have. Well, I did both. And what I want to say is this: a DSL is fine. If you have a language which supports it nicely, like, say, Scala, I think a DSL is still manageable. But I did a GUI at some point, and I don’t want to say it failed miserably, but basically if you want to have, like, a GUI and fully.