Klemen Strojan: Hello, all. Thanks for joining this session. As mentioned, my name is Klemen Strojan, I’m stationed in Slovenia. And I work as a lead data engineer for the artificial intelligence and data science group at Knauf Insulation. Knauf Insulation is one of the world’s largest manufacturers of insulation products. We are present in more than 40 countries, and [00:00:30] have manufacturing sites in 15 of those countries. So the most important thing is why are we having this talk? And what am I trying to say today?
So when you’re working in a manufacturing company, clearly you’re not working in a software company. So a lot of your challenges are coming from the physical world. And if you really want to make a difference, if you really want to make use of the data, you need to use a [00:01:00] somewhat different approach. So the goal of this talk is for me to present our use case, and to maybe encourage you to take a similar approach to what we’ve done, because we really believe that it can be really successful.
So the agenda for today is, first of all, data. So what do we have? What flows in our pipelines? Then maybe the most important part, the people, followed [00:01:30] by technology. So how do we actually use everything? What are some of the pain points that you might stumble upon along the way? And of course, in the end, the use cases. So what have we done by now? And does it work?
So, what does flow in our pipelines? The most important data set in our company is the process data. If you imagine, we have almost 30 manufacturing facilities, [00:02:00] each of those has numerous production lines. Each line will have a few thousand sensors. Each of the sensors will generate data, maybe a few times a second, maybe a few times a day. But as you can see, there’s quite a lot of data actually. We are not in the actual big data yet. So we’re not talking petabytes of data, but still we’re out of the normal, let’s [00:02:30] say, data usage. So we call this somewhat medium data size.
Now enough about data. Let’s talk about what really matters. So let’s talk about people. So, who actually develops and who uses our solutions? So here we come to the concept of data ops. So this can be defined in different ways. We don’t talk about data ops as just, let’s say, developing solutions and also owning them. [00:03:00] So we think that’s too narrow of a definition. We think that data ops as a concept is a team that is a central team stationed somewhere between data producers, between IT, and between the users. So this is what we use in Knauf to describe a central data team that supports users and data initiatives across the whole company.
Our most important users are our so-called power users. [00:03:30] Important thing to note here is that these are subject matter experts. So we’re talking about business experts. They have some knowledge of Python. They know a little bit of SQL. They would use tools such as Dremio, Anaconda Enterprise and so on. And they would be creating reports, ad hoc reports, maybe one-time analysis. But most importantly, these are the people that are generating ideas.
Next step is digital engineers. This [00:04:00] is something that we’ve created inside of the company. We are still talking about subject matter experts with maybe more extensive skills in terms of Python and SQL. They do know a little bit of statistics and machine learning. The tools they use are similar to those used by power users. And what they do is they take the analysis a little bit further. So they would develop, let’s say, smaller proofs of concept. And when those proofs of concept [00:04:30] would be finalized, they would be also helping with the implementation of a solution in a manufacturing facility.
So now we come to the central people in the so-called data ops team. So who are the actual data specialists compared to business experts that I was just talking about? So here we have clearly data engineers, what data engineers do. Maybe important things to note [00:05:00] here is that they have to know at least something about distributed systems. They have to be skilled in data modeling and pipeline development. Of course, skilled in Python and SQL. They would use a much wider variety of tools. And their role is to enable all other users, such as power users and digital engineers to access the data and automate solutions.
Next come the data scientists. Again, we’re talking about the data expert [00:05:30] with more knowledge on statistics and machine learning. So he or she can, in fact, support power users and digital engineers on those topics. And they would be actually developing their own solutions, improving proofs of concept, and they would work closely with power users and digital engineers. Last, but definitely not least, we have software engineers. In our case, software engineers would tend to focus [00:06:00] on front end development. And they are the ones that are actually enabling our end users to use our tools, to use our solutions.
So, enough about the people. Now let’s talk technology. So what tools do we use? How does it all work? First of all, the data is stored in different locations. So we tend to [00:06:30] centralize everything on a data lake, yes. But in some cases, some of the source systems are not that heavily used, and we can query them directly. So what we would do, we would add some of the, let’s say sources, SQL databases, as sources to our federation layer. In other cases, we would replicate that data on the data lake. Of course, on the data lake, the data is mostly stored in parquet files, which goes [00:07:00] well along with Arrow.
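A minimal sketch of the federation idea described here, under stated assumptions: the standard-library `sqlite3` module stands in for the federation layer (it is not Dremio), two attached in-memory databases stand in for a source system and the data lake, and every table and column name is hypothetical.

```python
import sqlite3

# "main" plays the role of a lightly used source system queried directly;
# "lake" plays the role of the data lake. Both are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.production_lines (line_id INTEGER, plant TEXT)")
conn.execute(
    "CREATE TABLE lake.sensor_readings (line_id INTEGER, sensor TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO main.production_lines VALUES (?, ?)",
    [(1, "Plant A"), (2, "Plant B")],
)
conn.executemany(
    "INSERT INTO lake.sensor_readings VALUES (?, ?, ?)",
    [(1, "temp", 1400.0), (1, "temp", 1500.0), (2, "temp", 1300.0)],
)


def average_by_plant(connection):
    """Join the two sources in a single query, as a federation layer would."""
    query = """
        SELECT p.plant, AVG(r.value) AS avg_value
        FROM lake.sensor_readings AS r
        JOIN main.production_lines AS p ON p.line_id = r.line_id
        GROUP BY p.plant
        ORDER BY p.plant
    """
    return connection.execute(query).fetchall()


print(average_by_plant(conn))
```

The point of the sketch is only that the user writes one query and does not need to know which system physically holds each table.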
Okay. Now we know the people, we know how the data is stored. We know what the data is. But how is it accessed? So this is not trivial. I know people from software all know how to do this. But it’s not always trivial to implement this when you’re starting anew in a company. So mostly our users would connect to the data using Dremio query engine, either directly or indirectly. Most importantly, Dremio [00:07:30] acts as a federation layer. As I mentioned before, we are combining different sources there. And maybe most importantly, we really love Arrow Flight. Why? Because it enables you [inaudible 00:07:44] data really quickly. And they can iterate across their ideas fast. And that’s, I will explain this later on, but this, this is really important. You really want to have that speed.
Now how the data is processed when you get it. [00:08:00] So we use Anaconda Enterprise where we run Python. Python is our bread and butter. Almost everything is done there. When we would need some additional distributed computation that’s done outside of the federation layer, [inaudible 00:08:15] Dask. Pipelines and all, let’s say, finalized solutions would be packaged into a Docker image and run in Kubernetes. And to orchestrate everything, we use [inaudible 00:08:27]. So this, [00:08:30] in a nutshell is our architecture. It’s really simple. And we’re quite proud of that simplicity because it took us a while to get there.
So here I would just like to add a word on distributed systems. I mentioned before that we have medium-sized data. So at least at some point of data processing, you have to do something on multiple machines. You have to do some stuff in parallel. What we tend to do at Knauf Insulation, we try to [00:09:00] hide that part from the end users so that end users don’t need to really understand distributed systems, and they are not learning all the new and exciting ways for things to go wrong, as Kleppmann put it a while ago.
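One way to picture hiding the parallel part from end users is a plain function whose caller never sees the chunking or the workers. This is a sketch only: `concurrent.futures` from the standard library stands in for a real distributed engine such as Dask, threads stand in for machines, and the function names and default sizes are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def _sum_and_count(chunk):
    """Partial result for one chunk: (sum of values, number of values)."""
    return sum(chunk), len(chunk)


def parallel_mean(values, workers=4, chunk_size=1000):
    """Return the mean of `values`.

    The caller passes a list and gets a number back; splitting the work
    into chunks and combining the partial results happens in here.
    """
    if not values:
        raise ValueError("values must not be empty")
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(_sum_and_count, chunks))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


print(parallel_mean(list(range(10001))))
```

The design choice is that the combining step (sum of sums divided by sum of counts) lives inside the helper, so a power user only ever calls `parallel_mean(values)`.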
So we talked about the data, the people, the technology. That’s all great. But [00:09:30] of course, there are some pain points. And I think it’s worth discussing it because maybe you can take something from me, and don’t make the same mistakes again. First of all, this is all clear to people here at this conference, probably to software engineers, to skilled data engineers. But you have lots of moving parts. So when you’re building a data pipeline, you will have data that’s being generated every second, everything can change. [00:10:00] There are new sensors that are being added. You have updates, upgrades managed by different entities. And what you get, you get the butterfly effect. So one really small change somewhere can have a huge impact on your pipelines. And you really don’t want that. But truth be told, you will have it. So this will always happen.
Our solution here is to communicate this complexity. So communicate the complexity across your organization. [00:10:30] Explain this to all the participants in your data flows so that they understand how this works, and they will then be able to communicate maybe all the [inaudible 00:10:43], you can prepare and so on. Of course, there are ways to mitigate those risks. I already mentioned before that we use Docker, we use Kubernetes. So containerized solutions [00:11:00] are definitely a way to mitigate that risk. So to get an immutable image, that’s always the same. And that’s definitely the way you want to go.
Of course, separation of environments. Clear. It’s trivial, but still not everybody does it. We’ve learned the next part the hard way. So testing the assets sounds terrible, but really, you have to do it. It’s the only way. And as said before, try to communicate as much as [00:11:30] possible. So, establish good relationships with people that are not in your team. So, with people from IT, with data producers, with data consumers. Get that relationship up and going. And of course, data pipelines will always fail. We all know that. The key is how many times, at what price, and how quickly are you able to get back online?
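As one illustration of the kind of test that catches a small upstream change (a new sensor column, a changed type) before it spreads through a pipeline, here is a minimal schema check. The expected columns and record layout are assumptions made for the example, not the schema used at Knauf Insulation.

```python
# Hypothetical expected schema for one batch of sensor records.
EXPECTED_COLUMNS = {"timestamp": str, "sensor_id": str, "value": float}


def check_batch(records, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable problems found in a batch of dicts."""
    problems = []
    for index, record in enumerate(records):
        missing = expected.keys() - record.keys()
        extra = record.keys() - expected.keys()
        if missing:
            problems.append(f"record {index}: missing columns {sorted(missing)}")
        if extra:
            problems.append(f"record {index}: unexpected columns {sorted(extra)}")
        # Type check is strict on purpose: an int or a numeric string
        # where a float is expected is reported as a problem.
        for name, expected_type in expected.items():
            if name in record and not isinstance(record[name], expected_type):
                actual = type(record[name]).__name__
                problems.append(
                    f"record {index}: {name} is {actual}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


batch = [
    {"timestamp": "2021-01-01T00:00:00", "sensor_id": "s1",
     "value": "12.5", "unit": "C"},
]
for problem in check_batch(batch):
    print(problem)
```

A check like this can run at the start of a pipeline in every environment, so a failed batch is reported with a reason instead of breaking a later step.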
Of course, we [00:12:00] couldn’t finish pain points without talking about technical debt. When you start building the team, when you start delivering solutions, of course you have to deliver. And you have to deliver quickly. If not, you’re in a bad spot. And that’s understandable. But just make sure that while you progress, try to iterate across your solutions and try to reduce that technical debt. [00:12:30] So make sure that a data team or data ops team has enough time and resources that they can spend on reducing that technical debt. That’s really important.
And another thing is, how do you measure if you’re progressing? So how do you know that you’re reducing the pain points or the things that troubled you before? This is taken from Anderson’s book. He talks about [00:13:00] velocity. So, your delivery of solutions must increase with time. So the speed of delivering solutions must increase with time. And this is a true measure of success. With your first project, of course, a lot of things will go wrong and you will have a lot of obstacles to tackle. But while you progress, you shouldn’t be tackling the same obstacles over and over again. So that’s a lesson we’ve learned. And this is our true measure of success.
[00:13:30] So we’ve talked about all of the points mentioned before. And yeah, in theory, it all sounds great. But does it actually work? Well, it does. We have quite a few solutions in place that are actually almost running in production, or running in production, so to speak. And the first one I would like to mention here, and I will be really brief on these use cases. [00:14:00] And please bear with me. This is a bit specific to our manufacturing process. I’ll try to explain as much as possible, but here goes.
The first one is predictive maintenance. This is quite clear. So there’s a lot of assets in our plants. Those assets are worth a lot of money. Keeping those assets in stock costs a lot of money. And what we’ve done, we’re trying to predict, and we are predicting actually, [00:14:30] failure of those specific assets in our plants. So what that does, for example, you would know when to replace an asset. Okay. That’s clear. That’s, let’s say, direct lowering of costs. But for us, it’s also important to keep only a few of those parts in stock. So you know which plants will fail first, and you know that quite early on. [00:15:00] And so you can stock those assets close to those plants. So that’s one, let’s say, nice solution.
Another thing is the so-called spinner tracking system. And what a spinner is, a spinner is a key component of a glass mineral wool production process. So if you imagine, we are melting rocks or we are melting glass, and then it’s a really interesting process. The lava comes out of the furnace. [00:15:30] And that molten rock, that melt, is falling down into a spinner where fibers are forming. And those fibers are basically creating insulation, which you can then use. And what we’ve done, we’ve taken that key part of our production process and put together all the data that there is connected or relevant to that process. So let it be process data, external data, [00:16:00] data from other parts of the company that, let’s say, affects that process.
And we’re not doing any machine learning here, not even any advanced statistics yet. But just creating that data set, putting it in place for users to use, to perform, let’s say, ad hoc analysis. I mean, it’s really useful. And this can make [00:16:30] a change. And this can be a really nice win when you’re starting up a company or a team. So this is something we’ve learned here. So you don’t need to create really complex systems, you can do a simple solution. If it brings value, it’s worth doing.
Another interesting thing is our anomaly detection systems. So I’ve mentioned before that [00:17:00] we’re melting rocks. So if you imagine, you can’t just stop that process. And also it takes a while in that process before a change comes into effect on the actual product. And what we want to do here is we want to detect anomalies before they have an effect on production. That does not necessarily mean that we are predicting anomalies because [00:17:30] for some, let’s say, sensors, or for some parts of the process, you can maybe detect an anomaly when it actually happens, but you will still have enough time to react. And you will do some changes, you will correct some things, and there will be no effect on production. And this is really an important part of our use cases because you can apply it to numerous different sensors. And [00:18:00] we’re just starting to see the tip of the iceberg. The ways in which we can use this system are just huge.
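The talk does not say which anomaly detection method is used. As a hedged illustration of detecting an anomaly when it happens, rather than predicting it, here is a simple rolling z-score detector over one sensor stream. The window size, threshold, and sample readings are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=20, threshold=3.0):
    """Return indices of readings that deviate strongly from recent history.

    A reading is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` readings.
    """
    history = deque(maxlen=window)
    anomalies = []
    for index, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(index)
        # The reading joins the history whether or not it was flagged.
        history.append(value)
    return anomalies


# Hypothetical sensor stream: small periodic variation plus one spike.
readings = [100.0 + (i % 5) * 0.1 for i in range(40)]
readings[30] = 110.0
print(detect_anomalies(readings))
```

Because the detector only needs the recent history of a single sensor, the same function can be applied to many different sensors, which matches the point made above about reuse.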
And last but not least, when it comes to use cases, self-service analytics. You’ve probably heard a lot about self-service analytics, very popular topic in the recent months, for [00:18:30] example. But why is it important, and what are we actually doing with it? So we’re using automated data profiling and Jupyter notebook sessions with preloaded data. So why is this important? Just imagine, I mentioned earlier that you want your power users, your digital engineers, to be able to iterate quickly through ideas. So what we give them, there’s a catalog of data [00:19:00] that’s available. And somebody has an idea. A user has an idea. He or she will go to that catalog, find the relevant data set, run an automatic data profiling that would take them just a few seconds, check if this is the data that is useful for that idea, click on a button near the data that will launch a Jupyter notebook session with data already being preloaded, [00:19:30] so using Arrow Flight, in a matter of seconds. And that person can already start performing exploratory data analysis.
So in a matter of seconds, with a few clicks, you can already see if the data exists. And maybe in a few minutes you will say, okay, it’s completely different than I imagined. This idea will not work. Let’s go to the next one. Or maybe you will say, oh, there is no data set that I would be able [00:20:00] to use. And in that point, you would come back to the team and we would create that data set for you. So this speed is really, really important.
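To make the automated profiling step concrete, here is a minimal sketch of what such a profile might compute per column. The tool actually used is not named in the talk; the function, the list-of-dicts layout, and the sample rows are assumptions.

```python
from statistics import mean


def profile(rows):
    """Summarise each column of a data set given as a list of dicts."""
    names = sorted({name for row in rows for name in row})
    report = {}
    for name in names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        summary = {
            "count": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Numeric statistics only when every present value is a number.
        if numeric and len(numeric) == len(present):
            summary.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        report[name] = summary
    return report


rows = [
    {"line": "L1", "temp": 1400.0},
    {"line": "L1", "temp": None},
    {"line": "L2", "temp": 1500.0},
]
for column, summary in profile(rows).items():
    print(column, summary)
```

A summary like this is enough for a user to decide within seconds whether a data set is worth opening in a notebook.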
So what’s next? The plan is to keep moving forward. Of course, we’re still finalizing our production environment. So there’s still a few things to learn. We want to measure as much as possible internally and so on. We’re growing data teams, not just in terms of headcounts, [00:20:30] but more importantly, in terms of knowledge and skill. We’re running a data governance initiative so that practically each data point that’s being generated in the company will be owned by some specific person in the company. And by that, at some point, of course, we also want to achieve data remediation.
Now what’s the key takeaway that I want you to remember from this whole talk? [00:21:00] Start from the business. So in your company, find business experts because it’s more difficult to understand the business than to learn simple data analysis. Find business experts that have some interest in analytics. And believe me, you will find them, and then create a team of data specialists that will support them. This will guarantee that the implemented solution will come from the business, will be accepted, [00:21:30] and will have the desired impact. Thank you very much for listening.
Speaker 2: Thanks, Klemen. Let’s go ahead and open up for Q&A. If you have questions, use the button in the upper right to share your audio and video. You’ll be automatically put into a queue. If for some reason you’re having trouble with that, you can just always ask your questions via chat. And I think that some people have already been asking. So let me go ahead and start this. [00:22:00] A question came in from Julia. On the topic of self-service analytics, is there any data that’s sensitive? If yes, what data protection methods do you have in place?
Klemen Strojan: In our case, there is some sensitive data. We’re not talking sensitive in terms of data from users and so on. But still, we have some [00:22:30] stuff that we don’t want to expose. And basically what we do is, we would expose only the data sets that you are allowed to see. So when you would come to the data catalog page, you will sign in and you would only see what’s been granted to you based on the groups you are part of and so on. So of course, yeah, you have to limit that.
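The group-based visibility described in this answer can be sketched as a simple filter over catalog entries. The catalog contents and group names below are hypothetical, and a real catalog would take group membership from the sign-in system rather than from a function argument.

```python
# Hypothetical catalog: data set name -> groups allowed to see it.
CATALOG = {
    "process_data": {"engineering", "data_team"},
    "spinner_tracking": {"engineering"},
    "finance_costs": {"finance"},
}


def visible_datasets(user_groups, catalog=CATALOG):
    """Return the data sets a user may see, based on group membership."""
    groups = set(user_groups)
    return sorted(name for name, allowed in catalog.items() if groups & allowed)


print(visible_datasets(["engineering"]))
```

A user with no matching group sees an empty catalog, which is the safe default for sensitive data sets.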
Speaker 2: Thank you. I don’t see any more questions being asked. Does anyone else have [00:23:00] any questions for Klemen? Oh, which data catalog do you use?
Klemen Strojan: Oh yeah. When it comes to data cataloging, there’s a lot of solutions, if I’m completely honest. [00:23:30] At the end, we’ve just developed our own. That was the thing that worked for us. So we’re using a custom solution.
Speaker 2: Paige also has asked, could you talk more about the data governance journey?
Klemen Strojan: Yeah. Well, that’s a great question. We have much to learn there, if I’m completely honest. So we’re just starting that journey. What we think is, it’s a delicate topic. And we have to be really, [00:24:00] let’s say, you have to take the right approach. So you have to encourage and present this to all the users in the company in the right way. So don’t force them into anything. Try to give them knowledge, try to give them some positive feedback why this would be useful for them. That would be my advice. And maybe [00:24:30] start with data quality. That could also be it. If you believe that data quality is part of data governance, which I think it is, start with that. So first you have to measure what you have, then you can start fixing it.
Speaker 2: Great. Does anyone else have any questions for [00:25:00] Klemen? Okay. Nelish is asking, do you design your pipeline to be independent of de-buggability?
Klemen Strojan: We try to. We’re not always successful. But we try to do that.
Speaker 2: And then Paige has also asked, are you streaming [00:25:30] PII data into your data lake?
Klemen Strojan: Not yet. So everything I was mentioning is purely batch. So the streaming initiative is happening. Not at liberty to discuss it yet. But maybe next year we can discuss that. But currently we’re doing batch pulls and batch processing.
Speaker 2: Tom has asked, what’s your view on data virtualization and/or data mesh approach?
Klemen Strojan: [00:26:00] If we’re talking data virtualization, if by that you mean that you don’t materialize everything, so that you have more or less views and queries on top of your physical data, one copy, then yes. That’s definitely the way to go.
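The idea of views and queries on top of one physical copy can be illustrated with an ordinary SQL view. This is a sketch only: `sqlite3` from the standard library stands in for the virtualization layer, and the table, view, and sensor names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One physical copy of the raw data.
conn.execute("CREATE TABLE raw_readings (line_id INTEGER, sensor TEXT, value REAL)")

# A virtual data set: nothing is materialized, the view is just a stored query.
conn.execute(
    """
    CREATE VIEW furnace_temperature AS
    SELECT line_id, value AS temperature
    FROM raw_readings
    WHERE sensor = 'furnace_temp'
    """
)


def furnace_rows(connection):
    """Read the virtual data set; it always reflects the current raw data."""
    return connection.execute(
        "SELECT line_id, temperature FROM furnace_temperature ORDER BY line_id"
    ).fetchall()


conn.executemany(
    "INSERT INTO raw_readings VALUES (?, ?, ?)",
    [(1, "furnace_temp", 1450.0), (1, "belt_speed", 2.5)],
)
print(furnace_rows(conn))
```

Rows added to the raw table later show up in the view without any copy or reload step, which is the property that makes this approach attractive for a small team.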
Speaker 2: Okay. Then follow up. Thomas has asked, if you had this [00:26:30] case, how did you convince the management that the data quality is not good enough and that this should be changed? It’s often hard to force a change if the result is not clear.
Klemen Strojan: In our case, it was, I won’t say easy. It was not [inaudible 00:26:46]. What we did is we created a few solutions. And when you do those solutions, first of all, you will have bugs in your code. You will solve that. Then you will have bugs maybe in some other parts of the pipeline or [00:27:00] the tooling. But at the end, the users will experience the problems that come from data. If you have data quality issues, you probably have them, you will experience them, and your tools won’t be delivering the value that you would like to. And by that, we’ve communicated back and said, “Okay, look. Here, this is missing because this was not created properly. Or this is not checked. Or this is not validated.” [00:27:30] So go with the use case. And really do measure it. Get to your management with the solution, not just with the problem. Tell them what you will do. Tell them the whole strategy.
Speaker 2: Tom had a follow-up question. Tom’s followup is, do you use EDW for structured data in combination with data lake for raw unstructured data, or do you put everything into an unfiled data store, or [00:28:00] unified?
Klemen Strojan: Yeah. Great question. So we don’t use a classical data warehouse, if that’s what you’re asking. So we have [inaudible 00:28:13] in a data lake. And then we would do virtualization on top of that. So we have a separate copy, but we wouldn’t do the classical data warehouse. And we wouldn’t remodel all of our data. Because [00:28:30] the fact is, our team is just too small to achieve something like that at this stage. And what we’ve done, we’ve granted access to raw data to people. And also, we’ve given them a lot of virtualization. So a lot of virtual data sets that are really useful for them. That was our solution.
Speaker 2: And are you happy with the performance if you’re using virtualization?
Klemen Strojan: Okay. Yeah. We’re using Dremio. [00:29:00] And yeah, for now it’s working really well. I must say that we don’t have that many users yet, so we’re not talking hundreds of users. We’re somewhere below 100 users, let’s say. And for us, that works perfectly fine. Of course, you can always scale up. But then you come to the question of costs.
Speaker 2: And here’s one more, last question. Do you have some zoning implemented on your lake?
Klemen Strojan: [00:29:30] I don’t understand exactly what zoning in this case would mean. If it’s some separation of environments or separation of, let’s say, business entities, then yes.
Speaker 2: And then, okay. I’ll do one more. How do you find feedback from non-tech users who don’t know SQL or Python?
Klemen Strojan: So we try to promote learning. So we do workshops. [00:30:00] We encourage users with online courses and so on. So we try to teach them SQL. Because for basic use, it’s really simple. And you have very competent and capable people in your company. So try to do that. We didn’t have that problem, if I’m completely honest. And if you do, you can always grant them access through Excel or something like that, just to iterate [00:30:30] quickly through ideas.
Speaker 2: Perfect. That’s all the questions we have time for this session. If you didn’t get your questions answered, you’ll have an opportunity to ask Klemen your questions in the channel in the Subsurface Slack. Before you leave, we’d like you to fill out the super short Slido session survey on the right. Let us know. And the next sessions are coming up in five minutes. Thank you everyone. And have a good day.
Klemen Strojan: Thank you all.
Speaker 2: [00:31:00] Thanks Klemen.