May 2, 2024
Hadoop Modernization for Financial Services with Nomura
Learn more about a real-world path to analytics and data lake modernization from one of Dremio’s oldest and largest customers. Join us for a deep dive into Nomura’s path to Hadoop migration with Dremio, focusing on data lake modernization through the transition from Hadoop to Kubernetes with MinIO for storage. Gain firsthand knowledge from industry leaders to help you identify best practices for migration and to position your data infrastructure for unparalleled efficiency in the modern era.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Connor Brennan:
I’d like to say a big thank you to Mikhail and Kevin, who are the leads of our data infrastructure team. So today, we’re going to talk a little bit more about data lake modernization at Nomura. I’m going to cover a couple of different themes here. But before I do that, I just wanted to set the context a little bit and explain what we do, because maybe some of you don’t know about Nomura. Nomura is a Japanese investment bank. I work in risk IT, so we’re responsible for generating the market and the credit risk for all of the trading at Nomura. That means that we have to source all of the trades, sensitivities, and reference data. That is a very large data set: about 1.8 billion rows per date, which we’ve materialized into a very wide table of about 700 columns. So it’s a large, large data set. That’s our input. We have outputs, which we put back into our Hadoop infrastructure as well.
Challenges
So with that, let me start. About five years ago is when we first put Dremio into production. We had been using relational databases, some of which we still have today, but it was very clear to us that these were not going to be able to sustain the additional load that we could see coming. And at that point, we only had one production environment. So we looked around, and we settled on Dremio. Now, it’s been, as I said, a bit of a journey. Two years ago, we started our data infrastructure team, which we’ve been building out. That’s led by Kevin and Mikhail here. Two years ago, we were on a fairly old version of Dremio; we were on 15.x then. We had a number of production issues, which I mention here, and at that point in time it was hard for us to diagnose them. We didn’t really have the tools to introspect the system. And some of these led to very serious outages. And because the risk that we produce is subject to regulatory reporting, we have very tight SLAs. So this is a big deal, basically. We have to hit these SLAs with certainty.
We had what I would call a rudimentary Grafana dashboard at that point in time. That actually originated from Dremio itself. But it didn’t really cover any executor-level metrics; it just gave you a rather simplistic view onto the coordinators. Every time we would do an upgrade of Dremio, it was pretty fraught. Other people who run Dremio probably know this, because it was very hard for us to test properly, to have a regression test infrastructure that actually covered what we had in production. We weren’t able to emulate production-like workloads. Fundamentally, we didn’t understand what queries were running in production. And that’s a real problem with a system this size. So those were some of our challenges.
Workloads
So today, we run about half a million queries on a daily basis through our Dremio infrastructure. In terms of the data sets that come in every day, we have around 500, and the entire system is event-based. As each one of those comes in, it goes down into our data layer, our ETL runs, and we incrementally produce these new external reflections; I’ll sketch that flow below. We do that ourselves because, at the time, an incremental capability wasn’t available within Dremio. As for the size of the cluster that we run, I’ll touch on this a little bit more: we don’t run one coordinator, we run three, and that’s quite important, and we have 35 executors in prod. So that gives you a sense of how our Dremio nodes look.
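To make that flow a bit more concrete, here is a rough, purely illustrative sketch of the event-driven pattern described above: a dataset lands, an external ETL step materializes it, and Dremio is then pointed at the new materialization as an external reflection. The handler, table names, and the external-reflection DDL are assumptions (the DDL follows Dremio's documented external-reflection syntax, but should be checked against the version in use), not Nomura's actual pipeline.

```python
def on_dataset_arrival(dataset_id: str, run_etl, run_sql):
    """Illustrative event handler: each incoming dataset is materialized by an
    external ETL step, and an external reflection is then registered in Dremio.
    `run_etl` and `run_sql` stand in for whatever the real pipeline provides."""
    # 1. Incrementally materialize the new slice outside Dremio (hypothetical step).
    materialization = run_etl(dataset_id)  # e.g. "warehouse.sensitivities_20240502"

    # 2. Register that table with Dremio as an external reflection.
    #    DDL shape follows Dremio's external-reflection syntax; verify per version.
    run_sql(
        f"ALTER DATASET risk.sensitivities "
        f"CREATE EXTERNAL REFLECTION refl_{dataset_id} USING {materialization}"
    )
```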
What does that run on? So our hardware that we run on is on Hadoop, so we have the co-location built in by default. Really recently, about a year ago, we upgraded our version of Hadoop, and at the same time, we moved to a new cluster. So this hardware is pretty new, and it’s reasonably chunky, it’s about 18 terabytes of memory, about 3.4 petabytes of disk in prod, same in DR, and then half in non-prod and staging. You’ll see later why I’m showing that.
So what have we done, really, in the last two years to make things better? Because this is the key, and I think this is where some of the key innovation has really happened from my perspective, that ultimately has led to a much more stable environment. One of the things that we’ve evolved is our Grafana dashboard. We now have, again built out by Mikhail and Kevin here, executor-level metrics, which give us much, much more insight into what is going on in a distributed sense in the system. So we can see things like memory, I/O, what’s queuing, and planning times. Then there are two things that are closely linked. I mentioned earlier that we didn’t really have insight into what was running, and that’s a really quite hard problem. So we’ve taken a two-pronged approach to that. The first part is that, because we have some control over the queries that go into Dremio, we have created a metadata layer, which is very simple. You can see it here on the right side. It’s basically metadata inserted into the SQL, but in a structured format.
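As a rough illustration of what a structured metadata layer like this might look like, here is a minimal sketch that prepends a machine-readable header to the SQL text before submission. The field names are illustrative only, not Nomura's actual schema.

```python
import json

def tag_query(sql: str, client: str, subquery: str, run_id: str) -> str:
    """Prepend a structured, machine-readable metadata header to a SQL query.

    The metadata travels with the query text, so it can later be recovered
    from Dremio's query logs and attributed back to the originating client.
    Field names here are illustrative only."""
    meta = {"client": client, "subquery": subquery, "run_id": run_id}
    return f"/* META {json.dumps(meta)} */\n{sql}"

print(tag_query("SELECT * FROM risk.sensitivities", "var-engine", "eod-load", "2024-05-02-001"))
```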
Within Dremio, I’m sure you’re familiar with the logs UI; below that, you’ve got queries.json, which has a lot of information about every single query. So we harvest that. We take that from prod, and we then run a mini ETL over it. Part of what’s in there is the query text, so we can also extract this metadata. And because it’s structured, we can understand which clients the queries originated from, and within a client, depending on how much metadata they include, maybe which subquery it is. All of this goes into our analytics store. Now we have things like planning time and execution time, and you can imagine we build up a history of this. So this gives us insights into how things change. We do a release with Dremio, and suddenly things go south. We can see what query patterns have changed, on average: has the execution time gone up, has the planning time gone up? It gives us insight that, fundamentally, we never had before. That’s one element. Equally important is that we use the same data to replay the same queries, in the same chronological order that they occurred in production, in non-prod.
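Here is a minimal sketch of what a mini ETL over queries.json could look like, picking the embedded metadata header back out of the query text. Field names such as "queryText", "start", and "finish" are assumptions and may differ between Dremio versions.

```python
import json
import re

META_RE = re.compile(r"/\* META (\{.*?\}) \*/", re.DOTALL)

def parse_queries_json(path: str):
    """Mini ETL over Dremio's queries.json: one JSON record per line.

    Field names ("queryText", "start", "finish", "queryId") are assumptions;
    adjust them to the actual log schema of the Dremio version in use."""
    rows = []
    with open(path) as fh:
        for line in fh:
            rec = json.loads(line)
            sql = rec.get("queryText", "")
            m = META_RE.search(sql)
            meta = json.loads(m.group(1)) if m else {}
            rows.append({
                "query_id": rec.get("queryId"),
                "client": meta.get("client"),
                "subquery": meta.get("subquery"),
                "start_ms": rec.get("start"),
                "finish_ms": rec.get("finish"),
                "sql": sql,
            })
    return rows
```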
So we’re able to have a very, very good regression pack that emulates exactly what we saw in production. The combination of those two things has helped massively in detecting issues before we go into production. Has it detected all of them? No. As I showed you in the earlier slide, we’re about half the size in non-prod that we are in prod, and the other thing is we have less history. So those two things have caused us some issues as well.
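A workload replayer of the kind described could be as simple as the following sketch: take the harvested query records (for example, the output of the parse_queries_json sketch above), sort them by start time, and submit them in order while preserving the original gaps. The `execute` callable is deliberately left abstract (ODBC, JDBC, or Arrow Flight SQL would all work); nothing here is Nomura's actual tooling.

```python
import time

def replay(queries, execute, speedup: float = 1.0):
    """Replay harvested production queries against a non-prod cluster in the
    same chronological order, preserving (optionally compressed) inter-query
    gaps. `queries` is a list of records with "start_ms" and "sql" fields;
    `execute` is any callable that submits SQL to the target cluster."""
    queries = sorted(queries, key=lambda q: q["start_ms"])
    prev = None
    for q in queries:
        if prev is not None:
            gap_s = max(0.0, (q["start_ms"] - prev) / 1000.0) / speedup
            time.sleep(gap_s)          # keep the original pacing between queries
        execute(q["sql"])
        prev = q["start_ms"]
```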
So this is just a graphic of the Grafana dashboard. On the left-hand side, you can see what’s queuing. On the top right-hand side, this is what’s running; I’m not sure if you can see it, but we peak at up to 200 concurrent queries. This also shows our three coordinators. At the very, very bottom, that’s the master coordinator. We’ve deliberately moved everything off it, because a single master coordinator is essentially a single point of failure. Doing this has increased stability massively. Previously, when we had just one coordinator, we had a lot of problems with it getting maxed out. So this has allowed us to scale out in a more predictable way.
Workload Management
Okay. Workload management. I touched on the self-analytics already, so I won’t go through that again. The coordinators, as I just mentioned: we have a master and we have two scale-out coordinators, with a load balancer that sits in front of them and distributes the load. That’s helped us a lot. We’ve also changed how we use engines. In the beginning, we just had one engine. So what’s an engine? An engine is a logical group of executors below it. What we’ve now done is to create multiple engines that have different sets or groups of executors, so it’s really workload segregation. That, combined with applying tags to queries from clients, allows us to dynamically route incoming queries to different engines via rules, on the fly; the sketch below shows the idea. So if we have a case where a particular query pattern is starting to overload an engine, we can move it on the fly. There’s finer-grained control too, like specifying concurrency and memory limits, but that’s more fine-tuning. This segregation of workload has really helped with our stability as well. Okay. So I mentioned we’re on Hadoop.
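Purely as a conceptual model of that tag-based routing (the real rules live inside Dremio's workload management configuration, and the tag and engine names here are made up), the idea boils down to a first-match rule list:

```python
# Conceptual model of tag-based routing: first matching rule wins.
# Names are illustrative, not Nomura's actual tags or engines.
ROUTING_RULES = [
    (lambda m: m.get("client") == "eod-batch",            "engine-batch"),
    (lambda m: m.get("subquery", "").startswith("adhoc"), "engine-adhoc"),
    (lambda m: True,                                       "engine-default"),
]

def route(meta: dict) -> str:
    """Return the engine a query should run on, given its metadata tags."""
    for predicate, engine in ROUTING_RULES:
        if predicate(meta):
            return engine
    return "engine-default"

print(route({"client": "eod-batch"}))   # -> engine-batch
```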
So, some of the limitations with Hadoop. Fundamentally, it’s pretty inefficient in how it stores data: you’ve got three copies, so you only get to use a third of your addressable disk space. It’s a monolithic stack; everything is tightly bound. The DevOps is expensive. It’s an old technology at this point, and it’s very hard to get new people to want to work on it. And fundamentally, it’s incompatible with the whole paradigm of modern cloud computing, where there’s separation of compute and storage.
Data Platform Modernization
Okay. So with that in mind, the other thing I should mention is that between market and credit risk, I would say we’re probably about halfway through moving our entire stack onto the cloud, and for us that’s AWS. So half of our code base interacts with a file system that is S3, and the other half is HDFS or POSIX. So there’s this bifurcation. It was pretty clear to us that we would like to have an S3-compliant file system on prem, and so we started to look at what the possibilities were. We had certain constraints. Today, we run Dremio very, very hot; it’s pretty much maxed out the entire time, and that leverages the co-location on Hadoop. So one of our core constraints, or the core constraint, I would say, was that Dremio needs to be at least as performant as it is today on whatever new data infrastructure we try to build. We also needed to evaluate stability. Okay.
So we had two potential routes we could have taken at Nomura. Internally, we already have Hitachi HCP, which is an appliance that provides S3. And then we also explored MinIO. Now, this maybe sounds a little bit obvious, but it wasn’t immediately obvious to me: implicit in the segregation of compute and storage, if we think of it from a Dremio perspective, is that for it to work you’ve got to have an ultra-fast network. And when we looked at the HCP appliance, because it’s an appliance, the Dremio nodes sit on physically separate servers. The network that we would need internally between the two, we just didn’t have, and it would have been literally tens of millions of dollars to support that. That’s the reason I showed you the hardware we have at the beginning; it’s relatively new. So we wanted to see: can we repurpose that hardware and co-locate MinIO and Dremio on the same hardware, using Kubernetes to scale out Dremio? This is all on-prem. So this is not necessarily a paradigm that MinIO had seen before, but we worked with them on it, and ultimately we got their sign-off on it as a reference architecture. We have this now running in dev and staging, and next month we’ll be releasing our first workloads to production.
For our benchmarking, we basically took four nodes out of our Hadoop cluster in non-prod, with the disk still on them, and rebuilt them as MinIO nodes, and we ran the same workload using the workload replayer I mentioned earlier. We were able to show basically a 13.9% improvement, north so to speak, for MinIO versus Hadoop. So the main constraint, of it being at least as performant, we were able to meet. It has many other benefits for us too. One that I touched on earlier: Hadoop has the three-factor replication. MinIO uses a thing called erasure coding, which is a little bit similar to RAID; long story short, you get 75% of the addressable disk space versus 33%. So you’re getting more than double the usable disk space straight off. For us, that’s a huge benefit as well. I’m not even talking about licensing fees; this is just a pure physical disk comparison. Okay.
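Where those 33% and 75% figures come from is simple arithmetic. The specific 12-data / 4-parity MinIO erasure-set layout below is an assumption used only to illustrate the ratio, not Nomura's actual configuration.

```python
def usable_fraction(data_shards: int, parity_shards: int) -> float:
    """Usable fraction of raw disk under erasure coding."""
    return data_shards / (data_shards + parity_shards)

# 3-way replication on Hadoop: one usable copy out of three written.
print(round(1 / 3, 2))            # 0.33 -> "a third of your addressable disk space"

# An assumed MinIO layout of 12 data + 4 parity shards per erasure set.
print(usable_fraction(12, 4))     # 0.75 -> "75% of the addressable disk space"
```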
Hybrid Data Platform
So this is ultimately what our hybrid data platform is going to look like. We have on-prem and cloud, with an S3 fabric that spans both. We’ve tested this with our existing processes that were already running on AWS, now running on-prem: zero code changes, totally seamless. And we will be using Kubernetes, or Amazon EKS, to orchestrate our compute. So we’re going from what I think is a pretty much defunct technology to a very forward-looking technology that, importantly, actually supports a true hybrid of on-prem and cloud. For us, that’s super important, because we’re going to have large parts of our data on-prem for some time. And also, as I touched on, we run Dremio super hot, so I don’t think we could actually save any money by running it in the cloud, because we just max out the cluster that we have. So this works for us. Okay.
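The "zero code changes" point is worth illustrating: because MinIO speaks the S3 API, the only thing that differs between on-prem and AWS is the endpoint the client is pointed at. The endpoint URL and bucket name below are placeholders, not Nomura's.

```python
import boto3

def s3_client(on_prem: bool):
    """The same S3 client code serves both halves of the hybrid fabric; only
    the endpoint differs. Endpoint URL and credential handling are placeholders."""
    if on_prem:
        # MinIO on-prem exposes the same S3 API at a local endpoint.
        return boto3.client("s3", endpoint_url="https://minio.internal.example:9000")
    return boto3.client("s3")  # AWS S3 proper

client = s3_client(on_prem=True)
# Identical calls either way, e.g.:
# client.list_objects_v2(Bucket="risk-data", Prefix="sensitivities/2024-05-02/")
```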
The Lakehouse Journey
The lakehouse journey. I mentioned that we have been using Dremio for five years. What that has meant is that there was, quite frankly, a lot of functionality that we needed that didn’t exist in the core product, so we ended up building it ourselves. The main reasons we went with Dremio were its openness and the fact that it’s lightning fast. If you build your materialized views, what we call a raw reflection, and optimize them for querying, they go like lightning. And that’s basically what we did. But it didn’t have incremental reflections, and it didn’t have any versioning capability, so all of this we’ve had to build ourselves. And frankly, that’s a lot of engineering. We’ve got it to work 90-plus percent of the time, but when it goes wrong, it goes wrong spectacularly, and it’s just hard to support.
So for us, Iceberg has immense benefits. I mean, it finally brings write semantics to Dremio. We still have Oracle and HBase, and this will ultimately allow us to decommission those. But importantly, we have Spark and we have Dremio, and we’ll now have a single catalog that both of those can coexist on seamlessly. The versioning is huge for us. Let me try to explain how that works. I mentioned earlier that we have around 500 datasets that come in every day. The model prior to Iceberg is that everything is effectively append-only, so somehow we have to point at what the latest dataset is. There are different ways you could do that, but how we chose to do it is by systematically updating the VDS every time these come in. So you can imagine there’s a kind of big IN-clause embedded in it, which references a unique ID in every one of these datasets; there’s a sketch of that pattern below. Again, that’s a lot of engineering, because it’s not just the raw reflection: we have multiple aggregate reflections that hang off all of this. And that introduced another problem, because we tried to make the whole thing essentially atomic. If we have the base and we have these aggregates, it’s very important that a query hitting the base and a query hitting the aggregate get the same result. But that’s sort of impossible to do.
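To make that pre-Iceberg pattern concrete, here is a hedged sketch of the "update the VDS with the latest batch IDs" step. The table, column, and view names are hypothetical, and the exact DDL keyword (CREATE OR REPLACE VDS versus CREATE OR REPLACE VIEW) depends on the Dremio version.

```python
def rebuild_latest_vds(batch_ids: list[str]) -> str:
    """Regenerate the virtual dataset that points at the 'latest' rows of an
    append-only table by embedding the current batch IDs in an IN-clause.
    All names are hypothetical; with Iceberg, the current table snapshot makes
    this whole step unnecessary."""
    id_list = ", ".join(f"'{b}'" for b in batch_ids)
    return (
        "CREATE OR REPLACE VDS risk.latest_sensitivities AS\n"
        "SELECT * FROM risk.sensitivities_raw\n"
        f"WHERE batch_id IN ({id_list})"
    )

print(rebuild_latest_vds(["ds-20240502-001", "ds-20240502-002"]))
```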
Again, a lot of engineering went into this. And with Iceberg, because when you ingest you get the history, this totally goes away: because you have full history, by definition you have the current. From our perspective, there is a multitude of ETL that we’re going to be able to roll back, and that will ultimately translate, for us at least, into making the system more stable. The other area is metadata management. You can imagine we’ve got the base and, say, 20 aggregate reflections, and we’re having to do all the metadata refreshes ourselves for this. It’s very complex and very brittle. We ran into issues with doing it concurrently, and so we had to build a single centralized process with throttling and conflation embedded within it. It’s a lot of engineering, and most of the time it works, but it still has a lot of edge cases. And again, when it goes wrong, it goes wrong spectacularly.
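For flavour, the conflation-plus-throttling idea behind that centralized refresh process can be sketched roughly as below. This is purely illustrative of the pattern, not Nomura's implementation, and the refresh callable is left abstract.

```python
import threading
import time

class ConflatingRefreshQueue:
    """Illustrative sketch: concurrent refresh requests for the same dataset
    conflate into a single pending entry, and one worker drains them with a
    crude throttle between refreshes."""

    def __init__(self, refresh_fn, min_interval_s: float = 5.0):
        self._pending = set()
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._refresh_fn = refresh_fn          # e.g. issues a metadata refresh
        self._min_interval_s = min_interval_s
        threading.Thread(target=self._worker, daemon=True).start()

    def request(self, dataset: str) -> None:
        with self._lock:
            self._pending.add(dataset)         # duplicate requests conflate here
        self._wake.set()

    def _worker(self) -> None:
        while True:
            self._wake.wait()
            with self._lock:
                batch, self._pending = self._pending, set()
                self._wake.clear()
            for dataset in sorted(batch):
                self._refresh_fn(dataset)
                time.sleep(self._min_interval_s)   # throttle successive refreshes
```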
With Iceberg, it’s gone. When you ingest, it’s done. So again, for us it’s massive; it really, really simplifies our whole data pipeline. We’ve just started on this journey with Iceberg. We have our first workload going into production next month, but over the next 12 months, as we migrate to MinIO, we’ll also be migrating to Iceberg. So I think from our perspective this is a huge, huge enabler for us, and it also simplifies our stack massively. Some of the things we’ll be looking at in the future: elastic scaling, potentially with Dremio on-prem with Kubernetes, or even out onto AWS if we need proper burst. As I said, well, it’s kind of implicit, I guess: all of the hardware that we’ll be running this on right now is spinning disk, which is maybe not the natural or first choice for MinIO, but it was very important for us from a cost perspective that we could repurpose this hardware. I can’t stress that enough. If we weren’t able to do that, I’m certain I wouldn’t be here today talking about this. So as a first step in introducing MinIO, this was key. And it’s actually pretty complex if you think about it, because we’re decommissioning Hadoop in situ and putting in MinIO on the same actual hardware. It’s like running a marathon while doing open-heart surgery. The technical bar is pretty high, but we’re doing it from a cost-constraint perspective, and because fundamentally we have the right people in place who can enable us to do this. The other thing I’d mention here that’s important, and that has enabled us to do this migration, is that our group now has ownership of the entire stack, and by that I mean from the hardware all the way up. That’s a key enabler for us, because otherwise we wouldn’t be able to swap nodes out and in. There are some other things here, but they’re fairly self-explanatory.