May 2, 2024

Winning the War on Data with Dremio

Learn how TD Securities is transforming its data infrastructure by using Dremio to modernize its Hadoop data lake. This session covers TD Securities’ approach to deploying a data fabric to enable data virtualization across legacy databases and improve access to its data lake. We’ll cover the core business and analytics challenges and the impact of this transformation in driving faster and better outcomes for the business.

Topics Covered

Data Mesh and Fabric
Dremio Use Cases
Lakehouse Analytics


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Lucas Durand:

I’m excited to chat with you today about how we’re winning the war on data with Dremio. First, who am I? Why am I in the room? Why am I talking to you? So my name is Lucas Durand. Pronouns are he/him/his, and I’m with TD Securities, which is the capital markets and investment banking side of TD Bank, which, as I think we’ve mentioned a couple of times in other sessions today, is just outside the window if you want to look. Anyways, big bank from Toronto, all over the Eastern Seaboard now. So yeah, I’ve done a lot of different things with TD. I’ve been a software engineer, a quant, a data scientist. I used to be a physicist, and I’m here to talk to you about the war on data that’s raging right now and that we are winning. It’s a very exciting time, but it’s a very serious talk as well. Oh yeah, I have some background with different things. I like Python. I’m gonna try not to talk about it though. We’re talking about data. We’ll talk about Python though. So this is an introduction to me, but I think there’s more to it, because this is a war that we’re fighting. So let’s update this with some more relevant things here. So I’m still Lucas Durand, but I’m not really a director. I’m a lieutenant in the war on data. It’s raging. I’ve fought with TD for long years, and really the background doesn’t matter. It’s just, this is a pirate. This is a pirate war. We’re fighting on the seas, and data is the currency of this battle. 

The War on Data

Okay, so what’s happening here? The war is unavoidable. Data is everywhere. It fuels our most important processes across industries, across the globe, and it comes for us all. You will enlist. You will fight the fight, or your business will not succeed. It’s fought by land, air, and sea. It’s raged for generations. Data is not new. The way we work with it might be new, but the way we work with it might also be old. We’re just bankers on a ship in the ocean trying to fight a war. Okay, so some of this is true, but let’s actually ground this in what we’re doing. So like I mentioned before, TD Securities is the investment banking and capital markets side of TD Bank, and although we are pirates and we are fighting the war on data, we are doing it with mathematics and computers. We are not on boats. Most of the time we are in offices. Sometimes we’re on boats, but rarely related to the work or the war. But when we think about investment banking, it’s really rooted in data. So there are a couple of pieces that are the bread and butter of investment banking and capital markets from a technology and data lens, okay? 

So here they are. The first is pricing. Being able to price and understand the instruments that we’re buying and selling is really, really core to the business. It’s not as simple as looking up a value in a table. Sometimes it is. You can go and look at the stock ticker and say that’s the price, right? But there’s a lot more complexity, and more instruments that are derivatives and rely on other values. Actually pricing something is a difficult task. This has been the majority of the work in the data space in investment banking over the last few decades, and a lot of the systems are built around doing this in real time in order to compete in the market. The second piece is risk: being able to report on how we are doing. Are we managing our risk and staying within the limits set by regulators? 

That’s another big piece, and a lot of the data and technology infrastructure is built around it as well. The last piece in the data landscape, which is potentially the fun piece, is analytics. Now, you don’t always get to the analytics side, because having real-time risk and real-time pricing takes up a lot of effort, and it really has influenced the way that technology has developed in the space. So to actually address the bullet points on this slide: investment banking is very competitive. Data is at its core. You fall behind if you can’t even cover the table stakes of working with real-time data and real-time insights. Winning this means using the power of data effectively. Yes, it’s fiercely competitive, but things are also always changing. So rapid evolution of what we care about and what instruments we’re supporting is happening continuously, and we’re always trying to catch up with that: either being faster and faster, or supporting more and more diverse kinds of instruments with more complicated pricing, more complicated risk and valuation, or new regulations saying that we have to do X and Y in a different time frame or with new kinds of data. Okay, and then a little bit of fun analytics with like Twitter sentiment or something, right? But only on the weekends. 

Architecture Before

Okay, so with a lot of the infrastructure that ends up growing up in a business like this, you can see we have a little bit of everything. So data warehouses still play a very large role in the bank. A lot of data comes in through various different sources. Typically the way it works is that different desks, different classes of our data and our business, will have their own applications and their own warehouses. This is upwards of 75 different systems and different kinds of technology that we might care about. Different sources where data can be in different formats, different models, things that aren’t always, or are rarely, controlled by the business itself. These are vendor products that have been bought over the last few decades and integrated in. So data is in warehouses, right? What that means is that typically we have to deal with all of the different data governance at the warehouse level. There’s no nice abstraction on top of it, which kind of leads into what I want to talk about. We’re implementing the same things over and over with different systems in different places. 

Data apps are a big part of it as well. A big revolution 10 or 15 years ago was that everything became applications. We were exposing data through REST APIs. This made it much easier to build out more custom tooling and to integrate things from different sources. Now all of a sudden we have a common language to talk about data through REST APIs, but the problem is these are bespoke. Applications and services are written maybe with a common framework, maybe separately for different businesses or different systems, and then we have to go through a technology change process every time we want to update something. So something like data quality is still being done at the system level. It’s all being done in different places, and it’s difficult to try and manage this, so we’re making technology changes when we want to change data, which is different from what you do with a warehouse. So benefits, pros and cons. 

Alright, so then comes the advent of the data lake. All of a sudden we’re having things in one place with common tooling. It’s commercial, off-the-shelf, scalable in many senses, and it really pushes forward the analytics use cases. All of a sudden it doesn’t take weeks or months to build reports. We don’t have to do ETL once the data is in the lake; we can do it at runtime. Data scientists are excited about getting real insights and analytics and feeding the business through that. It’s an exciting time to be working with data, but it requires a lot of advanced skill sets. So now we have this data scientist character with a very knowledgeable background in science and research and experimentation. They’re learning a lot of more complex languages, starting to work with frameworks where they’re thinking about data at scale. We’re not just writing SQL or working with a REST API and abstracting away all of the work to a project or a framework or a service. Instead you have somebody who has to actually think and compute and tune, right? So it’s difficult to really scale that up without hiring swaths and swaths of these expensive data scientists, of which I am one.

So there we are, and inevitably over time the data lake becomes a data swamp. Now, I’m a big supporter of swamps. I think they’re an exciting biome, and I think a lot of interesting life lives there, and I think that’s something that really only a scientist can say. The reality is, if you’re a business person and you’re interested in insights and you want to know how your business works, you don’t want to go out into a swamp and wait around and smell and experience it. You want something a lot cleaner. So inevitably lakes become swamps, and this can become a difficult thing to support. So before I talk a bit more about where this has gone and what the future looks like, let’s try to understand a bit about the journey of somebody who’s using these tools. How can it get better? What do we want this to be like? Some of the pitfalls will come out here. 

User Story: Jenny’s Data Journey

So this is about Jenny. Jenny wants to consolidate some stock loan data as part of our merger with a new company. It’s kind of a difficult process to go through, and there’s data in a lot of different sources and a lot of different places like I mentioned, and there are a couple of different phases we’ll go through. So the first piece is discovering the data. Typically, in a traditional or old-world sense, when we’re working with data apps and warehouses, we’re lacking a catalogue or something common that says, this is where the data is. A lot of emails get exchanged. Probably what we’re looking at is messaging and emailing and walking around, going to people’s offices and asking, where is this data that I need? I’m setting up a meeting with you early next week to talk about your stock loan data, or do you know who has the stock loan data? You’re engaging a network of people to try and find something, which is obviously not a great way to do things. Another piece of this is that probably what you end up getting at the end of the day is somebody else’s derivative, extracted data. Oh, I have a copy of this. It’s an Excel file. I’ll email it to you. Is this what you want? Going forward I might be able to keep emailing this to you. You could start to build a solution off of this, but you’re getting tied into something that’s not very repeatable, not very reliable. It’s certainly very backwards in terms of the way that we want data to work. 

It might take you weeks to build out something and develop some insights and intuition about your data. You probably have to take that Excel file that someone’s emailing you and ingest it into a database or whatever your new process is. You’re going to have to engage a team on the cloud or operations side who will spin up compute and hardware for you, because there isn’t some existing data place. You need to create one or requisition one. And that’s procuring infrastructure. It’s not very fun. Finally, you have to write some code to actually take that Excel spreadsheet from the email and start ingesting it, and you probably find out at some point that this is a derivative Excel spreadsheet. There’s a real source somewhere that you can get access to, and once the firewalls are opened you can connect to it and start to build out your process, right? And nothing’s really gotten better throughout all this. We’ve gone through this somewhat antiquated ETL process with way too many emails, and a spreadsheet was involved that should never have been involved. It’s not good. 

And then at the end of all of this you’ve built a technology process. Hopefully over time you’ve started to build out some common framework for it, but likely what you’ve done is set up some logging, monitoring, and alerting, and now you have your own little technology process that is going to tell you when things are good and bad. But it’s all very bespoke. It’s all very hard to maintain. Using Dremio, we can do a lot better than this. Our whole data discovery piece is solved by having a data catalog, and a data catalog that’s automatically generated from the metadata of our data. That makes it really easy for me to look around and see: does anything here look like the kind of thing that I need? I mean, worst case scenario, you’re still reaching out to an owner of that data to find out more about it, but the best case is you can see, evaluate, and understand that this is the data that you need. You can start to build out that process. 
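To make that discovery step concrete, here is a minimal sketch of browsing the catalog through Dremio’s REST API from Python. The host, port, and credentials are placeholders, and the exact endpoints can vary by Dremio version, so treat it as a starting point rather than a drop-in script.

```python
# Minimal sketch: list top-level catalog entries via Dremio's REST API.
# DREMIO_URL, the user, and the password are placeholders for illustration.
import requests

DREMIO_URL = "http://dremio.example.internal:9047"  # hypothetical coordinator

def get_token(user: str, password: str) -> str:
    # Log in and grab the token used to authorize subsequent API calls.
    resp = requests.post(f"{DREMIO_URL}/apiv2/login",
                         json={"userName": user, "password": password})
    resp.raise_for_status()
    return resp.json()["token"]

def list_catalog(token: str) -> None:
    # Top-level catalog listing: sources, spaces, and the home space.
    headers = {"Authorization": f"_dremio{token}"}
    resp = requests.get(f"{DREMIO_URL}/api/v3/catalog", headers=headers)
    resp.raise_for_status()
    for item in resp.json().get("data", []):
        kind = item.get("containerType") or item.get("type")
        print(kind, "/".join(item["path"]))

if __name__ == "__main__":
    token = get_token("jenny", "********")
    list_catalog(token)
```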

We can work directly inside of the platform. We don’t need to requisition new hardware. You know, this is a modern scaling solution. We don’t need to build out something custom just to work with it. In terms of architecture, we maybe don’t even need to build custom code or a loader or some sort of connection. We can virtually define how we need things to move around in order to be in a format that we can use. Very exciting. And in terms of development, potentially we don’t have to write pipelines either. Really, we can focus on a governance process now, where we’re working through approving the data, making sure it’s in the formats we want, and doing it all in a very transparent way. And maintenance is great. We can just use APIs. We can just access the data. We can really focus on things like data quality and governance and understanding the risk that we’re taking on. It’s a lot better when we don’t have to worry about the infrastructure, architecture, and pieces like that. Great. 
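As a sketch of what “virtually define how things move around” can look like, the curation step can be a SQL view submitted through the same API instead of a new pipeline. The space, source, and column names below are hypothetical, and the view syntax should be checked against your Dremio version.

```python
# Minimal sketch: define a curated virtual dataset instead of writing an ETL job.
# All object names (ops_space, source_db.stock_loan_raw, columns) are made up.
import requests

DREMIO_URL = "http://dremio.example.internal:9047"  # hypothetical coordinator
login = requests.post(f"{DREMIO_URL}/apiv2/login",
                      json={"userName": "jenny", "password": "********"})
headers = {"Authorization": f"_dremio{login.json()['token']}"}

create_view = """
CREATE OR REPLACE VIEW ops_space.stock_loan_curated AS
SELECT trade_id,
       CAST(loan_amount AS DOUBLE) AS loan_amount,
       UPPER(counterparty)         AS counterparty,
       CAST(trade_date AS DATE)    AS trade_date
FROM   source_db.stock_loan_raw
"""

# Submit the DDL through the SQL API; the response carries a job id that can
# be polled at /api/v3/job/{id} until it completes.
job = requests.post(f"{DREMIO_URL}/api/v3/sql",
                    headers=headers, json={"sql": create_view})
job.raise_for_status()
print("submitted job", job.json()["id"])
```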

Architecture After

So all of this requires that we have some sort of centralized architecture, and this is where a data fabric, a lakehouse, and, in a future state, a mesh become very, very powerful. We still like our existing data ecosystem. Some of it is bespoke. Some of it is required in order to have things like low-latency, real-time streaming data. You know, we’re not going to solve that with something off the shelf, something open and available. And we want to keep that. There’s a lot of effort put into it. There’s a big understanding of how all that data works. We know it and love it. We don’t love all of it, but we love enough of it. There are lots of different data models, like I mentioned. Things in a lot of different places. Storage media, connection protocols, and governance are still stuck in all the different places. By putting this behind a data fabric and virtualizing these sources into one single place, now we can start to tackle governance properly across all of our data and think about the data model across our business, instead of, at best, across an asset class or a specific line of business. An ideal case would be that a line of business has its own singular data warehouse, but that’s just not the reality either. 

So now we have a single warehouse-like object across all the lines of business. We can start to standardize these things. We can start to build out the idea of: what is our data model? What is our common taxonomy? How do we talk about different entities in our dictionary of terms? And we can start to do curation, governance, and cataloging of data at that level. Before, we were spending a lot of resources doing data quality and governance at a granular level: not only having to think about each of these separate databases or data sources separately, but also having to do it in different frameworks or languages or paradigms depending on whether it’s an object store or an API or a NAS drive, where the data is and how we’re interacting with it. That plays a big role in how quickly we can iterate on this. And really we want to have a standard across all of our data. So this is a big help for that. 
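To sketch what that single, standardized layer can look like in practice, one governed view in the fabric can sit over data that physically lives in two different sources, so naming conventions and quality rules are applied once on top rather than per system. Every source and column name below is hypothetical.

```python
# Minimal sketch: one governed view over two physically separate sources
# (a legacy warehouse and a lake source), submitted through Dremio's SQL API.
# Source names, columns, and credentials are illustrative only.
import requests

DREMIO_URL = "http://dremio.example.internal:9047"  # hypothetical coordinator
login = requests.post(f"{DREMIO_URL}/apiv2/login",
                      json={"userName": "jenny", "password": "********"})
headers = {"Authorization": f"_dremio{login.json()['token']}"}

cross_source_view = """
CREATE OR REPLACE VIEW governed.positions_enriched AS
SELECT p.trade_id,
       p.notional,
       r.counterparty_name,   -- standardized name from reference data
       r.lei                  -- legal entity identifier
FROM   warehouse_src.trading.positions   p   -- legacy warehouse source
JOIN   lake_src.reference.counterparties r   -- data lake source
  ON   p.counterparty_id = r.counterparty_id
"""

resp = requests.post(f"{DREMIO_URL}/api/v3/sql",
                     headers=headers, json={"sql": cross_source_view})
resp.raise_for_status()
```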

And not having to worry about architecture and infrastructure: just point at my data where it is, and you’ll do the rest. As well as having all that good stuff from a lake, dynamic scaling in a centralized location. Also magic. So where we’re going with this is towards the lakehouse. We want to take what we’ve learned and continue to have that one central place where data can come into in the future, but not throw away the pieces that belong somewhere else, because they’re some bespoke piece of architecture or hardware that we need in order to do X or Y. And also not have to worry in the future about requisitioning new hardware, scaling, or the fact that 85 or 90% of our use cases are probably fine in a data lake kind of architecture, right? It doesn’t all have to be real-time and streaming. It doesn’t all have to be on all-flash drives like VAST would do for you. But the option to integrate that in the future? Great. 

Winning the War on Data

So taking all this together, this is really helping us to win the war on data. We’re back to that now, so the pirate hats are back on. Some of the wins that we’re seeing here are that we’re able to focus on these bigger initiatives: governance, data quality, onboarding new data at scale. Being able to understand our risk better, and the data we have better, with fewer resources means that we can focus on onboarding new data. A big part of the success of large financial firms is the amount of diverse data that you have access to. And looking into alternative data, not just the standard market data, reference data, and trade data, but looking outside of that into more indicators across different asset classes, like Twitter sentiment data or climate data or things like that, really unlocks the kind of analytics and the kind of data-driven decision-making that we can do. And that’s something we can focus on more now, because we don’t have to put the resources into having these bespoke teams or very large groups of experts managing something bespoke to a specific kind of business. 

We’re also seeing that the people who are closest to the data, the domain experts, are able to engage without having to also be technology experts, database experts, data scientists, right? The tooling itself is allowing people to be part of that data process, potentially end to end, without having to wear multiple different hats throughout the process. And so we’re seeing our high-value customers engaging directly, which is fantastic, and that’s how we want to take that knowledge and derive value from it. Fantastic. 

So this is the story I wanted to put out here about how we’re fighting the war on data, how we’re winning it, and how it’s starting to allow us to differentiate ourselves more against our top competitors. And it’s absolutely a big driver towards our goal of being a top 10 investment bank. So huge thanks to the data and to Dremio. It’s really something we’ve seen driving results in the last couple of years since adopting this. So I do want to introduce the concept now of Q&A. I’d love to hear some questions from you and answer them. And just to say thanks for tuning in and for coming to the room, which is hot, I will mention. It’s very hot. I also wanted to flash over a thanks, I think. And if anyone’s interested in Python and the unofficial Dremio client and wants to chat later, I think that would be cool. I’m interested in: is this still relevant? Do we want to revive it? Could this be an open-source collaboration? Just really curious about it.
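For anyone who does want to chat Python afterwards, here is roughly what querying Dremio from Python over Arrow Flight with pyarrow looks like, in the spirit of that unofficial client. The host, port (32010 is Dremio’s usual Flight port), credentials, and dataset name are placeholders.

```python
# Minimal sketch: run a query against Dremio over Arrow Flight with pyarrow.
# Endpoint, credentials, and the dataset name are placeholders for illustration.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio.example.internal:32010")
bearer = client.authenticate_basic_token("jenny", "********")
options = flight.FlightCallOptions(headers=[bearer])

sql = """
SELECT counterparty, SUM(loan_amount) AS exposure
FROM   ops_space.stock_loan_curated
GROUP  BY counterparty
"""

# get_flight_info plans the query; do_get streams results back as Arrow batches.
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()
print(table.to_pandas().head())
```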