March 1, 2023

3:00 pm - 3:30 pm GMT

Reimagining One Mount’s Data Economy with Multiple Business Units

Our journey begins with the view that we can create enhanced value by leveraging the synergies between our businesses through strategic integration of the data that exists in our business ecosystem. As with many organizations, we were not exempt from data silos and diverse data technologies. This story picks up at the point in our journey where our use case was to simplify our data ecosystem. And so we embarked on a journey to reimagine our data technology foundation, enabling us to realize a data economy based on streamlined data exchanges that enrich our business ecosystem.

Topics Covered

Customer Use Cases
Keynotes
Lakehouse Architecture
Real-world implementation


Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Gia Truong:

Okay, good afternoon, everyone. It’s normally morning my time, where I’m from, but it’s all good. So I’m Gia, and my topic today is the journey we’ve gone on to reimagine our data ecosystem in our organization. We’re a relatively young company, so I’ll talk through a little bit about that in today’s talk. The topic is really reimagining our organization’s data economy, and data is at the heart of what we do and of our business, really. Our business really is an ecosystem of startups. I’ve already been introduced, so I won’t go too much further, other than that, personally, I’m a husband and a father of two teenagers, so shout out to all those parents and fathers in here. I’ve worked most of my career in Melbourne, Australia.

One Mount

But two years ago I moved to Hanoi, Vietnam, to help One Mount Group, which is a start-up incubator and accelerator. It’s really an ecosystem of businesses. So what is One Mount? Founded in September 2019 and based out of Hanoi, but geared towards really building up Vietnam as a country. Their goal when they first started was to build Vietnam’s largest digital ecosystem. We exist to help Vietnam progress in terms of its technological sophistication. Technology and data, as I mentioned, are at the heart of what we do. Who do we serve? We serve everyone, from people and businesses through to large corporates and enterprises. Our vision really is to bring Vietnam into Industry 4.0, and our mission is to empower individuals and businesses to reach their full potential.

To give you a bit of context before I go into our data journey and why I keep saying data and technology are at the heart of what we do: we are an ecosystem of businesses, and three of the key businesses that we run are, first, a customer engagement platform we call VinID. VinID is a super app: e-wallets, customer loyalty points, the ability to accumulate points and then use them on the platform to purchase things like football tickets, events, and so on. So that’s one of our businesses, and it includes one of Vietnam’s largest loyalty programs. The second is One Mount Distribution, which is about becoming a trusted partner of distributors as well as businesses.

So a lot of B2B: we aggregate supply chain distribution for a lot of businesses. And then the last one is real estate. Unlike in more mature markets, with real estate here, picture getting a quote from a real estate agent: how do you know who and what to trust? We have to create a trusted ecosystem around property and property pricing, and building automated valuation models is part of what we do. We built and deployed that, again, leveraging these platforms and our architecture. So you can actually go to our websites and see that we’ve valued property in Vietnam. And through that, over time, with enough data, we can create more accurate models that people can actually trust. So that’s at the core of what we do, and data underpins a lot of that along with technology.

Our Data Journey

And so, our data journey. The case for change for us, as I mentioned before, is the opportunity to create enhanced value by bringing data together and exploiting the synergies between the data from our various businesses. Each business on its own creates its own value, but when brought together in our ecosystem, we can create more enhanced end-to-end services across the whole value chain. Now that’s the opportunity itself, but every story has a villain and an obstacle. And the villain here, as I think you’ve all mentioned before, Richard, Peter, Shavil and everyone else, is data silos, because these businesses did exist on their own before we brought them together. So we have data silos, we have diverse technologies and tools, and that gets in the way of us creating synergies in a cost-effective and efficient manner.

One possibility is to not do anything. The other possibility is to reimagine and look at how we could bring all of this together. Really, the data economy for us is a streamlined exchange of data between our businesses, allowing them to operate independently as their own businesses, but treating the ecosystem in such a way that data can be easily exchanged, so there’s not a lot of friction in the sharing process. As Richard mentioned before, if there’s a data set that I need to share, just like when I bring in an external data set, I need to be able to share it across the ecosystem as easily and quickly as possible. But there are legalities, because they’re all legally separate businesses, not just business units, and so we have to deal with that from a governance but also a legal perspective.

Data Organization

Before we go further, let me give some context on how we organize our data organization. This isn’t really complete, but it gives you a bit of a sense, and I suspect a lot of organizations organize this way. In each of the verticals you see here, we have data scientists and data analysts sitting in our CDO, our chief data office. I left some of it out, but we also have our data governance and data management team, which I lead, in there, and then we’ve got our data engineers, IT operations, product teams, the rest of the business operations and so on. Just to keep it simple, I didn’t put all of that in there. Each of these verticals is basically one of our businesses. And if you treat them each as their own business, they’re all really business domain teams, and they create data products.

I’ll give you an example: in the middle I put RE, which is our recommendation engine. One of our business domains built a recommendation engine for our customer engagement platform. That recommendation engine is not only used in a customer B2C context; we actually use the same thing in our B2B supply chain business as well. So these products get built in one business but shared across all the businesses. And underpinning that is our data platform, where we can easily share data across the ecosystem.

Our Solution

And so we looked at how to address the opportunity we had, which is: how do we effectively share data? As you saw in the diagram before, one of the challenges we had was multiple data platforms, and some of the teams actually worked on only one platform. When we wanted to share, the data engineers needed to do more work; they needed to write more code to synchronize and distribute data across the different platforms. That creates unnecessary work. So that was one of our cases for change: how do we create synergies between the data by removing some of those silos, so that sharing becomes more streamlined? Other questions we asked ourselves were: how can we build something more evolutionary?

Evolving Our Platform

So how can our platform evolve over time? And you’ll see that we actually evolved the platform even during delivery; you’re about to see some of those examples as I walk through the architecture and the journey itself. How do we make it reusable, as I mentioned before, so it can be reused across our ecosystem? How do we allow our users, our user community and our business domain teams, to self-serve and create their own data products? And how do we control costs better, or at least feel like we have better control over our costs? There are a lot more questions, but those are a few of the high-level ones we posed, and they brought us to how we realize the vision: the five categories I mentioned before, starting with synergizing through our data.

For us, that implies centralized IT but federated consumption. In terms of being evolutionary: a hundred percent cloud-based, the ability to be as cloud agnostic as possible, and minimizing vendor lock-in. We only have a few vendors on our platform; most of our technology is open standards as a result. Dremio is one of the very few we’ve actually committed to, actually purchased and procured, but other than that, everything else is mainly open standard. Reusable, like I mentioned before: build once, leverage across multiple businesses. You can imagine our architecture not only onboarding new startups, but also, when we sell a startup, needing to carve out that entire architecture and separate all of its data, and we’re built for that, or at least built to try and serve that need, so they can sell a business and carve the whole technology out as part of the sale.

Creating their own data products: this is very self-serve, letting teams deliver their own products. They were already doing it, but it wasn’t very efficient, and the architecture allows for that. Cost control, again, means taking control of the cost back. If you can envision where we were coming from, it was a BigQuery background as well as traditional warehousing, so think MySQL, Oracle and so on. On top of that, we had multiple platforms with these technologies. Those of you who have used cloud warehouses would know the challenge: they’re easy to use, but geez, it’s very hard to control costs. And that was a challenge.

Our Journey with Dremio

But there are other things, whether they’re easily apparent at the start or not, that emerge when you go on a journey like this with Dremio, or at least that’s what we found coming from BigQuery. A lot of the benefits that you get from platforms like BigQuery in terms of scalability turn into challenges that show up in a different way when you implement a platform like Dremio. It’s more about organization and people; that’s the problem, and I’ll mention a couple of those in our learnings. So the architecture itself, at a high level: synergies with data. For us, the synergy is really Dremio and Spark. These are the two layers. Spark compute is used primarily in the data engineering layer to actually transform the data itself, and Dremio is the single layer through which all our users access data. The exception is data science: we do open up access for data science through the Spark layer, because it’s unnecessary to push that volume of data through Dremio.

Sometimes they’re actually accessing almost the entire dataset. And so we don’t want that going through Dremio because it’s a waste of the compute itself. So depending on the use case, we will bypass Dremio. But generally speaking, our principle was everything through Dremio as a first point. And then if need be, we will make exceptions for it.
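To make that split concrete, here is a minimal, hypothetical sketch of the access pattern described above: analysts and BI users query Dremio (one common way from Python is Arrow Flight), while data science jobs that need near-complete datasets read the underlying tables directly through Spark. The host, port, credentials and dataset names are assumptions, not One Mount’s actual configuration.

```python
# Hypothetical sketch: querying Dremio over Arrow Flight with pyarrow.
# Host, port, credentials and the dataset name are all assumptions.
import pyarrow.flight as flight

client = flight.FlightClient("grpc+tcp://dremio-coordinator:32010")
bearer = client.authenticate_basic_token("analyst", "analyst-password")
options = flight.FlightCallOptions(headers=[bearer])

query = "SELECT customer_id, points_balance FROM loyalty.points_summary LIMIT 100"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
table = client.do_get(info.endpoints[0].ticket, options).read_all()
print(table.to_pandas().head())

# Data science jobs that scan near-complete datasets bypass Dremio and read
# the underlying table directly with Spark instead, e.g.:
#   spark.table("lake.loyalty.points_events")
```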

Evolutionary

Evolutionary. There are quite a few components here that make it evolutionary, but I’d say the key piece that made it possible for us to be more cloud agnostic is the table formats. Our architecture actually uses both Delta Lake and Iceberg in harmony; we can switch between the two, and we actually did so during delivery as well, and I’ll share a little about that experience. Reusable: cloud storage, DBT, Spark, Airflow. Anyone here use DBT in their organization?

One hand up. Awesome, awesome tool. Awesome technology, actually; I’ll share why later. It was one of the things we decided very early on to adopt, because it gave us the flexibility we needed. Creating their own data products, as mentioned before: self-service enabled, again through Dremio. And controlling our costs, which was important for us, so a lot of that was also through Dremio. On other platforms that auto-scale, you can write a lot of queries, including highly inefficient queries, and still get the job done, because the platform just scales without limit. That’s the thing with Dremio: you can’t do that. You can’t write inefficient queries; it doesn’t work, because we are choosing not to scale. We purposely did it for that reason. And so that problem shows up organizationally. Our people were not geared towards writing efficient queries in a number of cases, and they struggled to migrate onto the platform when we did the migration. So deploying something like this requires that we actually skill up our people to write better SQL queries and so on. There’s a benefit to it, but it’s hard work to get there.

Outcomes

Okay, I’ll get to the outcomes, and then I’ll go back and talk a little bit about the architecture and the learnings themselves, because I think that’s probably one of the most valuable things we got out of this. Project time was 20 months. We decommissioned two of our legacy data platforms that housed pretty much all of our data; by the time we finished, we had cleared close to half a petabyte of data off the legacy platforms. About 2,000 Kafka topics: we stream data all through Kafka, except for those platforms we can’t stream from. SAP was a struggle, Salesforce was a struggle, so we actually do batch on those ones. Five business domains, like I mentioned before: supply chain logistics, FinTech, retail, real estate, as well as corporate.

We run somewhere around 150,000 to 200,000 common table expressions through Dremio every day. A common table expression can sometimes have 20 SQL queries in it, sometimes five, sometimes 10, whatever it might be, so these aren’t necessarily individual SQL queries on their own; the volume of SQL is actually much higher than that, but we run roughly around that number. The size of our cluster is around 16 compute nodes, and we use this platform across all our businesses. We went live in December last year and we shut down both legacy platforms in January this year. I’ve done consulting for 20 years and I know what it’s like to try, and fail, to shut down a data warehouse. Many, many times.

This journey was important for us because we had top-down support and the bottom-up vision was clear: top-down backing, and the people on the ground actually understood the vision. We are a size where we can operate with sufficient flexibility. I can imagine, like Peter, you probably might struggle with that; I’ve helped a lot of large enterprises in a past life and I know what it’s like in large enterprises, and our experience is that of a small to medium business. So as I share this example of our journey, it’s really one of that size. We didn’t have too much friction in rallying the organization, top down and bottom up, and if something was missing in the plan, people pulled together and got it done. That’s how we got this done in the timeframe that we did.

Learnings

But I can certainly imagine larger organizations will probably struggle, just because I’ve been through that journey before as well. So that was the outcome itself. Now for some of the learnings. I haven’t put all the learnings here, there are a lot more, but I’ll share a few just to give a flavor, and I’m open to questions after this, because I suspect everyone here has been on these kinds of journeys before and you have your own questions. So these are key decisions and learnings. I really didn’t want to frame them as challenges; I think that’s a bit negative, so I’m looking at it more as a positive. From the ingestion perspective, making the decision on Kafka topic granularity, whether it’s at the database level or the table level, makes a difference, because your Kafka architecture then needs to scale. At the table level, even for our small business, we’re talking 2,000-plus topics, which can be sizable for a Kafka architecture. We use the schema registry, so Confluent is the other software that we procured; this is not open source Kafka. We use the Confluent Schema Registry to give us flexible versioning of schemas so that we can manage schema evolution in our ingestion. Debezium is our CDC producer; we found that Debezium was fit for purpose for our needs. If you have a lot of Oracle, you might want to go to GoldenGate, from our experience, because GoldenGate CDC works better with Oracle. Debezium works well with all the other ones, like MySQL and so on, and it does work with Oracle as well. The challenges with CDC are some of the minor things we’ve found, like file format conversions.
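As an illustration of the ingestion setup described above, here is a hedged sketch of registering a Debezium MySQL connector with Kafka Connect and the Confluent Schema Registry from Python. The hostnames, credentials, table list and connector name are made up, the config is trimmed (required settings such as the schema history topic are omitted), and the property names vary between Debezium versions.

```python
# Hypothetical sketch: registering a Debezium CDC connector via the Kafka
# Connect REST API, with Avro and the Confluent Schema Registry so schema
# evolution is tracked centrally. All values below are assumptions.
import json
import requests

connector = {
    "name": "mysql-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "cdc_password",
        "database.server.id": "5501",
        # Debezium 2.x option; 1.x uses database.server.name instead.
        "topic.prefix": "oms",
        # Topic-per-table granularity: topics become oms.orders.order_header, ...
        "table.include.list": "orders.order_header,orders.order_line",
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
        # Additional required settings (e.g. schema history topic) omitted here.
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",
    data=json.dumps(connector),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
```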

DBT

Or really, data type conversions. They can be tricky; they can really stuff things up when you’re ingesting and the data type is changing underneath you. So that’s going to be important; that was an important learning. And if you don’t have Oracle, in our experience Debezium works much better. Spark: we use Spark Streaming as well as batch for ingestion. Data preparation: DBT and the open table formats actually underpin our entire architecture, including the ability to move between Delta and Iceberg. We first went Delta, and the reason we went Delta, even though we eventually wanted to go Iceberg, was that when we first adopted Dremio almost two years ago, Iceberg wasn’t really supported. So we went Delta first, from ingestion all the way through to our data preparation layer, which is our DBT transform layer, all the way through to the serving layer where Dremio picks up the data and serves the users.

But on our journey, we found that data latency, the refresh in Dremio, was a problem for us. It wasn’t refreshing fast enough; even though the underlying Delta tables were already updated, Dremio couldn’t refresh quickly enough. So what we did was switch to Iceberg, and we’d already built the foundation to do that. We chose DBT because we wanted a declarative-based solution, and DBT gave us the ability to implement one: once we declare the code, we can just switch the table format underneath. So while we were actually going through delivery of our project, we switched to Iceberg without much change of code at all. In fact, we switched one of our businesses entirely to Iceberg within a week.
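This is not One Mount’s actual DBT code (DBT models are SQL plus configuration), but a minimal PySpark sketch of the same declarative idea: the transformation is declared once and the storage format is a single switch, so moving a business from Delta to Iceberg doesn’t touch the model logic. The catalog, database and table names are hypothetical.

```python
from pyspark.sql import SparkSession, DataFrame

TABLE_FORMAT = "iceberg"   # the single switch: "delta" or "iceberg"

def materialize(df: DataFrame, name: str) -> None:
    """Persist a transformed model in whichever open table format is configured."""
    if TABLE_FORMAT == "iceberg":
        # Assumes an Iceberg catalog named `lake` is configured on the session.
        df.writeTo(f"lake.analytics.{name}").using("iceberg").createOrReplace()
    else:
        # Delta Lake managed table; the database name is an assumption.
        df.write.format("delta").mode("overwrite").saveAsTable(f"analytics.{name}")

spark = SparkSession.builder.getOrCreate()
orders = spark.table("lake.raw.orders")        # hypothetical source table
daily = orders.groupBy("order_date").count()   # the model logic never changes
materialize(daily, "orders_daily")
```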

Actually the [inaudible] is just a day, right? Because you just flip a switch and it’s done. However, DBT doesn’t support Iceberg out of the box, and we are using open source, so we had to customize: we customized DBT to support Iceberg with the Hive metastore. That’s important because, as I touched on, there’s another customization we did later. So, DBT and Iceberg: we use the DBT Spark driver with DBT, and then we customized that to support Iceberg. For data serving, Dremio is the single platform and we use it for everything, so we do BI and analytics on this as well. It requires resetting expectations; for us it did. We came from a BigQuery world, at least one of our platforms was.

BigQuery would just munch and crunch and do everything, so we had to reset some expectations on what it’s like to migrate to a platform like Dremio, and people needed to skill up. That’s the last point under data serving: end-user SQL knowledge actually needs to be trained up when transitioning. Our learning was that we didn’t spend enough time training people, and as a result they struggled to [inaudible] into Dremio because they couldn’t write efficient queries. What used to work on BigQuery and ran in 30 seconds won’t even finish on Dremio if you lift and shift it as is. There were things we needed to change to make it work, but once we did, it still ran in the time that we needed.

Compromise

But there needs to be compromise. Not everything can run as fast as it did on BigQuery, and that’s just the reality. So we accepted where we needed to compromise on performance in order to standardize on one single platform, without needing to go back to loading into another database just for the purpose of doing BI and analytics. And as I mentioned before, we switched to Iceberg because it gave us the data refresh rate we needed, the freshness of data through Dremio. When we used DBT to write Iceberg files, by default it used the Hive metastore, so you could have Dremio hook into the Hive metastore that’s part of the DBT setup to read the Iceberg files. But we found that that wasn’t stable, not in our deployment.

We use Kubernetes for our entire deployment across all our architecture. That’s part of what gives us the ability to move between clouds; it’s one of the underpinning principles of the cloud agnostic architecture that we built, along with the open table formats. So what we actually had to do was customize DBT Iceberg to use the Hadoop catalog, and once we did that, Dremio could read the files directly without going through the Hive metastore catalog. This only really makes sense for those who actually use DBT; if you don’t, then it probably doesn’t matter much.
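For readers who use DBT with Spark, here is a hedged sketch of what an Iceberg Hadoop (filesystem) catalog looks like on the Spark side, which is the kind of setup the customization above relies on: tables live entirely as files plus Iceberg metadata on object storage, so an engine like Dremio can be pointed at the same location without going through a Hive metastore. The catalog name and bucket path are made up, and this is not the actual DBT Spark patch.

```python
# Hypothetical Iceberg Hadoop-catalog configuration for Spark; requires the
# matching iceberg-spark-runtime jar on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dbt-spark-iceberg-hadoop-catalog")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    # Warehouse location on object storage; the bucket name is an assumption.
    .config("spark.sql.catalog.lake.warehouse", "gs://example-lakehouse/warehouse")
    .getOrCreate()
)

# Tables created through this catalog are just files + Iceberg metadata on the
# bucket, which other engines can read directly from the same path.
spark.sql(
    "CREATE TABLE IF NOT EXISTS lake.retail.orders "
    "(order_id BIGINT, amount DOUBLE) USING iceberg"
)
```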

Full Lineage

For metadata management, we actually have full lineage from start all the way through to the end, as long as we use DBT. We have the DBT Dremio driver and we have DBT Spark, so from the point of ingestion, through transformation, all the way through to the Dremio end tables, we have full lineage. Why is that important? We can ingest that lineage into DataHub. Anyone here who uses DataHub from LinkedIn? It’s open source; you can use it for free, and it’s actually a great metadata search engine. We use it for our glossary, our cataloging, for data visibility as well as data quality; that’s a lot of our focus for this year. We bring the DBT lineage from our ETL directly into the catalog. DBT itself has a lineage graph that it generates, so the moment you build the code, it generates the full lineage for you, but we chose DataHub as our centralized data governance and data management solution. And then cloud costs and Kubernetes: I think Kubernetes is highly valuable in terms of the simplicity of managing, but those who use Kubernetes would know cost overruns are very easy if you don’t control it.
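As an illustration of pushing DBT lineage into DataHub, here is a hedged sketch using the acryl-datahub Python package’s programmatic pipeline. The manifest and catalog paths, target platform and DataHub endpoint are assumptions, and the exact source options can differ between DataHub versions.

```python
# Hypothetical DataHub ingestion recipe, expressed as a Python dict and run
# programmatically; paths and endpoints are assumptions.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "/dbt/target/manifest.json",
            "catalog_path": "/dbt/target/catalog.json",
            "target_platform": "spark",
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://datahub-gms:8080"},
    },
})
pipeline.run()
pipeline.raise_from_status()
```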

Cloud Costs

So we have our own alerting mechanisms to alert on and then fine-tune a lot of our Kubernetes deployments, particularly in this data platform. And then cloud costs: if not implemented properly, your cloud cost can be exorbitantly large, particularly depending on how you ingest data and how many operations you run, because cloud cost doesn’t come from the size of the data, it comes from the operations that we run on it. So optimizing a lot of the architecture for insert-only and read, versus updates, is critical. You update when you need to, but you want to minimize updates as much as you can, because an update costs more operations than an insert or a read. On Google Cloud, one write equals 12 reads, so ideally you want to write once and read as much as possible. Those were the things we did to optimize our costs.
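To make the insert-versus-update point concrete, here is a hedged PySpark sketch of an append-only pattern: CDC events land as inserts in a changelog table and the "current state" is resolved at read time, rather than updating rows in place on object storage for every change. The table and column names (order_id, change_ts) are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Insert-only CDC landing table (hypothetical name).
changes = spark.table("lake.retail.orders_changelog")

# Latest version of each order, computed at read time instead of via UPDATEs.
latest = (
    changes
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("order_id").orderBy(F.col("change_ts").desc())))
    .filter("rn = 1")
    .drop("rn")
)

latest.writeTo("lake.retail.orders_current").using("iceberg").createOrReplace()
```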

What is Next For Us?

And so what’s next for us? There’s one obvious thing I didn’t put up here: we’re still on Dremio and we will need to keep upgrading Dremio. It was obvious, so I left it out, but we’ve just upgraded to 21.7.1 in our non-production environment, and we’re going to keep working with Jacine and the team. We thought that version made a lot more sense for us for now, but ideally sometime this year we’d make a major version jump, because there are a few features we really want in version 23 and above.

So that’s what we’re doing, and a very key part of our operations for this year is managing and easing upgrades into production with relatively few issues. Beyond that, a key focus is integrating DataHub with the rest of our platform: data catalog, business glossary, the metadata itself, metadata search. That’s a really key part of our focus for this year. Also enhancing data validation, reconciliation, as well as alerting. We’ve built the foundation, but it needs more enhancing. For example, particularly with CDC, our lowest latency right now is one minute, so we ingest data every minute.

We could go lower, but we’ve chosen not to because there’s no business need; we don’t need it, and so we haven’t done it. With that kind of latency, if the synchronization runs into issues, we need to resolve them at a much faster rate. And if we don’t have sufficient alerting and reconciliation ability, then even if nothing is failing, if we’re missing data we need to know that we’re out of sync with the source. That’s really critical in a very low latency solution. Another thing that’s pretty big for us this year is our DR setup, implementing as well as testing it. And then remodeling: when we did the migration, particularly from BigQuery, but really from the two existing warehouses, we only did the minimum required from a remodel and ETL perspective.

We tried to lift and shift as much as possible, and even that alone was quite a bit of effort. The reason we did that was that we wanted to simplify the reconciliation process. We did not want to remodel and re-engineer all the ETLs, even though we knew we needed to, because then we would have had nothing to reference back against to prove that we migrated correctly. So we migrated as-is as much as possible, and when we were done, we had a reference point to reconcile against the source. Now that we’re operating as BAU, we’re going to refocus on restructuring and remodeling our data and redoing a number of the ETLs that we knew we needed to fix. So that’s a focus for us this year.
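As a flavor of the reconciliation checks described above, here is a minimal, hypothetical sketch comparing a row count in a source database against the corresponding lakehouse table; the connection details, table names and the alerting action are all assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Row count in the source system (connection details are hypothetical).
source_count = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql.internal:3306/orders")
    .option("dbtable", "(SELECT COUNT(*) AS n FROM order_header) t")
    .option("user", "reconcile_user")
    .option("password", "reconcile_password")
    .load()
    .first()["n"]
)

# Row count of the corresponding table in the lakehouse.
lake_count = spark.table("lake.retail.order_header").count()

# Real alerting (Slack, PagerDuty, ...) would go here; a plain error for the sketch.
if source_count != lake_count:
    raise RuntimeError(
        f"order_header out of sync: source={source_count}, lake={lake_count}")
```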

Training is very critical for us. Good features are no good if users don’t really understand how to use them. Training was something we didn’t prioritize a lot in our 20-month journey, but this year we are actually prioritizing it. And then, as I mentioned there, exploring operational use cases. The reality is we’re already running operational use cases on Dremio, but we’re looking at more streaming this year to prepare us to move into the next phase for this platform. That’s pretty much it on our journey.
