Bazaar’s Data Platform

This session will provide a look into Bazaar’s data journey, including how we are embracing open source data lakehouse and data mesh approaches and empowering our teams to create data-centric products to solve some of the toughest challenges in the Pakistani retail industry by marrying data and AI. The session will briefly touch upon the open source stack we used to build our platforms and some of the tools and approaches we are really excited about.

Topics Covered

Data Mesh

Real-world implementation

Sign up to watch all Subsurface 2023 sessions

Speaker

Umair Abro

Engineering Coach (Data & ML Engineering)

Transcript

Note: This transcript was created using speech recognition software. It may contain errors.

Umair Abro:

Welcome to this session. I’m Umair. I’m currently an engineering manager at Bazaar Technologies, which is a startup here in Pakistan that works across the retail sector. A brief agenda of what I’m trying to cover today: I will briefly introduce my company, what we do, what kind of business lines we are in, and how we actually use data to make sense of our day-to-day operations. I will also share some details about our internal data platform, which we call Buraq. I will also touch a bit on Greenside, which is another platform we’re building on top of our data platform. It’s geared more towards machine learning and AI. Cool. So let’s get started. Again, let me briefly go through Bazaar Technologies.

We call ourselves the operating system for retail here at Bazaar. We work mainly in the B2B sector, where we serve a lot of customers who are not the direct consumers of the products. They are basically small and medium businesses, which serve Pakistan’s economy in some way. This could be a small kiryana store or a mom-and-pop shop, which sells products like Pepsi and Coca-Cola, or a supermarket kind of setting. It could also be a small factory, which needs raw materials and all that stuff. So let me jump into our main verticals. We basically operate in three verticals in Bazaar’s universe. One of them is retail, which you can say is staple products, or products from Nestlé and P&G.

We also have another vertical where we serve industries like factories, mills, flour mills, all that stuff, and we provide them with raw materials. There’s a third vertical in our startup where we focus on providing fintech credit and lending opportunities to our customers, where we help them grow their business by providing them lending on the stuff they’re buying from us or the stuff they need to grow their businesses. Primarily what we do is, sorry about that. So, primarily, we have around 1 million businesses that we are directly serving, which is roughly around 50 plus thousand cities here in Pakistan. We’re also serving 200 plus partners, which are the likes of Coke, Pepsi, Dashon, and all that stuff. Moving forward, we have these multiple platforms and apps in each of our verticals, which serve the end customers.

So we are trying to be digital through and through for them. This is our industry portal, where we help these small factories or industries buy raw materials from us. And we also have a fintech app, which we use to serve our customers with coins and loyalty points, and also provide them credit scores and credit lines. Let me briefly touch upon the data side of this. At Bazaar, we are really data driven. We really focus on data for our day-to-day decisions, be that the engineering team, the design team, the product team, or the operations team; they’re all really data driven. So to build the platform, we also had to go into that mindset, where we were thinking as a product team first, before a platform team. So I’m just sharing some of the product personas that we use to build our platform.

So we have Taha. Taha is a data engineer whose day-to-day life circles around making pipelines and models for the business requirements. He really cares about, and is really worried about, schema changes in the base tables, which can break his pipelines and cause his data models to fail. He is really frustrated by the lineage problem, when a table changes and all of his downstream pipelines fail. The second persona we serve is also directly related to data personnel: it’s Maryam. Maryam is a business analyst. What she cares about is the KPIs that are created using data, and the dashboards. She’s really frustrated by the understating or overstating of numbers and by the data refresh. We also have this persona for Amr.

So Amr is a product engineer who creates products for his customers or his product line, and tries to solve problems with data, or incorporates data pipelines and user-facing analytics in his apps. He is really focused on solving problems and really cares about user experience. Finally, we have Ash. Ash is a business manager who really sees and owns the end-to-end operations of our deliveries. He’s really worried about asset visibility: what kind of stock does he have in his warehouse? Where is it going? How is it being transferred from one place to another? He needs live data visibility of everything he needs. He really cares about making informed decisions on his end; otherwise there is a business implication and a loss of precious merchandise.

As a data team, we also care about what kind of use cases we’re going to cover, and what we have. So how we started creating this platform was to first put on our product hats and really figure out who our customers were. Then we tried to figure out, okay, what do we want to do and what do we have? So we have this thought experiment; we call it the four Ds of dangerous data, just because there’s a lot that can go wrong in each stage. We divide all of our assets into four particular orders. So we have data sources at the start. These are all our apps, the behavioral data coming from them, all the RDS instances, or all the other RDBMS databases, flat files, you name it.

And then, if you just look at the end, we have data consumption, where we want it to be. When we started this platform, what we initially wanted to target in the use case domain was, okay, we want to have some sort of BI and analytics tools and technologies so we can do effective reporting. What else we really wanted to achieve was this approach where we could create data-centric products. That could be our end rider apps, that could be our warehouse management, and all that stuff. Plus, what we also wanted to achieve with this particular practice was to enable our end teams to incorporate some type of intelligence in their apps; that can be recommendations, pricing recommendations, and all that stuff. Finally, we have this particular idea where we wanted to open up the platform to be used in different teams and domains and give them domain-specific analytics. For example, we could show some kind of performance and marketing metrics for our partners, which we actually do, and it is a highly profitable and fun place to be.

Let me go towards what we had when we started building this platform. We had around 10 client apps when we started. These were our in-house apps and our customer apps. Back then we had around a thousand plus tables and documents in the form of RDS and MongoDB, and around a hundred plus microservices. We also had this Kafka cluster with 25 plus topics of events, which were shared across all the microservices. And what we wanted to do was also capture that in our data platform, so we could do some kind of real-time analytics for some use cases. All in all, in any given hour at Bazaar, we process around 40 plus TB of data in different permutations to complete our orders, fulfill our operations, and whatnot.

So keeping that in mind, we embarked on this journey to create our own data platform. We called it Buraq. Bazaar is based in Pakistan, and it’s a Muslim-majority country. For Muslims, Buraq is an animal which the Prophet used to travel at lightning speed. So we named our platform after it because of the speed at which we were bringing in our data, analytics, and ML. Bazaar is also a relatively young company, so we didn’t have all that legacy of data warehouses and other types of data solutions. So we started with a clean slate, which meant we could go directly to the new mainstream technologies that were coming up. So we adopted two mindsets, the lakehouse and the data mesh. We really wanted to create an ecosystem where the data is not being kept in a single vendor-locked-in state.

We wanted to make it open and accessible for all. And there was another challenge, where we wanted the central data platform team to just create the platform and focus on the passive standard for the platform, and help distribute the ownership of data towards the individual teams who generate that data. That way the central team could focus on enabling and evolving the platform, rather than having to also do data engineering and become the bottleneck in the system. Some other things that the engineering side really cared about: we wanted to have a Delta architecture, which meant that we wanted to do both real-time and batch jobs according to our use case. We didn’t want to do real time for everything, because it didn’t make sense to us.

We also cared about how we would roll back our bad data, because we were going with a data-lake-first approach, where there is no ACID database and you have to rely on an open table format. So you really have to think about how you would roll back your bad transformations, which could be expensive in a lot of cases. The third thing was we didn’t want to go with proprietary software. We really wanted to keep ourselves in a space where we were free to choose what works and use that, instead of choosing a solution and making it work. So we tried to avoid vendor lock-in wherever possible. We also cared that our solution has to be open source, and it has to work with Kubernetes, because we had this vision that we are going to build everything on the open container format rather than using dedicated Hadoop clusters and the legacy data platform strategies.

So we really wanted everything to be in the Kubernetes space. The fourth thing was we really wanted to make it cost effective, because Bazaar was a relatively young startup, and cost really matters for us. And we had this mentality from the beginning to be frugal. We started in the COVID times, and it really shaped our mentality to be cost effective wherever we can, without compromising on the user experience. So after going through all that, the first decision we had to make was choosing a table format. What we figured out is that you can’t go wrong choosing any one of these: Iceberg, Delta, and Hudi are equally good in their respective ways. It’s just about what kind of problems you are willing to solve, and your team dynamics and expertise. We really cared about ACID transactions.

So we really wanted to make sure that our data is accurate. And there were already a lot of data analysts who were using the RDS instances. So we wanted a smooth transition, and didn’t want to make them go through the complexities of Parquet files: how to combine a Parquet file, what if you overwrite a Parquet file? So we wanted our tables to be ACID compliant. We were also looking for something where we could do efficient bulk loading, because we didn’t have any legacy, so we were going to load a lot of data into our lakehouse. We were also looking for some kind of managed utilities, where we don’t have to write a lot of code, because when we were starting, we were just two engineers working on the problem, and we already had to serve our users.

Partition evolution was another thing that we really cared about. And time travel is a must when you’re focused on the open lakehouse formats, because at any given time, when you want to see what the state of your lakehouse was two days back or an hour back, it’s really helpful for that type of stuff. And finally, we wanted something that already had built-in data quality stuff, like checking for nulls, figuring out duplicates, having some kind of validation tests. So we ended up using Apache Hudi, but people can use anything and it would provide the same result. You just have to choose what kind of problems you will be solving for any given table format. There was another decision once we were ready with our table format.
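To make the time travel and rollback requirements concrete, here is a minimal toy model in Python. It is not Hudi's API or storage layout; the `Timeline` class, its method names, and the sample instants are all hypothetical, and each commit simply keeps a full snapshot so the idea stays easy to follow.

```python
from bisect import bisect_right


class Timeline:
    """Toy commit timeline (hypothetical): every commit stores a full snapshot."""

    def __init__(self):
        self._instants = []   # commit instants, kept in increasing order
        self._snapshots = []  # table state after each commit

    def commit(self, instant, upserts):
        """Apply upserts (record key -> value) and record the new table state."""
        if self._instants and instant <= self._instants[-1]:
            raise ValueError("commit instants must be strictly increasing")
        state = dict(self._snapshots[-1]) if self._snapshots else {}
        state.update(upserts)
        self._instants.append(instant)
        self._snapshots.append(state)

    def as_of(self, instant):
        """Time travel: table state at the latest commit at or before `instant`."""
        i = bisect_right(self._instants, instant)
        return dict(self._snapshots[i - 1]) if i else {}

    def rollback(self):
        """Drop the latest commit, for example after a bad transformation."""
        if self._instants:
            self._instants.pop()
            self._snapshots.pop()


timeline = Timeline()
timeline.commit("2023-03-01T10:00", {"order-1": "placed"})
timeline.commit("2023-03-01T11:00", {"order-1": "delivered", "order-2": "placed"})
print(timeline.as_of("2023-03-01T10:30"))  # state before the second commit
```

A real table format avoids storing full snapshots by tracking file versions per commit, but the read-as-of and rollback semantics are the part this sketch is meant to show.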

We had to find a way to store this data. What we came up with was the initial Delta Lake approach, where we would have three layers of data, the raw layer, the silver layer, and the gold layer, and would gradually improve the quality of our data as we move it to the next layer. So we also had these three layers. Our first layer was the raw layer, which had a MoR table, which is merge-on-read. It was partitioned daily, and it was running fairly real time with the Hudi streamer utility. We had some form of base-level quality validations over there, and some base governance on which teams can access what. And we also implemented a basic bloom filter there, so it could be easily searched and there’s not a lot of query lag.

And this table was upsert only. So this was a real-time layer for us. After the real-time layer, there’s a batch job running, which updates our second layer, which is also a MoR layer. It’s kind of a holy grail data product layer. We don’t do traditional data modeling, because we are a young startup, and the business dynamics and the definitions of our models were changing really frequently. As we were scaling, new product lines were coming in, new business realities were forming. So we had to keep it really ad hoc, so analysts could shape it really easily. So we had this layer of data products, which defined our base business models. Okay, how many customers are coming in from multiple apps? What kind of apps are they reflecting to?

What kind of orders are they punching in? We created those holy grail data products, which were serving as a model layer for us, but it was really ad hoc. You didn’t have the star schema or the DB two stuff, so that way it could be really easy for the analysts to understand and maintain on their own. It was also partitioned daily, so we could keep small batches and keep it updating. The quality and governance in this layer went up. We check for a lot of things, like what is the average rate of orders being placed in an hour? And if it goes below a certain level or above a certain range, we generate alerts to let the right people know to go in there and look at whether that’s valid or not.
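The order-rate check described above can be sketched in a few lines of Python. This is an illustration only: the function name, the thresholds, and the message format are assumptions, and the talk does not say how the actual check is implemented.

```python
from statistics import mean


def check_order_rate(hourly_counts, low, high):
    """Return an alert message if the average orders per hour leaves [low, high].

    `hourly_counts` is a list of order counts, one per hour (hypothetical input).
    Returns None when the rate is inside the expected range.
    """
    if not hourly_counts:
        return "no order data received"
    rate = mean(hourly_counts)
    if rate < low:
        return f"average orders/hour {rate:.1f} is below the expected minimum {low}"
    if rate > high:
        return f"average orders/hour {rate:.1f} is above the expected maximum {high}"
    return None


alert = check_order_rate([12, 9, 15], low=50, high=500)
if alert:
    print(alert)  # in practice this would go to whoever owns the data product
```

A fixed range keeps the sketch short; a range derived from recent history would be a natural next step.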

We also restricted our governance layer over here, and we started masking a lot of data, so people could just look at what they really need, and we maintain that governance for the entire organization. This layer was insert only, so there were no updates in here, which made this layer work even faster. The final layer, which we call the gold layer, is our final data product, which you can say is the final dataset that can be queried in a Tableau dashboard or a Superset dashboard. And you don’t have to do a lot of joins in these. It’s also partitioned. And the update rate for this particular layer was 30 to 60 minutes, based on the use cases. The idea was mostly that if you need real-time data, you can get it from silver with a 15-minute delay, or if you really need real time, we would just have a query which would run on the raw layer.

The quality and governance on the gold layer was really strict, where we only gave access to certain people who really need that product. This layer is also CoW, which is copy-on-write. So basically, every time this job runs, we overwrite all the data, instead of maintaining multiple states of data like MoR tables. Like I was mentioning, with MoR and CoW, there were some decisions behind why we chose certain tables in certain layers. Copy-on-write tables are really good for, actually, lower query time, with higher data latency. Higher data latency means that whenever your job is running, it would take more time, compared to MoR tables, to write the data into your lakehouse. So we created a mix and match of these tables, based on where we care about the query latency and where we care about the job latency.
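As a rough sketch, the three layers described so far could be expressed as Hudi write options like this. The option keys are standard Hudi datasource settings, but the field names (`order_id`, `event_date`, `updated_at`), the helper function, and the choice of `insert_overwrite_table` to model "overwrite all the data" on the gold layer are assumptions, not details confirmed in the talk.

```python
# Layer settings as described in the talk: raw and silver are merge-on-read,
# gold is copy-on-write. The operation chosen for gold is an assumption.
LAYER_SPECS = {
    "raw": {"table_type": "MERGE_ON_READ", "operation": "upsert", "bloom_index": True},
    "silver": {"table_type": "MERGE_ON_READ", "operation": "insert", "bloom_index": False},
    "gold": {"table_type": "COPY_ON_WRITE", "operation": "insert_overwrite_table", "bloom_index": False},
}


def hudi_write_options(layer, table_name, record_key="order_id",
                       partition_field="event_date", precombine_field="updated_at"):
    """Build a dict of Hudi write options for one layer (field names are hypothetical)."""
    spec = LAYER_SPECS[layer]
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": spec["table_type"],
        "hoodie.datasource.write.operation": spec["operation"],
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if spec["bloom_index"]:
        # The raw layer uses a bloom-filter based index for faster key lookups.
        options["hoodie.index.type"] = "BLOOM"
    return options


print(hudi_write_options("gold", "orders_gold"))
```

With PySpark, a dict like this would typically be passed to a write such as `df.write.format("hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are placeholders.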

So for our job latency, we used MoR tables, like in the raw layer and our silver layer. But for the higher tables, which were being queried by our BI tools, we stick with CoW tables, copy-on-write tables. If you go through the chart and just look at the update cost for the I/O, it’s really high for copy-on-write tables, because it totally rewrites the Parquet files. And if you have a lot of data products, that cost can add up. So we kept it low. Moving forward, after we were really sure what our layers would look like and what kind of open table format we would use, we had to choose a query engine. There are a lot of choices in query engines.

You could use Presto, or Trino like we use, or you can go with Dremio. There are a lot of options, and there are a lot of things that you may consider for your particular lakehouse implementation. For us, we really cared about open source, and we really cared about Trino because of our previous experiences with Trino. How we implemented that was, currently we have three implementations, three different clusters. We call them high-octane, diesel, and hydrogen. So it’s based on gasoline, and these names also represent their efficiency. So high-octane clusters are the p clusters and are used for dashboards and data apps, which really need a low-latency system. We give them a lot of resources, and the configs are set for the minimum query times.

For heavier loads, we use the diesel clusters and hydrogen clusters. So what happens in the process is our clients, like Superset, Tableau, Airflow, and this data service layer we have, which integrates our data platform into other apps, just hit the Trino gateway. Based on the request, it decides which cluster it should route the request to. We are actually currently using Presto Gateway, which was created at Lyft, but eventually we want to move towards a Trino-centric cluster for more efficiency. Moving forward. So after all those decisions, this is how our data platform looked. We have this lakehouse layer with the open table format, Apache Hudi. In our ingest layer, we had this Debezium and Kafka combination, which feeds our real-time data.
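The gateway routing described in this passage can be pictured with a small Python sketch. The cluster names come from the talk, but the rule table, the function, and the split between diesel and hydrogen are assumptions. It keys on the `X-Trino-Source` header that Trino clients send; whether Bazaar's gateway routes on that header is not stated.

```python
# Hypothetical routing table: low-latency clients go to high-octane,
# scheduled pipelines to diesel, everything else to hydrogen.
ROUTING_RULES = {
    "superset": "high-octane",
    "tableau": "high-octane",
    "data-service": "high-octane",
    "airflow": "diesel",
}
DEFAULT_CLUSTER = "hydrogen"


def route_query(headers):
    """Pick a cluster name from the request headers (case-insensitive lookup)."""
    normalized = {key.lower(): value for key, value in headers.items()}
    source = normalized.get("x-trino-source", "").strip().lower()
    return ROUTING_RULES.get(source, DEFAULT_CLUSTER)


print(route_query({"X-Trino-Source": "Superset"}))
print(route_query({"X-Trino-Source": "airflow"}))
```

Keeping the rules in a plain table means a new client can be moved between clusters without touching the routing logic.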

And then we have our Spark jobs, which do the batch lifting for us. And you can see, in the data modeling space, the DaaS layer. So we have this heavy data concept where we provide data as a service to our internal teams, and we create data domains. Those are small data models in the form of datasets or data products, which are maintained here in the lakehouse. And these eventually feed into a lot of client apps, BI tools, and the machine learning pipelines. Moving forward, I will briefly touch upon this part, where we’re actually in heavy development. So when we went through all the base lakehouse implementations, we wanted to go further and really leverage our system and create the ML overhead.

So the idea was we already have the data lake in place, the lakehouse in place, and our data as a service. What we wanted to do with Greenside is build upon it and use the reusable parts from the lakehouse in our ML platform. It’s still experimental, but we have created some APIs which access the data in our lakehouse, run some algorithms on it, and save the model back in the lakehouse. So we also use our lakehouse as our model registry, and have this custom Golang server that serves that to our apps. We’re still thinking about building a lot of custom tooling around it. We’re still figuring out how we want to do the automated A/B engines for our models. So we’re still trying to figure out how we will capture all the events into our lakehouse and create a custom A/B test for all our models that are running in production.

Yep. So all in all, this is how Bazaar’s data stack looks. Apart from what I already discussed, we have a lot of other tools. We basically use Grafana for our data observability, and we also use OpenMetadata for our cataloging and Great Expectations for a lot of data validations. Again, for ML, we are still building out this stack, but some of the models that we created are built in Spark ML and are hosted on our lakehouse. But in the future we want to move towards a more mature project like MLflow or Kubeflow to serve all those models. Yeah, and that’s pretty much it.