Driving Better Analytics Using Cloud Data Lakes

Session Abstract

HyreCar Inc. (NASDAQ: HYRE) is a leading national carsharing marketplace for ridesharing, food, and package delivery via its proprietary technology platform. The Company has established a leading presence in Mobility as a Service (MaaS) and achieved incredible revenue growth over the years; full-year 2020 revenue increased 59% compared with 2019.

What holds the keys to unlocking our business success? Certainly, data plays an important role. In this talk, we want to share our success story as a cloud data lake pioneer, and the data ecosystem we built that empowers the rapid growth of the business.

We will start by sharing how and why we decided to move from on-prem to the cloud, and how we reduced data warehouse loads and costs by leveraging Dremio on data lake storage. Then we will talk about the cloud data lake architecture we built, setting it up layer by layer and leveraging tools and technologies such as Dremio, Databricks, Superset, and others. In addition, we want to share how this modernized our data analytics and made them more streamlined and scalable.

Video Transcript

Kim:    With that, I’d like to welcome our next speaker, Ken Grimes, the CTO of HyreCar. Ken, over to you.

Ken Grimes:    Thanks so much, Kim. So, like Kim says, my name is Ken. I'm the CTO at HyreCar. And today I'm going to be talking to you guys a little bit about how we were able to migrate on-prem analytics to the cloud with a set of open-source tools that we're going [00:00:30] to be going over in the talk today. I'm mostly going to be focusing on the story of how we went from on-prem analytics to the cloud, the challenges that we faced in doing so, and how that technology helped us overcome those challenges. So just a little bit to go over for this talk: we're going to cover a little bit about what HyreCar is, how we got to where we are today, and what kind of challenges [00:01:00] our data really presented us with when we wanted to make this transition. We're also going to be going over what our new ecosystem looks like.

And some of the utility of that data infrastructure in helping our analytics team achieve success. But just a little overview of what HyreCar is: HyreCar has been around for about five years now. We are a mid-stage company at this point. [00:01:30] Over the last three years or so, HyreCar has seen a significant amount of growth. And with the increased usage of our platform, we've had a much more sophisticated ask from our customers on both sides of our marketplace for greater sophistication of technology. And as our company scales, we also have greater regulatory requirements to adhere to.

[00:02:00] So just a little recap of where HyreCar really fits into the market. We are a mobility-as-a-service company; we serve a two-sided marketplace, both to drivers looking to find a car and to owners looking to make extra income on their vehicles. And increasingly we are pushing more into the commercial and business space for the owner side of our platform. Over the last year, we've seen [00:02:30] about a 60% increase in our year-over-year revenue, and we're expecting to see significant growth this year. The regulatory requirements that I discussed are also coming to the fore this year, and this was one of the main objectives that we had in revamping our data ecosystem: to ensure that we're able to get accurate reporting, both for auditors and for internal decision making.

So these [00:03:00] are the main objectives that we had when we wanted to move to a cloud infrastructure. We wanted to reduce our costs. Obviously with our old system, everything was on-prem. Many of our analysts were working off of their own laptops or personal warehouses, so there was a lot to overcome when it came to aggregating all of this and keeping it all in one place. [00:03:30] Security is obviously a major issue for anybody who has experience working with on-prem data infrastructures. Luckily, with a startup like ours, we didn't have the major infrastructure migrations that larger companies might face, but we did have a lot of our own challenges to overcome, particularly in the security space, where access to our databases wasn't very well regulated initially. As a young company, that [00:04:00] is something I assume many of you have experienced: not really putting security first. But when survival is no longer the primary objective, it is time to ensure that you aren't at risk from outside attackers, or really just from loose controls on the data going out, particularly when it comes to reporting and financials.

But despite the security, we also needed to ensure [00:04:30] that the accessibility of our data remained paramount. Much of the resistance that we saw in adopting these new cloud tools was really just about changing mindsets and having to retrain people who were used to self-serving data in an insecure way, without disrupting their workflow. So accessibility was very, very important to this. We also needed to minimize the risk of data loss. [00:05:00] Historically, we had pretty significant holes in some of our data, and we had some issues with the integrity of the data that we were putting out, with multiple sources of truth across different warehouses and databases. As that data is being moved around in a manual fashion, you can imagine a great deal of it gets lost. And then there are just holes in the reporting, and data going out to people that it shouldn't. So those were [00:05:30] some other important objectives for us.

There was also accelerated deployment and optimized efficiency. What we were really looking for with cloud deployment was to shorten the turnaround time on analytics and reporting. In our legacy infrastructure we would often be working with out-of-date reports, data that needed to be refreshed that was maybe six months old, very regularly, nothing automated [00:06:00] whatsoever, everything running out of Excel spreadsheets, which I assume sounds somewhat familiar to some of you. And then finally, we wanted to be able to improve the collaborative abilities of our analysts and our data engineers and our data scientists, which was a difficulty with the on-prem solutions that we had in place, because there was no ability to share your live work. We had a pretty good [00:06:30] ecosystem for our developers to be able to code together, work together on the same projects, and obviously have the same code repos.

And we wanted to give some of that capability to the analytics on the backend of our system as well. So this was a rather ambitious project. I think we were able to get away with it in part because of the youth of the company, but we also had to tackle a great deal of [00:07:00] resistance on each of these individual issues, which I'll be going over shortly. So let's do a quick overview of the challenges that we faced. Not only did we have difficulties that we needed to overcome for our analysts, for our data engineers, and scientists, but also predominantly for our stakeholders, who had very much been used to having to fetch their own data, [00:07:30] do their own work, create their own analyses, and really just put the elbow grease into the manual system and get results.

So the first issue to overcome here was the financial cost of migrating to the cloud. Obviously when you’re aggregating things, you can create a large cost center that’s easily visible and presents a very large number compared to [00:08:00] the distributed cost of having everybody run their own analytics in each department where the line items seem smaller, but the overall cost is much more significant. So we took an approach of creating a detailed roadmap, showing the migration of the old system, the benefits that we would have in the cloud and launched in a phase-by-phase approach so that we didn’t immediately [00:08:30] provide our finance team the shell shock of a huge ticket item that seemed to come out of the blue. The main thing that we did with this was initially on the technical side, we started by launching our data messaging backbone on Kafka, which is our pub sub mechanism and our message broker for the new data ecosystem that we have in the cloud.
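
To make the messaging-backbone idea concrete, here is a minimal sketch of publishing a domain event to Kafka from a backend service. It assumes the confluent-kafka Python client; the broker address, topic name (`rental-events`), and event fields are hypothetical placeholders, not HyreCar's actual schema.

```python
import json
from datetime import datetime, timezone

from confluent_kafka import Producer  # pip install confluent-kafka

# Hypothetical broker; the real address would come from configuration.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_event(topic: str, key: str, payload: dict) -> None:
    """Serialize an event as JSON and publish it to the messaging backbone."""
    producer.produce(topic, key=key, value=json.dumps(payload).encode("utf-8"))
    producer.flush()  # block until the broker acknowledges the message

# Example: a backend service records that a rental was booked.
publish_event(
    "rental-events",
    key="rental-123",
    payload={
        "event_type": "RentalBooked",
        "rental_id": "rental-123",
        "driver_id": "driver-456",
        "occurred_at": datetime.now(timezone.utc).isoformat(),
    },
)
```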

The incremental cost of Kafka [00:09:00] was relatively small initially and ramped up only slowly over time as adoption of the new tools came online. That cost grew slowly over time. And we also found through that migration that by moving the message brokering to Kafka, we were able to control costs and keep them much lower than we had been able to do with more integrated solutions [00:09:30] from some of the cloud providers out there. So this actually wound up being, at the end of the day, a cost reduction for us for moving data around, as we were able to eliminate a lot of the canned connector tools out there, say Stitch or Fivetran or whatever. We still use tools like that for rapid ETL, but by not having to rely on them for all of our data transmission, we were able to reduce costs significantly for our data ecosystem. [00:10:00] Then there was the skill shortage that we had. HyreCar, as a young company, sometimes struggles to get the skills that it needs across the business and in technology, which these days is a complaint, I think, for everyone.

But what we really focused on over the course of the year that it took to get this done was hiring for talent and domain expertise in cloud analytics, [00:10:30] particularly in event-driven systems and one novel technique that we are getting more and more familiar with, called event sourcing. Internally, we spent a great deal of time building a culture of learning among the rest of the business units. This was something that initially took a little bit of a push to get started; trying to take people away from their very busy days is not [00:11:00] the easiest thing to do in the world, but we were able to really get people excited about it as we started to show the power of analytics in real time, as events were moving through Kafka into some of the monitoring tools that we had set up in the initial phase.

And now that we're more established, the reporting mechanisms we have and the faculties we have with Tableau have been [00:11:30] dramatically increased with the new data ecosystem. So the adoption has built a lot of momentum, and by ensuring that everyone is willing to learn and use the new tools, we were able to show off a lot of the value that our individual business units, particularly in sales and marketing, can derive from the system really clearly and easily. But I want to stress that it took a lot of [00:12:00] up-front discussion with the right people who make the decisions in each of these departments to ensure that their teams were on board and looking forward to not having to do their own work on analyses, and could actually have a team and a set of tools that they could rely on to obviate a great deal of their week-to-week work, which in our estimation [00:12:30] was taking our managers, depending on which position they were in, somewhere between 20 and 50% of their man-hours total.

And then finally, there was a great deal of adoption resistance that we needed to overcome. Change resistance in general is always a thing, but we were able to secure some buy-in from our executive team, particularly our CEO. Once we got the ball rolling and were able to get him [00:13:00] on board, he was pushing from the top down the importance of a data ecosystem where we can enforce our data integrity and the quality of the data that we're putting out, and also the speed of the reporting that we can achieve. And then finally, we had to spend a good deal of time one-on-one, just getting into the weeds with the managers and business units at the company, to train them on [00:13:30] new tools, how to access the data in the new ecosystem, and how to break their reliance on the kind of tried-and-true manual way that they had initially learned to interact with our data.

So those issues were, of course, difficult to overcome, but I think at the end of the day the biggest problems were mostly people issues once we had the technology in place. [00:14:00] And that's where I think a lot of times, when we get into the weeds on implementing the right system for people, we can forget about the very important side of the soft skills and getting everybody to understand how this is going to be a benefit for them. So we're going to go over a little bit about how we were able to do that, after I discuss this maybe-a-bit-too-detailed chart of the data ecosystem that we put in place and how [00:14:30] we migrated away from a centralized data warehouse, really just running on a combination of MySQL and manually pulled reports.

So in our new ecosystem, we actually tried to visualize our backend for our analysts and the business as three main silos. The first being data generation: this would be your API, your interactions with your clients, and integrated services [00:15:00] such as our payment system, identity provider, lead systems, and telematics, which is very important in the automotive space, as you can imagine. So we used an open-source technology called Snowplow to start gathering analytics from our front-end applications. There are other non-open-source, proprietary services in that space. We still also kind [00:15:30] of make use of Segment in some places, which has some similar functions, but with Snowplow we were able to just pipe structured data about the interactions with our applications directly into our Kafka pipeline, and from there into our data warehouse, ensuring that we had really, really high-quality analytics over the interactions of all of our customers [00:16:00] with very minimal effort.
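
As a rough sketch of what consuming those front-end interaction events might look like once they land on the Kafka pipeline, the snippet below reads and prints them with the confluent-kafka client. The topic name and event fields are invented stand-ins for the structured events a Snowplow-style tracker would emit, not the actual schema.

```python
import json

from confluent_kafka import Consumer  # pip install confluent-kafka

# Hypothetical topic carrying front-end interaction events from the tracker.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "frontend-analytics-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["frontend-interactions"])

try:
    while True:
        msg = consumer.poll(1.0)          # wait up to 1 second for a message
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())   # e.g. {"event": "page_view", "page": "/cars"}
        print(event.get("event"), event.get("page"))
finally:
    consumer.close()
```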

And then we also embedded Kafka into our backend beneath the API layer. Kafka actually serves as a persistence layer for us with this event sourcing technique, where we no longer really care about the databases that we use, at least in the initial designs that we do. And we kind of defer the modeling [00:16:30] of the transactions that happen on our platform until later on in the analytics process; or, if the front ends need to be able to query that information, they can reconstruct entities directly from those events. So we went to this model, and because of our upcoming compliance needs, having a transactional log for everything that happens on our system really ensures that [00:17:00] at no point are we going to have to go back and backfill data. Really, it's just a matter of modeling what we've already got, and the models that our customers interact with, say for their car entities on the platform or the rentals that they have on the platform, can be constructed on the fly from the events as well.
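
As a rough illustration of the event sourcing idea described here, the sketch below rebuilds the current state of a rental entity by folding over its ordered event stream. The event names and fields are invented for illustration and are not HyreCar's actual domain model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Rental:
    """Current state of a rental, derived purely from its event history."""
    rental_id: str
    status: str = "created"
    history: List[str] = field(default_factory=list)

def apply(state: Rental, event: Dict[str, Any]) -> Rental:
    """Apply one event to the current state (the 'fold' step of event sourcing)."""
    kind = event["event_type"]
    if kind == "RentalBooked":
        state.status = "booked"
    elif kind == "CarPickedUp":
        state.status = "active"
    elif kind == "RentalCompleted":
        state.status = "completed"
    state.history.append(kind)
    return state

def replay(rental_id: str, events: List[Dict[str, Any]]) -> Rental:
    """Reconstruct an entity on the fly from its ordered event stream."""
    state = Rental(rental_id=rental_id)
    for event in events:
        state = apply(state, event)
    return state

# Example event stream as it might be read back from the log.
events = [
    {"event_type": "RentalBooked"},
    {"event_type": "CarPickedUp"},
    {"event_type": "RentalCompleted"},
]
print(replay("rental-123", events))  # Rental(... status='completed' ...)
```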

So that actually freed up a lot of implementation resources, because our data engineering team, as a young company, [00:17:30] is relatively small. We only had two or three data engineers during the implementation of this, and they were able to free up a lot of their time on the data generation and application side. They really didn't have to offer very much help to our developers, and they were able to focus much more on putting together the data lake and the analytics system that we have [00:18:00] on top of that, to really empower our analysts to do their best work. So then there are the intermediary data sources we have on here. We use Confluent Cloud to provide managed hosting for Kafka for us. This is where we also build our own custom connectors, but Confluent provides that as a service as well.

So if you really need to get data piped in and out of the system, you have the opportunity to leverage [00:18:30] those canned tools like Segment or Fivetran or Stitch or whatever else is out there, but you also have the opportunity to build something yourself that isn't going to have the associated usage costs, where you're really just paying for the messaging fees, which are much, much cheaper than the service fees for maintaining the connectors. So Confluent has been a major partner for us in getting this new [00:19:00] integration up and running and making sure that our analytics are as real-time as possible. And from there, we move into our data warehouse. We're using a technology called Delta Lake, and we're very, very heavily influenced by Databricks and their community. With Delta Lake, we essentially pipe all of our raw data into what are called bronze tables in that nomenclature.
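
For readers unfamiliar with the bronze-table pattern, here is a minimal PySpark Structured Streaming sketch of landing raw Kafka messages in a bronze Delta table. It assumes a Databricks-like environment where a Spark session and the Delta format are available; the broker, topic, storage paths, and column choices are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# On Databricks a SparkSession already exists; elsewhere the delta-spark
# package would need to be configured on this session.
spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

# Read the raw event stream straight off Kafka (hypothetical broker/topic).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "rental-events")
    .option("startingOffsets", "earliest")
    .load()
)

# Keep the payload as-is: bronze tables hold raw, unmodeled data.
bronze = raw.select(
    col("key").cast("string"),
    col("value").cast("string").alias("payload"),
    col("topic"),
    col("timestamp"),
)

# Append continuously into a bronze Delta table.
(
    bronze.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/datalake/_checkpoints/bronze_rental_events")
    .outputMode("append")
    .start("/mnt/datalake/bronze/rental_events")
)
```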

Delta Lake is very similar to the Parquet file format, [00:19:30] but it also supports streaming. And the recommended organization for this, at least at a very high level, is to organize into three tiers of tables: bronze, as I discussed, being the raw data that we're pushing in, which you can also source directly out of Confluent Cloud with ksqlDB; and then the silver tables, the cleaned, massaged data that we can actually use for multiple different analyses, [00:20:00] are the next layer of that. We essentially use Databricks in conjunction with Dremio and occasional manual data engineering to get data into the bronze tables and silver tables. With Databricks, we were able to accomplish a lot of the goals that we had gone over previously, particularly in the collaboration aspect. With [00:20:30] Databricks, you have access to notebooks that can be shared live between analysts that are working together on the same tasks.

It makes it very easy for them to perform peer reviews on each other's work. There are very clear inputs and outputs: taking in the raw data from the bronze tables; mixing it, cleaning it, enriching it; and putting it back into cleaned-up silver tables that become our verified data sources, which go into our data dictionary to be used for [00:21:00] further analyses and feature data that we put out to our customers, both internal and external. So Databricks, in addition to being a managed Spark host, which really gives you access to the compute necessary to run large-scale analytics, although that can get a little bit more expensive, also comes with the new capabilities that you get from being able to do that in the first [00:21:30] place. So we're able to actually create analyses, usually with a two- or three-hour turnaround now, that may involve pulling in data sets from all across the business.
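
As a sketch of the bronze-to-silver step described above, the snippet below parses the raw JSON payload, drops malformed and duplicate records, and writes a cleaned silver Delta table. The schema, paths, and column names are assumptions for illustration, not the actual verified data sources.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("silver-build").getOrCreate()

# Hypothetical schema for the raw rental events landed in the bronze table.
event_schema = StructType([
    StructField("event_type", StringType()),
    StructField("rental_id", StringType()),
    StructField("driver_id", StringType()),
    StructField("occurred_at", TimestampType()),
])

bronze = spark.read.format("delta").load("/mnt/datalake/bronze/rental_events")

silver = (
    bronze.withColumn("event", from_json(col("payload"), event_schema))
    .select("event.*")
    .where(col("rental_id").isNotNull())                          # drop malformed records
    .dropDuplicates(["rental_id", "event_type", "occurred_at"])   # de-dupe replayed events
)

# Overwrite the cleaned, analysis-ready silver table.
(
    silver.write.format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/silver/rental_events")
)
```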

This is rapid deployment versus what we had looked at in the past, where major analyses would take a minimum of a month or two and could sometimes run up to a quarter or two. So being able to turn that down to a maximum one-week [00:22:00] turnaround on major analyses has really, really accelerated our ability to get work done and to build on top of the work that we've done in the past by organizing our tables under those verified data sources. And then finally, we export most of our internal reports at this point to Tableau, which we use as a visualization layer. Rather than really relying on Tableau Prep, if you guys have used that in the [00:22:30] past, we do most of the prep work within Databricks and use Tableau to visualize and make accessible the output for our business units.

We've had a great deal of success, I think, in getting adoption now that this whole ecosystem is in place. And I mean, I couldn't be happier with the feedback that we've gotten from the company on what we've been able to achieve. So just a little bit about [00:23:00] how this infrastructure really specifically helped our analytics team. Once this infrastructure was in place, our analytics team was able to completely supplant all of the historical reporting, manual or otherwise, that had been done at the company over the preceding four years.

That gives you kind of an idea of how much work you can accomplish. And this was a very small analytics team of [00:23:30] really just two or three analysts at a time taking this on. And as I said, in just two very short months, we were able to bring all of the company's reporting into a single place, organize all of that data in a clear structure that all analysts can use and benefit from, and start surfacing that data to the business in ways that can be easily consumed and are automatically populated [00:24:00] and refreshed on an ongoing basis, a real-time basis in many cases.

So the analytics team really saw the benefits here, because with all the data integrated into the same place, obviously this satisfied our accessibility needs. Dremio really came into play there, stitching together a lot of the different legacy data sources. While we were trying to push all of our new events into Kafka, there's still a lot [00:24:30] of legacy data that we ETL in from other sources, living in a bunch of different databases from MySQL to Postgres to Redshift. So we've got a lot of different intermediary data sources there that need to be pulled into that analytic system. And by using Dremio as sort of that abstraction layer between our analytics system and the developer-focused intermediary data sources, where persistence [00:25:00] initially happens, we were able to give our analysts the tools to source data for these validated data sources without really having to interface too much with the way that our developers are constantly changing things around within the ecosystem. So this gave them a lot of flexibility to just separate their concerns and move really quickly in their own swim lane.
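
To illustrate how an analyst might pull from that abstraction layer, here is a sketch of querying Dremio over Arrow Flight from Python. The endpoint, credentials, source names, and SQL are placeholders; the same kind of query could just as easily be run from the Dremio UI or an ODBC/JDBC connection.

```python
from pyarrow import flight  # pip install pyarrow

# Hypothetical Dremio coordinator endpoint and credentials.
client = flight.FlightClient("grpc+tcp://dremio.internal:32010")
token = client.authenticate_basic_token("analyst", "secret")
options = flight.FlightCallOptions(headers=[token])

# One query over two legacy sources; source and table names are made up.
sql = """
    SELECT r.rental_id, r.driver_id, o.owner_id, o.market
    FROM postgres_prod.public.rentals AS r
    JOIN mysql_legacy.hyrecar.owners AS o ON r.owner_id = o.id
"""

info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
table = reader.read_all()            # an Arrow table ready for pandas, etc.
print(table.num_rows, "rows joined across MySQL and Postgres via Dremio")
```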

So this also gave us the ability to analyze [00:25:30] any kind of data that we had in the system, whether it's structured or unstructured; we're using Mongo for unstructured data, and I think we might start using Cassandra soon. But this also enabled much faster decision-making, and the analytics for our environment really just had far better transparency and, as I mentioned, with those notebooks in Databricks, much better [00:26:00] collaboration. But I would also say that with the centralization of this data, we also de-siloed the knowledge at the business level, which allowed a breakdown of the silos between our different departments, which was an unforeseen windfall of this. Once we were able to launch this system, the collaboration of our analysts and the sort [00:26:30] of open-sourcing of the knowledge at the company really led to a breakdown of the defensive silos at the management level in the company, because everyone is now speaking the same language.

And we have essentially a shared vernacular for how we talk about each component of the business. So this shared language that emerged from this was incredibly powerful for the collaboration [00:27:00] of our business units working together, which has had a great deal of positive impact on our portfolio management as well. And then finally, visualizing the data is so much easier now. It's very easy for us to show the outcomes of the work that we do across the business, from product to operations. It makes it very easy and very explainable which changes we make produce which impacts [00:27:30] that affect our very important operational metrics. So I think that gives a pretty strong overview. And I thank you guys for listening to the talk today. I think now we've got a few minutes to open it up for Q and A, and I'm also available on our Slack channel for anything that spills over.

Kim:    Great. Thanks, Ken. That was great. If you do have a question, just use the button in the upper right-hand corner of your screen to share your audio and video, and you'll be put into a queue. And if for some reason [00:28:00] you do have trouble sharing your audio, then you can just ask the question in the chat. I'm going to take one question now, and then I'll switch to the questions in the chat. [Shisrini 00:28:10], I am giving you access. While we're giving it just a moment for [Shisrini 00:28:22] to come on, we do have a question in the chat: are you using MongoDB with Dremio? Is that what you said?

Ken Grimes:    Yes, we [00:28:30] are. There are some limitations to that. You can really only get one level of depth back out of Dremio; it'll actually just kind of flatten the response into a single level of depth into the object in Mongo. So we can use Dremio to fetch data out of Mongo, and we do most of the time, but in some cases you do have to connect directly to Dremio, or excuse me, to Mongo, and use their query language to get more complex queries [00:29:00] out of there. We also have a few pipelines that fetch data out of Mongo and then flatten it out into the warehouse, but our data engineers handle that with individual, essentially special-purpose deployments to construct some tables in the data lake.
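
As a rough sketch of what those flattening pipelines do, the helper below collapses a nested Mongo-style document into a single level of depth before it's loaded into the warehouse. The document shape and key naming are hypothetical.

```python
from typing import Any, Dict

def flatten(doc: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Flatten nested dictionaries into one level, joining keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

# Example nested document as it might come out of Mongo.
doc = {
    "_id": "abc123",
    "driver": {"id": "driver-456", "license": {"state": "CA", "verified": True}},
    "vehicle": {"vin": "1HGCM82633A004352"},
}
print(flatten(doc))
# {'_id': 'abc123', 'driver_id': 'driver-456', 'driver_license_state': 'CA',
#  'driver_license_verified': True, 'vehicle_vin': '1HGCM82633A004352'}
```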

Kim:    Great. Thank you. Another question. Are you planning to release any packages for Rideshare platforms?

Ken Grimes:    Yes. So this is [00:29:30] actually an interesting question for us, because we just started partnering with ThoughtWorks, who are one of the leaders in doing event sourcing architectures. We have found that this kind of architecture is not really pervasive within our space, and there's really an opportunity there for us to build a framework that others could work on top of, one that combines sort of an opinionated, prescriptive approach to doing a domain-driven design [00:30:00] implementation with CQRS and event sourcing. So we are experimenting with potentially starting an open source project for that. And I hope that's the kind of product you're talking about there, but if not, please clarify, and I can go on more on the product side of things. But yeah, we do have some plans to do that.

Kim:    Perfect. Before we cut off, I know that there are a couple of additional questions, but that is all the questions that we have time for. [00:30:30] If we did not get to your question or you have an additional one, you'll have the opportunity to ask it in Ken's channel in the Subsurface Slack before you leave. We'd appreciate it if you would fill out the super short Slido survey in the right-hand corner of the screen. The next sessions are coming up in just about five minutes, and the expo hall is also open. I encourage you to check out the booths to get a demo of the latest tech and win some awesome giveaways. Thanks so much everyone, and enjoy the rest of the conference. Thanks, Ken.

Ken Grimes:    Thanks guys.