

Subsurface Summer 2020

Smart Data Lakes for Predictive and Prescriptive Analytics

Session Abstract

This session features the smart data lake — what it is, and how it’s evolving and revolutionizing cloud and edge analytics. Attendees will see firsthand how Telenav, a leading provider of connected car and location-based services, uses smart data lakes and what we built for data artifacts that require predictive and prescriptive analytics.

Presented By

Kumar Maddali, VP of Search, Data Science and Big Data Platforms, Telenav

Kumar Maddali is the vice president of product development at Telenav, where he leads initiatives in search, data science, and big data platforms. Prior to Telenav, Kumar led the digital transformation and modernization of systems at companies such as InQuira, Stratify, and Intergraph by creating data products, services, and applications, all powered by analytics.


Webinar Transcript

Host

Good afternoon or good evening everybody, depending on where you are in the world. Thank you for joining us for this session. As a reminder, we will have a live Q and A after the presentation. We recommend activating your microphone and camera for the Q and A portion of the session to make it more interactive. So with that, please join me in welcoming our next speaker, Kumar Maddali, VP of Search, Data Science and Big Data Platforms at Telenav. Kumar, over to you.

Kumar Maddali

Thank you. Thank you. Hello everyone. This is Kumar Maddali. Thank you for joining this session. I wish this could have been an in-person session, but due to unprecedented circumstances we are meeting online. Nevertheless, I'm glad we are able to convene. Thank you all for joining again, and thanks to Dremio for giving us this opportunity to share Telenav's journey with smart data lakes. To make this session interactive [inaudible 00:01:03], we created a few polls, and as we go through this presentation you will see them popping up in the poll window. I would appreciate it if you all can be part of it.

Briefly about me: I have 25 years of experience in enterprise search and location-based services, primarily building search services and data platforms powered by analytics in different verticals, such as e-discovery and knowledge management, at companies like InQuira, Stratify, and Intergraph. Besides work, I run. I'm an ardent half marathon runner with a best time under two hours.

Usually I do one or two half marathons every year, but due to the COVID situation, I couldn't do any runs yet this year. Let's see how it goes as we progress. What do we do at Telenav? We provide delightful connected car experiences through our product offerings such as navigation, in-car commerce, and infotainment, including productivity. And this horizon is expanding. Let's dive deep into the presentation. I will give you some context about the domain we are dealing with and take you through this journey. It all starts with the user. The user gets into the vehicle, and a typical user goes through certain activities: going to work, going to a restaurant for dining, sorting things out at the bank or grocery store, going on a road trip for travel, or doing a leisure activity such as going to a park, a theater, fitness, or spots for food.

And the temporal dimension of activity is typically either doing it right now, in the next few minutes to hours, planning ahead for a vacation, or acting when you arrive at a destination; and between weekends, holidays, and weekdays there is variation in what you do and how you do it. If you look at it overall, it's about the user, and we capture some information about the user; when it comes to the vehicle, we collect telemetry from it, and it's all about the spatial and temporal data. And we'll talk about some numbers. We support 90-plus countries across all regions and 40-plus languages, and we have 40 million-plus unique users. This is growing, and I'm going to skip some numbers: 80 billion-plus events per year, which translates to about 220 million events per day.

And about 80 terabytes-plus of new data per year is accumulated. [inaudible 00:03:43] What we have is close to a petabyte. And if you look through the lens of the three Vs — velocity, variety, and volume — as we are collecting telemetry from vehicles, this is expected to grow even more, exponentially. Over the last one to two years, we have seen 3x year-over-year growth when it comes to velocity, and volume-wise it is growing 2x.

And this is, as I said, expected to grow exponentially. What are some of the challenges that made us think about something like a smart data lake, and what has our journey with smart data lakes been? Data is present in silos, often in fixed formats, usually disconnected from the rest of the system, making it difficult to get the right information at the right time to the right stakeholders. That's one of the key challenges. Data is growing across all dimensions, and we need the ability to manage it while controlling cost. We need democratization of the data, by enabling discoverability through APIs and self-service [inaudible 00:04:56]

We also need to provide multicloud support: we are mainly an AWS shop, but we have to support AliCloud in China. In addition, we have an on-prem kind of data processing setup with a lot more regulations and restrictions. Then there are the data formats, which change rapidly and need to be tackled. And we need a data guard to measure quality and compliance, with the ability to run the data guard on demand or on a schedule, so that you can measure quality at any time as needed.
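A data quality guard of this sort can be pictured as a set of rule checks that run on demand or on a schedule. Below is a minimal sketch of the idea; the rule names, thresholds, and record fields are hypothetical, not Telenav's actual implementation:

```python
from datetime import datetime, timezone

# Hypothetical quality rules for incoming telemetry records.
RULES = {
    "has_vehicle_id": lambda r: bool(r.get("vehicle_id")),
    "valid_latitude": lambda r: -90.0 <= r.get("lat", 999) <= 90.0,
    "valid_longitude": lambda r: -180.0 <= r.get("lon", 999) <= 180.0,
    "ts_not_future": lambda r: r.get("ts", 0) <= datetime.now(timezone.utc).timestamp(),
}

def run_data_guard(records):
    """Run every rule over every record; return failure counts per rule."""
    failures = {name: 0 for name in RULES}
    for record in records:
        for name, check in RULES.items():
            if not check(record):
                failures[name] += 1
    return failures

# Invoked on demand here, but could equally be wired to a scheduler
# (e.g., cron or Airflow) to measure quality at any time, as described.
if __name__ == "__main__":
    sample = [{"vehicle_id": "v1", "lat": 37.4, "lon": -122.0, "ts": 0}]
    print(run_data_guard(sample))
```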

Importantly, this is how you make your data smart: contextual and semantic relationships need to be created as you progress through the data life cycle. This is an important piece. What I mean by that: when we talked about context in the previous slide, we saw the temporal aspect of it.

So when you are going to work or returning home on a weekday, the activity you do over the weekend versus when you are on vacation could be different. Take dining as an example of a type of activity. You may dine out in both cases, but how and where you do it would vary significantly. You may do it in a relaxed setting during the weekend or on vacation, or do a quick grab-and-go or fast food during the weekdays. This is where context similarity and the association of context play an important role, and we will talk, as we progress, about how we are making the data smart and how the smart data lake helps with these aspects. Semantic relationships are the same idea. If you stop the car in a parking lot, you may be headed to a restaurant, or to a bank, or to a grocery store.

What you do, the sequence in which you do it, and where you do it — whether in the home area, as we saw in the previous slide, or away from the home area — there are certain associations and relationships that need to be created. That is what I mean by semantic relationships. And what is a smart data lake? What are its characteristics? It should be a single source of truth. It should have the ability to acquire and store any data — structured, unstructured, and semi-structured — at any scale with optimal cost, and it should have a catalog to enable schema-on-read [inaudible 00:07:47], meaning you should have the ability to do transformations at any time down the line, so that you can solve new use cases or create new business models.
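Schema-on-read, as described here, means the lake keeps raw payloads untouched and a schema is applied only when the data is consumed, so new use cases can reinterpret old data. A minimal sketch of the idea, with hypothetical field names and projection:

```python
import json

# Raw events are stored exactly as they arrived (single source of truth).
raw_events = [
    '{"vehicle_id": "v1", "ts": 1596240000, "lat": 37.42, "lon": -122.08, "speed_kph": 42}',
    '{"vehicle_id": "v2", "ts": 1596240060, "lat": 37.80, "lon": -122.27}',
]

def read_with_schema(raw, schema):
    """Apply a schema at read time: project fields, fill defaults for absent ones."""
    record = json.loads(raw)
    return {field: record.get(field, default) for field, default in schema.items()}

# A new use case defined long after ingestion simply brings its own schema.
speed_schema = {"vehicle_id": None, "ts": None, "speed_kph": 0}
for raw in raw_events:
    print(read_with_schema(raw, speed_schema))
```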

In addition to having schema-on-write [inaudible 00:07:58], this enables a lot more flexibility than being rigid. There is democratization, through self-service utilization tools and APIs. There is flexibility to decorate tags and additional knowledge as you progress through the data life cycle, with contextual and semantic relations. And there is governance, with data quality, compliance, and so on. Good — we have defined the smart data lake's characteristics, but what should it enable? It needs to enable the creation of new business models, building data products and services with agility, and cloud and edge analytics. Now let's talk about the components. What are the components of the smart data lake? The data hub. What is a data hub? It's the data store containing the single source of truth, plus multiple semantic stores used to create the data products. We briefly talked about an example of what semantic means, and the context aspect of it as well.

And how do you create the data hub? Through services. Services create data products, and they consume data products. What is a data product? You can also refer to it as an artifact. It's nothing but a profile or a model that gets used for predictions. The profile could be a spatial profile, a user profile, a vehicle profile, or a combination of them. To give an example from the previous slide: a typical user's activity is confined to a specific area. You can frame that as a home area product, and then solve all sorts of interesting use cases, in commerce or beyond. That is one example of how you can frame this data by creating a data product and leveraging it in other areas. A vehicle data product would be a road-safety product, which could be at the granularity of a small area like the home area, or roll up to state, country, and even richer levels.
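One way to picture a "home area" data product: bucket a user's trip endpoints into a coarse grid and take the densest cell as the home area. A minimal sketch under that assumption — the grid size and field names are invented for illustration, not Telenav's actual method:

```python
from collections import Counter

GRID = 0.01  # roughly 1 km grid cells; an arbitrary choice for illustration

def grid_cell(lat, lon):
    """Snap a coordinate to a coarse grid cell."""
    return (round(lat / GRID), round(lon / GRID))

def home_area(trip_endpoints):
    """Derive a 'home area' profile: the grid cell where most trips end."""
    counts = Counter(grid_cell(lat, lon) for lat, lon in trip_endpoints)
    cell, hits = counts.most_common(1)[0]
    return {"cell": cell, "share_of_trips": hits / len(trip_endpoints)}

trips = [(37.421, -122.084), (37.422, -122.085), (37.800, -122.270)]
print(home_area(trips))  # densest cell, plus how dominant it is
```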

Now I'll talk about some of the key flows, or flavors, of data — how it flows through the system and lands in the data hub. By no means is this a comprehensive view of all the flows, but it illustrates how data goes through various stages.

Data comes in the form of service logs, which get collected through Fluentd agents, pass through Fluentd aggregators, and eventually land in Elasticsearch. Then there are the service metrics, which are captured through the Prometheus [inaudible 00:10:51] client libraries, get scraped and processed by the Prometheus server, and go into the time series database.
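For the metrics leg, instrumenting a service with the Prometheus client library looks roughly like the sketch below; the metric names and port are hypothetical. The Prometheus server then scrapes the exposed endpoint and stores the samples in its time series database:

```python
import random
import time

# Requires the official client: pip install prometheus_client
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for a search service.
REQUESTS = Counter("search_requests_total", "Total search requests served")
LATENCY = Histogram("search_latency_seconds", "Search request latency")

def handle_request():
    REQUESTS.inc()
    with LATENCY.time():                         # records how long the block takes
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://host:8000/metrics
    while True:
        handle_request()
```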

If you look at the data hub, regardless of where the data is stored, your single source of truth is always S3. Other forms of data are service events, which usually conform to an event specification — these are typically actions taken based on certain triggers — and the telemetry data coming from vehicles, either sensory information or behavioral data based on the user's interaction with the applications residing in the vehicle. These go through the event hub as events, and some events are configured to go through stream processing and enter the stream processing flow.

Eventually, after decoration — you can see that some semantic enrichment happens as part of stream processing — the data goes to S3, the single source of truth, and also gets folded into Druid for real-time operational analytics, while certain parts go into Redshift, so that downstream consumption via API services is enabled in the R tier. [inaudible 00:12:00] Other parts of the data flowing through the event hub go through batch processing: the data comes to the raw storage, goes through batch processing, and eventually lands in S3, and some parts again go to Redshift for downstream consumption. And the road network data, which is spatial POI (point of interest) data, and other data feeds such as weather come through the transactional systems, get processed through the [inaudible 00:12:32] connectors, and then land in the landing storage, or raw storage.
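As a rough illustration of that stream-processing decoration step, the sketch below consumes vehicle events, enriches each one with a semantic tag, and fans the result out to the single source of truth plus the operational stores. Everything here — the event shape, the enrichment rule, and the sink stubs — is hypothetical, shown only to make the flow concrete:

```python
import json

def enrich(event):
    """Decorate a raw vehicle event with a semantic tag (hypothetical rule)."""
    event = dict(event)
    event["moving"] = event.get("speed_kph", 0) > 5  # simple semantic flag
    return event

def sink_s3(event):        # stand-in for writing to S3 (single source of truth)
    print("S3       <-", json.dumps(event))

def sink_druid(event):     # stand-in for the real-time operational analytics store
    print("Druid    <-", json.dumps(event))

def sink_redshift(event):  # stand-in for the downstream-consumption warehouse
    print("Redshift <-", json.dumps(event))

def process_stream(events):
    """Enrich each event, then fan out to the stores named in the talk."""
    for raw in events:
        enriched = enrich(raw)
        for sink in (sink_s3, sink_druid, sink_redshift):
            sink(enriched)

process_stream([{"vehicle_id": "v1", "speed_kph": 42}])
```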

Eventually, after going through certain decorations, the data is forwarded to S3, which is the single source of truth, and certain parts, for downstream consumption, make it to Redshift. So if you look at the data hub, you primarily have a single source of truth, which is S3, and there are additional semantic stores. You can think of the whole data hub as a complete storage layer with an abstraction, which enables governance. And if required, some of these semantic stores can be rebuilt or reconstructed from the single source of truth in S3, using the decorations, such as tags, that are present in S3.

Good — now we have the data hub. Let's see how the data products get created. Again, the same data hub has the single source of truth and the semantic stores. Data goes through the execution engine via the data hub connectors, and then the data products get created by services. The products come back through the data hub connectors, and based on the type of data product, a product may get federated across different semantic stores, or in some cases go directly to S3 — all combinations are possible. And for certain immediate, ad hoc analysis and exploration, [inaudible 00:14:01] is also being used.

Putting all the pieces together: you have the data hub and services. Services create data products, services consume data products, and the products eventually get posted into the data hub. All of this is governed by security and the data catalog, with the data quality guards and compliance. The manifests of the data products are also forwarded into the data catalog so that any reconciliation can be done at a later point. And all of this is there to enable different forms of analytics for the users, who could be internal users, or external users and partners.

Now let's talk about analytics, and then we'll walk through one specific use case as we progress. A quick glance at analytics: it needs to answer certain questions. Mainly descriptive — what has happened, from the past until a few seconds ago — and then diagnostic, which is why something has happened. Take the example of the same home area I talked about earlier. There are certain trips that happen, and a given user has some affinity to do certain activities in the home area versus outside it. "What happened" is the factual information, and descriptive analytics enables any kind of slicing, dicing, and exploration of it. "Why it happened" is what diagnostic analytics helps answer.

This feeds into the predictive, which helps answer questions like what may happen in the future. And the collective information gets fed into the prescriptive, which enables listing a set of options for getting the optimal outcome. For descriptive and diagnostic, under the hood it is all the data hub — where the semantic content and the single source of truth are present — going through the execution engine, with different layers of self-service visualization enabled. For observability, it usually goes to Grafana, and for functional and product analytics it goes to Kibana, [inaudible 00:16:25] Tableau, and Superset; based on the different stakeholders and internal business organizations, different visualization tools are enabled throughout. In addition, the services are also exposed via APIs for any of these insights and data.
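As a toy illustration of the descriptive layer — slicing and dicing trip facts such as activity by area — a pandas aggregation over a trips table might look like this; the column names and data are hypothetical:

```python
import pandas as pd

# Hypothetical trip facts as they might land in a semantic store.
trips = pd.DataFrame([
    {"user": "u1", "area": "home", "activity": "dining",  "weekend": True},
    {"user": "u1", "area": "home", "activity": "grocery", "weekend": False},
    {"user": "u1", "area": "away", "activity": "dining",  "weekend": True},
    {"user": "u2", "area": "home", "activity": "bank",    "weekend": False},
])

# Descriptive: what happened — activity counts sliced by area and weekend.
summary = (trips
           .groupby(["area", "weekend", "activity"])
           .size()
           .rename("trip_count")
           .reset_index())
print(summary)
```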

Now let's talk about predictive analytics. Again, this is the data hub; as discussed, trips [inaudible 00:16:50] is a data product, created from the example we talked about in the earlier slides. It lives in S3 itself as the single source of truth, and all the bookkeeping and metadata are managed in the catalog, but for illustration purposes it is listed as a separate block in the data hub. It goes through the execution block, and this is where the ML model life cycle kicks in. It starts with experimentation, which usually happens through Jupyter notebooks, and then comes the development phase. As part of the development phase, EPS get created first, and then model building, model parameter tuning, and evaluation happen. And then the model gets created.

In this example, the destination model comes from the trips we captured based on the user's activity; one model, or artifact, of that would be the destination model. It gets stored in S3, and some meta information is stored in DynamoDB — regardless of where it is stored, the bookkeeping and meta information is always stored in the catalog. Then comes the testing phase. Testing happens either on servers or in containers, based on the characteristics required for a specific model, and then the model goes to production and applications start consuming it [inaudible 00:18:21]. Certain models or products are sent to the event hub as a message, and whoever is listening to those topics consumes the updated models. Results are posted to monitoring, where interactive visualization happens. In certain cases, simulation also happens for the models, to see how the behavior would be if a model picks up new changes.
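A minimal sketch of that "artifact to S3, metadata to DynamoDB" step using boto3 — the bucket, table, and key names are all hypothetical, and the real pipeline surely records more bookkeeping than this:

```python
import time

import boto3  # pip install boto3; assumes AWS credentials are configured

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

def register_model(model_bytes, model_name, version):
    """Store a trained model artifact in S3 and its metadata in DynamoDB."""
    key = f"models/{model_name}/{version}/model.bin"  # hypothetical layout
    s3.put_object(Bucket="smart-data-lake-models", Key=key, Body=model_bytes)

    table = dynamodb.Table("model-registry")  # hypothetical table name
    table.put_item(Item={
        "model_name": model_name,
        "version": version,
        "s3_key": key,
        "created_at": int(time.time()),
        "stage": "testing",  # promoted to "production" after evaluation
    })
    return key

# register_model(b"...serialized destination model...", "destination", "v1")
```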

The data product is the trips, created based on the user's activity in a given area, or the overall user activity across all areas, and that gets used to create the destination model. To give a concrete use case of how it gets used: the user gets into the car and starts the ignition, and you will see a prediction showing up of where that user is likely heading at that point in time. For prescriptive analytics, the prescriptive data products also get used, in addition to the other semantic stores, and it goes through almost exactly the same cycle, which I will skip through in the interest of time.
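A deliberately simple way to think about such a destination model: from the trips product, estimate the most likely destination given the context at ignition — say, day type and hour. This sketch is a plain frequency model, not Telenav's actual model, and every field in it is hypothetical:

```python
from collections import Counter, defaultdict

class DestinationModel:
    """Predict the likely destination at ignition from historical trips."""

    def __init__(self):
        self.by_context = defaultdict(Counter)

    def fit(self, trips):
        # Each trip: (weekday: bool, hour: int, destination: str)
        for weekday, hour, dest in trips:
            self.by_context[(weekday, hour)][dest] += 1

    def predict(self, weekday, hour):
        counts = self.by_context.get((weekday, hour))
        if not counts:
            return None  # no history for this context yet
        return counts.most_common(1)[0][0]

model = DestinationModel()
model.fit([(True, 8, "work"), (True, 8, "work"), (False, 10, "park")])
print(model.predict(True, 8))  # -> "work": shown at ignition on a weekday morning
```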

It goes through exactly the same flow, but the model that gets created is the productivity model. What that means: as in the previous example, the user gets into the vehicle, starts the ignition, and heads to a restaurant where a reservation has been made, and there is a traffic incident on the way that keeps the user from arriving on time for the reservation. At that point, there are two possible outcomes: either the reservation is postponed to a later time, if there is availability, or it has to be canceled. This happens via notifications — under the hood, with what we have built, that would be communicated, and either the reservation gets changed or it gets canceled. These are actions the system can take for the user. That is an example of what prescription for productivity looks like.
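The prescriptive step in that scenario boils down to comparing the predicted arrival time against the reservation and choosing an action. A stripped-down sketch of that decision logic — the grace period, function names, and inputs are all hypothetical:

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=15)  # hypothetical tolerance before acting

def prescribe(eta: datetime, reservation_at: datetime, later_slot_available: bool):
    """Pick the optimal action when a traffic incident delays the user."""
    if eta <= reservation_at + GRACE:
        return "keep"       # still on time: do nothing
    if later_slot_available:
        return "postpone"   # push the reservation to a later slot
    return "cancel"         # no slot left: cancel on the user's behalf

now = datetime(2020, 7, 30, 18, 30)
print(prescribe(eta=now + timedelta(minutes=50),
                reservation_at=now + timedelta(minutes=20),
                later_slot_available=True))  # -> "postpone", sent as a notification
```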

Now let's talk a little bit about what is next. We need to transform into MLOps — right now there's quite a bit of operationalization that needs to happen to make the models more robust, working in conjunction with the profiles for prescriptive analytics. We also need more incremental processing; we are watching Apache Hudi and Iceberg in this space, and we'd like to adopt them sooner rather than later. Creating new data products and enabling new business models in real time is another critical area. It is a journey, not a destination, and we need to continue to iterate and improve our processes and the technologies around them.

What are the key takeaways? [inaudible 00:21:37] given phases, to visualization and APIs, with the contextual and semantic relationships. If you map it to Maslow's hierarchy of needs: the mandatory needs are that you have data, and having the single source of truth is mandatory; then you have discoverability and governance for your data. Then come the fulfillment needs, which are creating the semantic richness and having the data products. Data products sit at the top, as the fulfillment need, and that is what we need to strive for and get to.

And don't underestimate monitoring and the data quality guards to prevent a data swamp. [inaudible 00:22:24] Otherwise, it's garbage in, garbage out. If you don't know what you're dealing with and you don't have enough checks and balances with the data guard, it becomes increasingly complex and challenging. Everything costs.

Knowing what matters in terms of characteristics is very critical. As an example, if your business doesn't demand multicloud support, at least in the foreseeable future, then how much of an abstraction layer do you want, and to what extent do you want to use a technology directly? You need to strike a good balance, because, as I said, there is the dollar cost, and then the opportunity cost comes into it.

And clearly explore opportunities to create new data products — you need to take advantage of the smart data lake. To draw an analogy: in typical software development, you have the source code, which enables you to create binary artifacts. If you have an issue with any binary, in debugging or troubleshooting you'll try to trace it and then look into the source code, of course through the log files and additional mechanics, but the point is that in the end you get to the source code to see what's happening and then try to address it. Likewise, when you're creating the models to enable your technical capabilities through predictions and prescriptions...

You need a good way to explain why something happened, and when it happened. For that, being able to trace back to the profile, and having good data reconciliation, is highly helpful. At the same time, depending on whether you're doing it for a cohort of users or a small set of users, you may just need enough sampling when creating the profiles while still having complete explainability. With that, I'll open the floor for any questions.

Host

All right, thanks Kumar. So as Kumar mentioned, now is your chance to ask some questions live. If you have a question, please use the button on the upper right to share your audio and video, and then you'll automatically be put in a queue. So let's go ahead and get started. All right, if for some reason you don't pop up on the screen and you still would like to ask your question — unfortunately, I've lost a couple of speakers already — please do ask it in the session chat and we'll go ahead and ask Kumar verbally. Let's try one more. Any luck there? Nope.

Kumar Maddali

And, as you probably have seen in the poll window, some of the polls are questions you need to ask yourself very critically, to see how your data lake can be leveraged to bring value and monetization to the organization.

Gupiko

Hi, Kumar, this is Gupiko Pugushna. It was a nice presentation, and I have two questions. One: you said that data about the user being at a restaurant or other places can be captured in the smart data lake. How is that data captured? I mean, this product, or whatever service you have, is used in the car, right? So I just want to understand how the user's behavior is captured when he's outside of the car. And the second question: how is security taken care of in the process of capturing the user's behavior? Sometimes people may not want to share their location, or maybe some specific information — it's their choice, right? So I just wanted to understand how the security aspects are being considered in this scenario.

Kumar Maddali

Great, good questions, Gupi, thank you for asking. I will quickly answer the first question; the second one is a little loaded, really, and we can take it up as we get into the Slack channel — I'll be happy to answer it in more detail there. But I'll try to answer both questions. You're absolutely right: when you are in the vehicle, there's only certain information you can capture. The way you collect the information, or the data, from the vehicle is based on the ignition start, and then the GPS probe enables you to reconstruct and see where you are heading and where you are parking. However, once you park at a location, you may be going to a restaurant, or the bank, or some other area, or to run some other errand, and that aspect of it you will not know unless you have an active navigation, or some kind of a scheduler through which that integration is happening, right?

So if it is an active navigation, if you have provided a specific destination, then you do have that information and it gets used. If it is not there, then through the movement data — that means, based on the last destination as you are heading into a specific location — that information would get captured. And this is where the semantic richness comes into the picture: based on where you are heading, based on that particular latitude and longitude, you can compute what the particular point of interest or entity is that you are heading to.
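That last step — mapping a parked latitude and longitude to the point of interest the user is likely heading to — can be pictured as a nearest-neighbor lookup against a POI table. A toy sketch with a haversine distance; the POI list and distance threshold are invented for illustration:

```python
import math

# Hypothetical slice of a POI table from the road-network/POI data feed.
POIS = [
    ("Blue Bottle Coffee",  37.4221, -122.0841),
    ("First National Bank", 37.4230, -122.0860),
    ("Safeway Grocery",     37.4199, -122.0799),
]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_poi(lat, lon, max_m=150):
    """Return the closest POI within max_m meters of the parked position."""
    best = min(POIS, key=lambda p: haversine_m(lat, lon, p[1], p[2]))
    return best[0] if haversine_m(lat, lon, best[1], best[2]) <= max_m else None

print(nearest_poi(37.4222, -122.0843))  # -> "Blue Bottle Coffee"
```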

Good call.

Host

All right. So I think, given the time, that's all the questions we have a chance to answer today. Thank you, Kumar, for this breakout session. For everyone attending, Kumar will be available in the dedicated Slack channel for more discussion and Q and A, so just look for his name in the Slack channel.



