

Subsurface LIVE Winter 2021

5 Lessons Learnt from Building Context Aware Smart Data Lake

Session Abstract

This session features five lessons learned from building a context-aware smart data lake. We will discuss the Five Ws (who, what, when, where and why) of the context-aware data lake supporting 40+ languages. Attendees will see firsthand how Telenav, a leading provider of connected car and location-based services, works with spatial and time series data.

Presented By

Kumar Maddali, Vice President, Product Development, Telenav

Kumar Maddali is the Vice President of Product Development at Telenav where he leads initiatives in search, data science and big data platforms. Prior to Telenav, Kumar led the digital transformation and modernization of systems at companies such as Oracle, Stratify and Intergraph by creating data products, services and applications all powered by analytics.


Webinar Transcript

Speaker 1:

(silence). Hey everyone. Thanks for joining. [00:00:30] We’ll be getting started in about four minutes here. (silence). Hey folks. Thanks [00:02:00] for joining. We’ve got about two more minutes. We’ll wait for a few more attendees, and we’ll get started on time, at 1:05 PM Pacific Time. (silence).

Kumar Maddali:

Excuse me.

Speaker 1:

Hello everybody, [00:04:00] and thank you for joining us for this session. I wanted to go through a few housekeeping items before we start. First, we will have a live Q&A after the presentation. We recommend activating your microphone and camera for the Q&A portion of the session. Simply use the button on the upper right of the stream to share your audio and video. You’ll automatically be put in the queue, and then we can have you ask your question live. So with that, I’d like to welcome our next [00:04:30] speaker, Kumar Maddali, vice president of product development at Telenav. Kumar, over to you.

Kumar Maddali:

Thank you. Thank you for having me. Good afternoon, good morning, good evening, wherever you are located. I hope you have had a fantastic time at this conference. Thanks to Dremio for setting up this conference, and thank you all for choosing to spend the next 25 to 30 minutes with me. Without further delay, let’s get going.

[00:05:00] Can you guess what these places are about? Truth or Consequences? No Name? Can you guess? Are these TV shows or radio shows, or games, or some random phrases? It’s difficult to interpret or guess without having any additional context, right? Let’s see what these are. [00:05:30] Truth or Consequences is a city in New Mexico. No Name is a settlement area, which is like a township with a small population, in Colorado. Why do you care? Imagine if you are dealing with spatial data and building location-based services, as an example, location-based search. It is all about understanding user intent and the context.

How do you [00:06:00] get that? How do you understand the intent? If you have data coming into your system, and if you can generate the context as it is appearing or as it is flowing through your data lake, then it enables you to have [inaudible 00:06:15] enrichments to your data that help to solve interesting use cases and provide better experiences. This talk is all about how you can create context awareness through your data [00:06:30] while you are building the data lake, what lessons we have learned, or rather the best practices, and how you can apply this in your own context or in your own business, while you are building your own data lake.

Briefly about me: I have 25 years of experience in enterprise search, web search, location-based search, and analytics in different verticals. I’m a passionate half marathon runner. And these [00:07:00] days, like most of you, I’m also surviving by working from home and often going on hikes. You can reach me on LinkedIn or Twitter.

What do we do at Telenav? We provide delightful connected car experiences through our in-car software services. And we have the software deployed in 90-plus countries and 40-plus languages. These are the regions where our software is installed, in various automobiles. [00:07:30] Primarily we deal with spatial data, which is time series in nature, and also the telemetry aspect of it, like the vehicle or automobile sensors, along with behavioral data of how users interact with the software. All that information like [inaudible 00:07:48]. These are the types of data that we deal with.

This is the agenda we’ll go over. What is a smart data lake? I’ll go through a concrete example of what [00:08:00] it looks like, the context awareness, and how you can do it while building the data lake, and some of the lessons learned from our experience, which hopefully you can correlate with what you are doing in your data lake or in your business, and then apply.

Data lakes. Anytime we talk about a data lake, we look into volume, velocity, variety. These are all important characteristics: where [00:08:30] your single source of truth is, how good your ingestion efficiency is, how robust your ETLs are, what kind of technology frameworks and tools you are using, and how you can enable your analytics visualization. Do you have discoverability using schema on read? What kind of specification do you have? Is it stream-first? Streaming or batching? How do you optimize operational costs? What kind of interfaces do you have for accessing [00:09:00] your data, whether it’s a SQL interface or some other interfaces? And then data governance: how do you have data traceability or cataloging? And then data security and all that.

These are all very important characteristics. Depending on your organization’s maturity level and also where you are in the journey of data lake building, you may be doing most of these or all of these, and I’m sure you may be having more than these also. These are some [00:09:30] characteristics which typically [inaudible 00:09:31]. But I want you to also have an additional lens, which is the context: how you can have this context awareness in your data while you are building.

Now let’s talk about the smart data lake. It’s good to see the interesting buzzwords, like smart, intelligent, cognitive, so on and so forth. Right? But it’s important to understand what is really smart here and [00:10:00] how we are defining it. The frame of reference is so critical. So for us, a smart data lake means it is helping deliver context-aware semantic services. I will go through an example of what this context awareness is. It provides you centralized governance, which includes stuff like being compliant with the regulations, and then the security, authentication, authorization, data principles, so on and so forth. And data democratization [inaudible 00:10:30].

[00:10:30] Typically, in an organization, there are different centers of excellence that may have been created, who offer or provide services. If you have data democratization, then the different services can decorate or have the data connectivity so that you can provide better enrichment, which can accelerate your innovation as well as expand your business models. And all this with enough governance, [inaudible 00:10:59] [00:11:00] is what you need to provide.

And analytics — there are different forms of analytics, like descriptive [inaudible 00:11:06], predictive and prescriptive, so on and so forth. What has happened, why it happened, and what may happen in the future? These are the different types of questions that you may have to answer depending on who the stakeholders are. And then having good, clear boundaries with APIs and data products would enable you to accelerate building [00:11:30] these context data services. Of course, all this needs to happen in a reliable and robust manner.

Now we’ll talk about context awareness. I will give you a concrete example with location. While we are walking through what this context awareness is, I’ll spend some time here so that you can understand and, importantly, correlate what the context elements are in your [00:12:00] business, in what you are dealing with. These are the place types, which most of us can correlate with, and there are other activities that happen in any or most of these places. This is the typical time that we may be spending. It may vary from person to person, but generally, this is an illustration to see how much time you may spend, typically, at these places.

These are the days of the week when [00:12:30] you may spend time at these places. Again, it may vary based on the region where you are located, and the seasonal part of it could be different in certain regions. And at the same time, this is what you typically do in certain months. It may vary depending on whether it is summer or winter, based on the geographical location that you are in, so on and so forth. These are the types of POIs and categories you may typically correlate [inaudible 00:12:55]. If you see these place types, what is common [00:13:00] to all of these is that there is an address element. And that is what I’m going to go through in detail.

The most common elements that you see in an address, that any of us can navigate to, are city, street, and door number. Of course, based on the region and country, you may have some additional information like the district, township, so on and so forth. But to keep it simple, these are the most common [inaudible 00:13:27] that you may need to have, [00:13:30] if you need to navigate to a given specific location.

Now let’s see how it works. This is a world map. If you see, these are cities that are present in different parts of the world. By looking at the names, it may look interesting, right? As an example, Street is a city in the UK. And if you apply the default mental model, or if you operate in a different context, [00:14:00] typically, when someone says “street”, the immediate [inaudible 00:14:04].

To take an example, [inaudible 00:14:09]. It means [inaudible 00:14:12] direction. First is the street body, and then “street” is the street type. Okay? That is what your typical default thinking would be. But if you associate the context, based on the location, and then the personal, like the historical references that the individual may have, [00:14:30] then you can have a better correlation when you are trying to create your knowledge, which can enrich the experience in different use cases.
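
To make this concrete, here is a minimal Python sketch — with a hypothetical gazetteer, not Telenav’s implementation — of how regional context can override the default “street = street type” reading for a place name like Street in the UK:

```python
# Hypothetical gazetteer: place names that collide with generic address tokens.
GAZETTEER = {
    ("GB", "street"): {"kind": "city", "name": "Street, Somerset, UK"},
}

def interpret_token(token: str, region: str) -> dict:
    """Interpret an address token, letting regional context override the default."""
    entry = GAZETTEER.get((region, token.lower()))
    if entry is not None:
        return entry                                   # context-aware reading
    return {"kind": "street_type", "name": token}      # default mental model

print(interpret_token("Street", "GB"))  # {'kind': 'city', 'name': 'Street, Somerset, UK'}
print(interpret_token("Street", "US"))  # {'kind': 'street_type', 'name': 'Street'}
```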

Some of these seem very strange. In different languages, again, certain place names are words, if you see. And particularly in multilingual countries, it looks very interesting what the interpretation in a different language would be for the same phrase or [00:15:00] sentence. Now let’s look into the street. You see the color coding: purple is the street, the building is in yellow, and blue is the sub-building. This is in France and it’s in the French language. To keep it simple, I will read the last part of it in English.

It is the 14th of July, 1789, which is when the French Revolution started. Right? In France, it is pretty common to use [00:15:30] historically significant dates as street names, which you very seldom see in other countries. This is a different flavor of context that you need to be aware of, for all the data that you are dealing with. If you see Belgium, it is a multilingual country; French, Dutch, and German are the spoken languages. If you look into the city of Brussels, it has both French and Dutch as spoken languages, [00:16:00] and certain national policies and regulations mandate that in certain areas, the street boards or street signs exist in a specific language.

You have a combination of things: Dutch, French only, that’s optional, vice versa, in all kinds of combinations. If you see, some streets appear like [inaudible 00:16:22] — this is a French street, or French word, and then [foreign language 00:16:28] is a Dutch word. And you [00:16:30] do have streets that appear like that. How interesting it would be, based on the type of data you get in, and what kind of entities you need to extract from the behavioral data or other forms of data — this is an illustration, and I want all of you to correlate it with what you are doing.

This is the door number. As I said, I’m taking it as city, street and then door number. Let’s look into the first example: 17 [00:17:00] over 19. This could be, depending on [inaudible 00:17:04], building 17 and 19, or building 17 to 19, or it could be apartment number 17 in building 19, or vice versa. Each of these has a different semantic interpretation. If you start looking at the data as just data, your normalization may strip certain punctuation, [00:17:30] depending on where the data source or data is coming from. And that may completely change the interpretation of how and what you look into. That is why it is important to ensure you always enable context awareness for your data as it is flowing through the system.
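
A minimal sketch of the “17/19” problem, assuming hypothetical convention tags: a naive normalization that strips punctuation collapses semantically different readings into the same string, while a context-aware parser keeps the separator and resolves it per the source’s convention:

```python
import re

def naive_normalize(door: str) -> str:
    """Strip punctuation: '17/19' and '17-19' both collapse to '17 19'."""
    return re.sub(r"[^\w]+", " ", door).strip()

def interpret_door(door: str, convention: str) -> dict:
    """Keep the separator and resolve it per the source region's convention."""
    if "/" in door:
        a, b = (p.strip() for p in door.split("/", 1))
        if convention == "apartment/building":   # hypothetical convention tag
            return {"apartment": a, "building": b}
        return {"buildings": [a, b]}
    return {"building": door.strip()}

print(naive_normalize("17/19"))                        # '17 19' -- meaning lost
print(interpret_door("17/19", "apartment/building"))   # {'apartment': '17', 'building': '19'}
print(interpret_door("17/19", "default"))              # {'buildings': ['17', '19']}
```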

Now I’ll take a different variation of it, which is the relationships for a location. So far, we have seen the address part. Now this one, probably [00:18:00] some of you who are in the [inaudible 00:18:01] can correlate with. This is the San Jose Valley Fair Shopping Center. And it has Nordstrom next to [Gap 00:18:10], inside the Westfield Valley Fair Mall. Within Nordstrom, there is a coffee shop, and it is next to Stevens Creek, near an intersection, near Freeway 880, so on and so forth.

If we look into this example very closely, there are quite a few prepositions. Right? You don’t have those, typically, in the data, [00:18:30] but users, based on where they are coming from or originated, or in certain parts of countries or in certain regions, quite commonly refer to a location using prepositions. You need to ensure that this aspect, from the behavioral data and the other forms of data, is collected and then correlated at the right place so that you can build a knowledge graph. And then the entities, [00:19:00] and other downstream systems, can leverage that intelligence.
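
One way to picture this — purely illustrative entity and relation names, not Telenav’s schema — is to capture the prepositional relationships as knowledge-graph edges that downstream consumers can traverse:

```python
# (subject, relation, object) triples for the relationships described above.
edges = [
    ("Nordstrom",   "next_to", "Gap"),
    ("Nordstrom",   "inside",  "Westfield Valley Fair Mall"),
    ("Coffee Shop", "within",  "Nordstrom"),
    ("Westfield Valley Fair Mall", "near", "Stevens Creek"),
    ("Westfield Valley Fair Mall", "near", "Freeway 880"),
]

def relations_of(entity: str) -> list:
    """Edges a downstream consumer traverses to resolve phrases like
    'the coffee shop in Nordstrom near 880'."""
    return [(rel, obj) for subj, rel, obj in edges if subj == entity]

print(relations_of("Nordstrom"))
# [('next_to', 'Gap'), ('inside', 'Westfield Valley Fair Mall')]
```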

Putting all these things together: context is a function of one or more elements. I have just gone through location in great detail. And then the time — as you can see from the previous example, the open hours, or peak or off-peak visit times, can have a significant impact in terms of what kind of activity you can do. So these are the other [00:19:30] forms of context, which we’ll look into and decorate as we take the data through the different stages of the data lake.

Now here, I want you to think about what the key context elements are that you may have in your business while you are building your data lake. And see when and where you could have them, which can really help to accelerate your innovation and your business, and then solving the use cases. [00:20:00] Now I will just talk a little bit about the technology, like how the data flows and where we enrich this context.

Excuse me. So these are the service logs, typically. Sorry — service logs that get collected through [inaudible 00:20:22], that collector. And then they pass through the aggregators, and eventually land in Elasticsearch. The metrics are captured [00:20:30] through the Prometheus client libraries, and they go through Prometheus server processing for scraping and a few other steps, and [inaudible 00:20:39]. But the single source of truth for all of this is S3. If you look into the service events, which usually comply with a specification, an event specification, and the sensor data coming from the automobile, and the behavioral data from the interaction of users with applications — [00:21:00] all that gets generated and goes through the event hub.
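
As a minimal sketch of the metrics leg of this pipeline, assuming a Python service, the Prometheus client library exposes counters for the Prometheus server to scrape; the metric name here is hypothetical:

```python
from prometheus_client import Counter, start_http_server
import time

# Hypothetical counter for events decorated with context during processing.
ENRICHED_EVENTS = Counter(
    "enriched_events_total",
    "Events decorated with context during processing",
)

if __name__ == "__main__":
    start_http_server(8000)    # exposes /metrics for the Prometheus server to scrape
    while True:
        ENRICHED_EVENTS.inc()  # stand-in for real event processing
        time.sleep(1)
```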

And then some part of it gets processed as part of the stream processing, the stream services. This is where some domain enrichments, like semantic enrichments, would happen with the context awareness. Then all the roll-ups are made available through [Bri 00:21:23], and some part of the key-value user profile information goes into Cassandra. And [00:21:30] then some part of that goes through the batch processing. And again, as part of the batch process, the domain enrichments or the semantic enrichments would happen. This will be available, again, in the single source of truth, with all the refinements and the history. And then your data lake query engine is used for all the discoverability, and slicing and dicing, for all the downstream consumption.
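
A minimal, vendor-neutral sketch of what such a stream-side domain enrichment step might look like — all names are illustrative — decorating each event with time-of-day context before it flows on to the roll-ups and the profile store:

```python
from datetime import datetime, timezone

def enrich(event: dict) -> dict:
    """Decorate a raw telemetry/behavioral event with time-of-day context."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    enriched = dict(event)
    enriched["context"] = {
        "day_of_week": ts.strftime("%A"),
        "is_peak": 7 <= ts.hour <= 9 or 16 <= ts.hour <= 19,
        # A real pipeline would also attach place type, locale, etc.
    }
    return enriched

raw = {"user": "u1", "ts": 1_614_200_000, "lat": 37.32, "lon": -121.95}
print(enrich(raw)["context"])
```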

This is the road network or POI data, which goes through the transactional [00:22:00] custom connectors and lands in a Postgres database. And eventually all these are served from the main services or the [inaudible 00:22:08], or through the [inaudible 00:22:12]. If you connect the dots with what I have shown earlier about context awareness, you have the data coming in from the road network or the POI data, or there may be publicly available data sources, or you have it from the behavioral data — [00:22:30] the data of users’ interactions.

So taking that and enriching it with the context, and then the semantic relationships, while you are processing it, can really help you to build interesting inferred or derived entities, and then create the knowledge graph or certain additional inferred attribution that can help you to solve use cases. That’s the key point I want to illustrate. As it [00:23:00] is coming in, data would remain just data, unless you add enrichment at the right place so that it can solve the use cases in a more intelligent and smart way.

Now I’ll talk through the lessons learned. Typically, these are the five W principles, which I’m sure most of you are aware of. How you frame the problem is what typically leads you to the solution. Right? [00:23:30] So when you are framing the problem, you should ask the right questions, both top-down as well as bottom-up. Who’s your consumer, and why are you doing it? What are you doing? When are you doing it? Where are you doing it? So on, so forth. These are some of the lessons learned from our experience, from what we have gone through. Lesson learned one. Why? Always think about the product — product-centric thinking. Typically, when we talk about the data, [00:24:00] we always look at it from the producer standpoint, right? What is the velocity? What is the volume? Where are you storing it? How are you accessing it? And what kind of governance is there?

All that is very important. But in addition to that, have consumer-centric thinking: see who the consumer of your data will be and how they can access it, what kind of APIs you need to expose, and whether you need to have data [00:24:30] products. If you complement that thinking with a consumer-centric approach, then that can really create a lot more additional value than what you may have.

Lesson learned two. Who? There are usually different objectives and KPIs for different stakeholders. We all like to think that there is one set of objectives for the data lake access. But in reality, if you see, in any organization there may be different stakeholders. This [00:25:00] is just a glimpse of the various stakeholders that you may have. You may have more or less, and some of the roles could be different.

But the point is, when you look with the lens of the product owner, they may be looking primarily at the finance side. For an engineering lead, maybe SLAs, throughput, so on, so forth. On the business side: how am I going to generate new revenue streams? And how am I going to expand my business? Stuff like that. If you look into data scientists: how can I [00:25:30] extract some interesting patterns? How can I provide my predictions or prescriptions? And how do I tune my models? Stuff like that.

Like that, there are different personas with different expectations from the data. Knowing who your consumers or stakeholders are really helps you to strike the right balance, and then you make it available for all the stakeholders. Lesson learned three. [00:26:00] When? When you perform quality checks and classification is very important. If you are doing it too soon, too early in the processing, then you may have to go through certain additional data integrity related checks. Particularly if you are building any kind of intelligence such as knowledge graphs, you may have to make those updates happen again.

If you are doing it too late, then your turnaround time — that overhead would be there. So the point is that [00:26:30] there is a good balance you need to maintain in when you perform the data quality checks and the classification, which is required from governance or other perspectives as well, so that you make sure your data is trustworthy for all the downstream consumption.

Lesson learned four. Where? Where are you doing it? Meaning, are you going to bring the domain expertise [00:27:00] or the service expertise to the data teams, or have data [inaudible 00:27:07]? There is no one right answer. This is where, based on your organization structure and the value that you are bringing in from the data lake, you may have to strike a balance. One thing to look into is to maybe decentralize the domain-based ownership. The data products and all — let [00:27:30] that be provided to the services so that they can enrich the data, because the domain experts could be [inaudible 00:27:39] much more than the infrastructure or data engineering teams.

But at the same time, centralize services such as governance and enabling infrastructure capabilities [inaudible 00:27:49]. As we talked about earlier, this is where the democratization of data is very important, to make sure that everyone has access, in terms of governance. [00:28:00] Lesson learned five. What? Characteristics and access patterns. It’s very important to make sure: what are the characteristics that you are looking for? Do we have an army of people or do we have a small team? What technologies do we choose? What processes do we want to adhere to? And what are the access patterns that you may have for your data, both from a write standpoint and a read standpoint, with the context you’ll add? All these are very important [00:28:30] in terms of how you drive the choices.

Takeaways. If there is any takeaway from this presentation: give importance to context awareness and semantic enrichments to your data. How and where you do this context awareness, throughout the journey of the various stages of the data lake — if you can just add this, then it gives really a lot more value and power [00:29:00] to your data for solving the use cases. And always have the KPI and consumer notion, rather than always looking at it with respect to the producers.

Maslow’s hierarchy of data needs. If you see Maslow’s hierarchy, there are basic needs, security and safety needs, and then you would have fulfillment needs. As an organization, see where you stand right now and where you want to be, and then see how you [00:29:30] can bring in these data needs so that they can solve and help with what all the different stakeholders are looking for. By applying this [inaudible 00:29:39].

The journey is as important as the destination. Always have a continuous evaluation, because the reality is that technology is changing and things are getting updated. It’s very important to have those insights and learnings as part of your journey, [00:30:00] and apply them to the data your company manages. With that said, thank you all for your time. I’d be happy to take any questions related to how you can enrich the context or how you can create the context for your data, or any questions along those lines. And after this, I’ll be available on the Slack channel as well. Thank you.

Speaker 1:

All right, thanks. Let’s go ahead and open it up for Q&A. If you do have a question, please use the button in the upper [00:30:30] right to share your audio and video, and we’ll automatically put you in the queue. It looks like we do have a couple of videos here, so let me see. Dwayne Phillips, I’m going to put you up first. Looks like Dwayne might’ve left the room, so we’ll go ahead and start asking some other questions while we’re waiting for the rest to queue up. First one is: do you use any particular APIs/tools specifically for context enrichment?

Kumar Maddali:

[00:31:00] Do we use APIs and tools? Yes, for sure. We use certain tools, but those are not specific to context enrichment — general tools for accessing the data and exploring the data, and APIs, like in terms of using Swagger and stuff like that. But the context enrichment primarily comes from the domain understanding and then the domain interpretation. That is where, more than tools, we try to put emphasis on how we [00:31:30] can bring in the domain expertise in these areas.

Speaker 1: Okay.

Kumar Maddali: There are two sides [inaudible 00:31:36].

Speaker 1: Thanks Kumar. Next question. How is context embedded during the data processing?

Kumar Maddali:

I think this is a little bit involved. It all depends on what the context elements are that you’re talking about. I briefly talked about the spatial data, time series data, as well as [00:32:00] some IoT data, like the sensor data. Right? So what is the data you’re processing? That is what would help us understand how and what kind of context you can embed. But the point I’m illustrating is that you need to have the context enrichment as part of the processing. And then how you’re going to add it depends on what your data looks like and what domain you are dealing with. And then [00:32:30] we can talk about that. I’d be happy to answer more specific questions about your environment in the channel.

Speaker 1:

Okay. Next question. Did you say that you were using PostgreSQL as a key value store? Also, how is the context enrichment represented as a key value pair?

Kumar Maddali:

Okay, good question. I did not say Postgres is usually [inaudible 00:32:55]. We use Cassandra as a key-value store, but Postgres is used as a relational [00:33:00] store, where you need to have the ACID properties, for some of the road network data and [inaudible 00:33:07] data, which happens much earlier in the processing, and not for the downstream consumption [inaudible 00:33:15]. We primarily use the inverted index and key-value, and then the graph structures, for all the consumer-centric interfaces and all.

Context enrichment — how we represent it usually depends on the context. [00:33:30] Again, it depends on the data. If you are looking at the spatial data types, it’s mostly the graph representation. You would have the knowledge graph to have it [inaudible 00:33:39] from one node to another node, and how the edges would have … And then that’s one bit. But there is no one standard answer on how you will represent your context; it depends on the type of data. Some could be represented as a dictionary and some could be as a graph. It depends.
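
For illustration, a minimal sketch of the two representations mentioned here — the field names are hypothetical:

```python
# Dictionary representation: flat context attributes for one entity.
context_attrs = {"place_type": "coffee_shop", "peak_days": ["Sat", "Sun"]}

# Graph representation: adjacency list with the relation carried on each edge.
context_graph = {
    "coffee_shop_42": [("within", "nordstrom_1")],
    "nordstrom_1":    [("inside", "valley_fair_mall")],
}

def neighbors(node: str) -> list:
    return context_graph.get(node, [])

print(neighbors("coffee_shop_42"))  # [('within', 'nordstrom_1')]
```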

Speaker 1:

[00:34:00] Okay, we have one question coming in through video chat. Apologies, I missed those. Let me go ahead and open it up. Nope, that person left as well. All right, I think that’s actually all the time we have for questions today. If we didn’t get to your question, you will have an opportunity to ask it in Kumar’s channel in the Subsurface Slack. Before you leave, I would greatly appreciate it if you would please fill out the super short — and it’s really short — Slido session survey on the top right. Just click [00:34:30] that Slido tab. The next sessions are coming up in about five minutes and the expo hall is also open. I encourage you to check out the booths to get demos of the latest tech and win some really awesome prizes. Thank you so much, Kumar. Fantastic presentation. I hope everybody enjoys the rest of the conference.

Kumar Maddali:

Thank you all. Have a great time.