March 1, 2023

10:45 am - 11:15 am PST

Taming The Chaos: Standardizing Disparate Police Data in a Modern Multi-Tenant B2G SaaS Environment

Government agencies struggle with managing complex and proprietary on-prem data systems, resulting in inefficiencies, data silos, and limited ability to better serve their communities through data insights. Making matters worse, the lack of data standardization across thousands of agencies hinders the transfer of successful improvements.

In this talk we begin to address how Dremio is being leveraged to tame continuously ingested data from numerous government agencies. The speaker will discuss the benefits of standardization, including improved Time to Value (TTV), accessibility, reduced costs, and enhanced collaboration among departments. We will also share some key lessons learned to help attendees understand how they can leverage technology to bring order to their data and maximize its value.

Topics Covered

Real-world implementation

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Brendan Cronin:

Taming the chaos standardizing disparate police data in a modern multi-tenant business to government SaaS environment. So, let’s switch slides. So today, you know, just I’ll give a little intro of who I am, what we do as a company, and then we’ll just first level set on the data and the data problem that we’re actually solving. And then after that, we’ll just talk about some of how we’re leveraging Dremeo for these things and just some key lessons learned throughout the process. So name’s Brenda Cronan. I’m the architect of a gov tech startup named Force Metrics. And what we do is we help police departments and other community service organizations better understand how to help their, their communities through data. And that involves, but certainly not limited to just making their data more accessible, searchable, and, you know, just eventually more understandable. the premise of force metrics essentially is the, the sooner that our data, or, or excuse me, the sooner that our clients can extract value from their data, the sooner they know how to do a better job helping and not hurting their communities.

So the police data problem, or at least one of them, because there are a lot. So at the surface, police data is mostly what you would expect. It follows the expected like chronology of an event in that it typically starts with the 9 1 1 call. So, you know, there’s all sorts of data around calls, but arguably one of the most important is, is call notes, which is freeform text entered by the 9 1 1 dispatcher at the time. And then depending on the severity and the nature of the call, it escalates into you know, an incident or a case. And in which case the incident essentially has to do with officer written case reports, details around arrest charges, information about suspects, offenders, you know, victims, you know, and other type of, you know, involvements like property. So essentially, if it gets to that point, you have a lot more data, but if it doesn’t get to that point and you only have a call to work with, you have a lot less data.

And then just in general, you have people, and, you know, people data as, as we all know, there’s, there’s a lot of it these days. So you have address information, you know, you have notes and alerts, and you know, this could be anything from trespass information to just help helpful information, let’s say the individual’s death or something like that, that the officers need to be aware of before interacting with them. And then geospatial data. So geodata is complex, but especially in policing, there’s all different forms of it. So you have, you know, addresses in general, but you also have districts and beats and counties and jurisdictions and areas and all those sort of things that you, you need to know about if you’re a crime analyst or something like that. Just trying to understand like, where did this happen in our data? And, you know, this is this is a complex thing, but we’re, we’re not gonna talk too much about location stuff this day today.

But each each system really is police cad, which is computer aided dispatch, and then rms, which is record management systems. so there’s a lot other, there’s a lot of other police systems, but these systems vary and sometimes wildly. So but there’s also additional data sources that, you know, could include, you know, federal, you know, state and federal criminal data integrations, you know, body-worn cameras or, or dash cameras, license plate readers and et cetera, et cetera. So it gets it gets deceivingly complex very soon. But the data problem, so as I said, every system is different and, but that’s not just on a technical level. So the terminology and meaning, even some of the things that I just mentioned, like say, incidents and cases, like that’s already not consistent across systems and across the four different departments, even if they’re in neighboring agency from one another.

So incident at one department or one system might mean something completely different than case or incident in another. Some systems have both, some have one or the other narrative, you know, like a, an officer written report. Sometimes they’re called narrative, sometimes they’re called reports, sometimes they’re called supplements. things that you wouldn’t expect, like people might be called jackets, . So you know that there, there’s all these sort of nuances system to system that each department has to learn to operate within their, their own realm. So and then aliases, you know, like an alias for a person like this could be, you know, somebody falsely gives, you know, a name that you know is not really themselves to the police department, but also sometimes they just mistype things. So there’s linking of aliases between these things. So then just the data structure itself is also different.

So CAD and RMS are separate systems, or they’re not, sometimes they’re not or they’re partially replicated into one another or, you know, there’s just a single system with no distinction at all, you know. So all of these just surface level things that you would expect to be consistent aren’t and some systems use a unique and proprietary strange database that’s just a derivative of a very old database, and that could be hard to work with. So that, that definitely complicates things like you do a basic SQL query and it’s like, I don’t know how to run that. So working with these systems is intrinsically difficult. and some construct the notion of something like an incident from say, 15 tables or 20 tables, and then other systems it could be two or three, you know, so the complexity and, and just like the logical relationships is not what you, you would think because of that.

So reports, you know, they could be directly written into their RMS system and therefore it’s just a big text blob in a, in a database. But they could also be written in Microsoft Word or WordPad or a text and then uploaded as an actual file to a case or to an incident, or to both, or to some completely different place, . So the, the point is that everything is different and the logical relationships reflect that. So the records could be related to each other via a foreign keys or just a string. And then some systems have one to many, some have many to many, some have none to many. So even these sort of things are, are very inconsistent. And because of that, the same address or person could be entered a number of ways. So the consistency is, is really one of the biggest challenges with, with any of this.

And then the human part because of all of these inconsistencies, but with slow cadences of upgrading or changing systems, you might have the same software in a department for say, five years or 10 years, and that makes it so through time everyone learns how to build processes around avoiding the, the downsides of, of these systems, or, you know, they just learn the terminology associated with one. And it just makes it difficult because it’s garbage in and garbage out, right? It’s the typical garbage in, garbage out. So, you know, searches because of this are, are difficult. If you figure out how to search police data with one software or one agency, that doesn’t necessarily mean that you’ve figured out how to do it with others, right? So how do we actually make this better? How do we leverage Drio to try to solve this problem?

And, you know, first of all, it really makes it makes sense in our case what we’re about to talk about. But every single department is different. Every business is different, every project is different. So, you know, I really want to just emphasize like, you, you really have to establish data principles and priorities that will actually inform your decision. So our approach stems from that we’re business government, and that we are a SaaS product, and that we have extremely tight security controls. So, you know, secure by default. That’s, that’s without, without saying, but, you know, keeping data lineage predictable, repeatable and understandable is very important for, for us. So we need to be able to trace back in our product where did that exact piece of data actually come from in their system? And depending on how you design your data pipeline, that could be very difficult.

A lot of people aren’t thinking about lineage holistically, they’re thinking about lineage within their own pipeline, but you need to be able to attribute that. Then a little shameless covid joke here, you know, flatten the curve for data consumers to, to learn and extract value. I mean, for, we’re a startup, we need to move fast. And, you know, ger helps us do that. And then it should be easy to ask the same question of all of our data, and this is really why we’re talking about standardizing today. But how many assault incidents occurred in client A? It should be just as easy and almost identical to ask that question to client B or client C or say all of our right, and then never introduce mystery or magic. You know, a lot of, a lot of what we’re about to talk about with how we standardize things really rolls into, into that is, you know, it should be explicit and very, very clear what every space serves a purpose and et cetera, et cetera.

So then, you know, last but not least, the client data really should be completely isolated from each other, yet universally analyzable. And really what that means is, at the data lake level, being, in our case a w ss three, we need a, the client data separate from one another. We can’t combine data. Soo actually makes this easy. Yes, I’m sure most of you know, and most of you use term for the benefit of us. Well, so then let me go to space sanity. Come on. There we go. So the most, I I have to quote Elon Musk here, the most common error of a smart engineer is to optimize the thing that should not exist. So the reason I put that here is when we first started out, like we spent a lot of time thinking through just how to actually structure this in Drio, because you can make it very complicated and you can make it unnecessarily complicated in a way that just adds an unnecessary amount of friction to your team and to your data consumers and whatnot.

So, you know, architect your germ space, such a can’t talk today space structure such that it is easily explainable to your least technical audience. And really what that means is a good foundation makes all the, the difference, but everything should be downstream of what is previous to it and serving an explicit and irrefutable purpose. So source could be a space, it could be a, you know, an actual, you know, data lake itself, but source is sourced. It’s where the data actually arrives, right? That should be, should be very clear cleaned. Is the next space really in the chain here, where it should be nearly identical to source tables whenever possible. But this is also where you could do manipulations improvements completed, you know, lookups. Like I said, if an incident table has 20 tables to construct it, then you might have to have some lookups and stuff that, that make your life easier and standardized and then standardized.

Just, you know, that it’s normalize where possible. If you can normalize the data, you normalize it there, but it’s still available where it’s not. So what that means is you’re never going to have an actual common data model that is holistically including every single bit of every single client’s data. In our case, it just won’t happen, right? But, so what we don’t want to do is avoid standardizing. So we standardize what is possible, and then things that say only exist for one client or, or only for some clients. We also include that in standardized as just extra data. So that way it’s still available for our, our data consumers to extract value from, but it’s, they’re not deprived of it. So then curated just slices of standardized data, meaning it could be sourced from either one or multiple clients.

Now there’s a little lag here that is making my head hurt when I click on the next slide, and I have to wait for it. There we go. all right, so the first space I talked about being cleaned. So, you know, we’re gonna skip source because it is just out of the, the scope of this talk. But, you know, cleaned it should be, you know, we like to say just enough to get it into A V D S. And really what that means is just minimal field manipulation. And if you aren’t yet on iceberg, which you should be and there’s a lot of good talks on that, then you might need to do like cast typing, you know, type casting here and, and just little manipulations like that. But it’s important in our case to avoid manipulating data in place. So what you can see in purple on that slide, the FMS or force metrics underscore, and then the f the source field names.

So what we’ve learned to do here is manipulate data in parallel. bad data is bad data. And what you may think is predictable, like say the structure or the format or the syntax, it can suddenly change and it can also be easy to make too many assumptions on, you know, say building logic around how to clean data and the way that you think you’re making it better, which then actually makes it so that you can’t do the same sort of joins that you’d otherwise be able to do, because their source system, no matter how subpar, right is expecting that, right? So if you manipulate data in parallel instead of in place, meaning you just have another copy of a field which Drio is very good at, then you avoid that sort of issue. And you know, some fields if you’re, if you’re doing a lookup, it’s not necessarily the worst thing to do it in place for certain things, but use your judgment, right?

And then you know, lookups are a constant thing, so it makes sense to actually do them here and to have them available here as a parallel field. Because if you had to do it in standardized, and you had say, 10 different standardized sets all looking at the same clean to be the s, then you need to repeat that logic 10 times, right? And that goes, goes against some just typical coding principles and whatnot. So onto standardized, right? So in a way, it’s kind of the simplest space to understand the point of, right? So, you know, think I made a halfway decent diagram here, but essentially the VDS is in standardized, should only be downstream of cleaned or anything else that is in standardized. And, you know, join to construct what you need and not actually look up what you’ve selected. And what that means is if I need to construct a standardized set from this data and that data and that data that’s worthy of a join.

But if I’m trying to change the value of that via a lookup, the, it’s back to the clean space logic of like, why was that not done in clean, right? Because if you do it in cleaning, you don’t need to repeat it in all of these different standardized sets and by it simplifies your SQL code, right? So you don’t actually need to have extremely complex CTEs or, you know, 19 joints to construct, you know, five fields. It, it, overall, it makes everything cleaner. And then in the bottom, bottom left of the slide, I, I kind of gave an example of, of what it I, I mean by including everything. So if you can’t actually standardize something, then you can still include it and make it available to your downstream consumers or, or maybe slightly change the field name or two to make it more understandable, but you don’t necessarily need to deprive them of that extra data, right?

And then curated. So curated is overall I, I lied. It’s actually the easiest pla space to explain. So downstream of standardized and and or only other curated VD s’s. And it’s, it’s a really handy field. So like, let’s say you wanted to create something like, you know, a group of incidents, like you’re doing some kind of project where you need to analyze incidents that were classified in a certain manner, right? That’s a, that’s a good use case for curated, but so is something like all of the incidents across all of our clients say in Texas, right? Like there might be a use case for that. So you can think of curated as just, just merely slices of of data that, that you would otherwise get. So that’s the structure that, that we’ve been sticking with for quite a while now.

And it works out pretty well. But through this lessons learned. So really the important part here eight chefs can’t cook an omelet any faster than one, right? So I, I really should have put that bullet as the, the first one, but avoid too many chefs in the kitchen, right? Make your decisions, design your, your solution around your requirements. But if you have every single chef in the kitchen at the same time trying to flip an omelet or cook an omelet, right? Like you’re just going to have problems, right? So make those decisions, stick to it, but actually have an understanding in place of, you know, as, as the data engineering team or whatever team you are, you know, you, you know, your data consumer’s priorities and you know their requirements, and that you’re actually tailoring your decisions to them. and keep it simple.

You know, don’t rationalize data loss. So, you know, like I was saying with with the clean layer, so manipulating in place, let’s say if you’re not manipulating data in place, or if you’re not manipul manipulating data in parallel, but you’re manipulating in place, it could be easy to get into the habit of, you know, you do a joint, you do a joint, you, you like, oh, that, that had 1.846 million fields. Well now it has 1.845 million fields, or, or sorry, million rows. Well, it’s close enough, , just move on. Don’t, don’t do that, right? Don’t rationalize data loss because that could be a small difference on the row count, but a massive difference, absolutely massive difference on the contents of those rows. And obviously this talk isn’t really about, you know, data quality monitoring and, and all that. But it’s an important point to make that if you have sanity in your space design, you will eventually be able to more easily achieve sanity in your data quality checks because it’s, it’s more explicit, it’s more understandable, and it’s more consistent.

And then, you know, build a vds change management process, but don’t drown in it. So, you know, death by process is a, is quite a thing across all of tech, as I’m sure most of us know. but you can’t necessarily avoid it. So you need a change management process. and Arctic I I believe you know, which is an extension of Nest, which STR has been talking a lot about, is a fantastic way to help with that because the, the goal should be to be as get driven as possible but don’t have so much process where you can’t move fast and make changes with semantic representation of your data. Like, I can’t tell you how many times we’ve been like, well, let’s just redo it. Just make a new folder and then change every query. Do your quality checks and just flip to that, you know?

So don’t be afraid of, of quick change, but don’t make it so it’s a mystery why something is getting a different result, because team member number 14 went and changed something and no one really knew about it, right? And you know, fix your pain points. You know, you gotta be able to make new mistakes. If you’re, if you’re making the same mistakes or if you’re experiencing the same pains, then fix it. Because as obvious as that advice sounds, a lot of people will again, optimize the thing that should just not exist. You know, we, we actually spent a good amount of time optimizing reflections at one point, and then we threw away a lot of how we were doing the underlying data because we were doing a migration to iceberg. Well, with Iceberg, we find that the vast majority of our data sets don’t even need reflections cuz it’s so fast querying iceberg tables from Germany.

So we, the amount of time we used to spend dealing with reflections if we put that time into migrating to iceberg, we no longer have to spend that time at all. Right? And, you know, so start from a solid foundation of underlying data really is the next point. And you know, that I pretty much just covered there with the iceberg stuff, but iceberg, iceberg, iceberg, iceberg is your friend. andos tags and catalog are your friend. So this is something that I think, frankly, we still underutilize, but since we’ve started to properly utilize them, it makes life easier because if you start to put tags on VD s’s, you can find those VD ss much more easily. But you can also do very clever things with API driven automations. So you could say, you know, extra check or you know, whatever, you know, whatever you want to name the tag.

And then you could just have an airflow tag that’s looking at any VDS that has tagged that and doing some other extra automation that you actually weren’t thinking of originally. So you can get clever with tags, but you don’t want to have 500 tags in every p d s, but utilize them and then, you know, kind of like I covered before, get, get, get gets important be as, get driven as possible. And then as a bonus disable S3 versioning for your distributed store buckets. we didn’t do this once, and if you’ve ever seen what an exponential graph looks like, that would be the data the data size of a, of a certain S3 bucket that had versioning enabled that we use for our distributed store buckets. So if you do have versioning enabled for your distributed store buckets and you want to have a pretty easy win for vastly reducing your AWS costs, take a look at disabling that or at least putting a lifecycle policy on it.

And that’s that’s about it. So I haven’t looked at the chat yet, maybe I should, but if there’s any questions, feel free to to ask. And then my email is in that slide. Brendan Force Feel free to email me if anyone wanted it to. Hey, right on cue, huh? if anyone wanted to talk about, you know, your implementation or just some more details, be happy to to go through it. I think the more we learn about each other’s environments to a degree the more we can avoid the mistakes other people are making.