March 1, 2023

10:45 am - 11:15 am PST

What’s a Database in 2023?

Over the last several years, databases have undergone a series of enormous transformations. There are now hundreds of different database-shaped things, from traditional warehouses to data lakes to cloud data platforms to much more complicated webs of storage systems and compute engines. There are databases for nearly every vertical, use case, and architecture. What used to be a straightforward question, “Is that a database?”, can now only be answered in one frustrating way: “Well, it depends…”

This talk explores the different ways to answer this question. It first outlines some of the changes in the industry that make defining what a database is so difficult. It then explains how this affects what people expect from their databases and the things they want to do with them. Finally, it offers some guesses as to what that implies about the future of the database, whatever that word means.

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Benn Stancil:

Cool. Alright, I’m gonna talk about what’s a database in 2023. I am Benn. I am one of the founders of Mode. Mode, if you’re not familiar, is a modern BI tool designed around data teams. It looks like this. This is the data team side. This is the BI side, kind of drag-and-droppy visualizations and stuff like that. The sort of thing that you would stick on top of a Dremio to do your analysis. So cool. I also have a blog where I kind of scout the internet about data things. If you wanna check it out, it’s here. This is kind of the general vibe of the whole thing. And like I said, I’m gonna talk about what is a database in 2023. There’s one other kind of useful piece of context about where I’m coming from and who I am, which is that I identify as an analyst.

I am not an engineer, for instance. So this means that I think about data and databases as a tool for doing analysis and presenting it to my boss. Like Jonah Hill here, I write overly complicated and messy SQL queries. I argue about SQL formatting on the internet. I make charts. I make charts about SQL formatting. I ask obnoxious questions like “why?” to various business stakeholders all the time. This is what I do. I occasionally write code, but when I do, it looks like this. These are what my pull requests look like. So this is kind of my level of engineering. I have no idea what I’m doing. I hard-code loops. This is kind of, again, my perspective on all these things as not an engineer. And so I bring this up because when I’m asking what is a database in 2023, I’m not talking about it as an engineer.

I’m not trying to answer questions like this. So this was a blog post from Notion where they said, hey, we’re gonna shard our production database in some really fancy way. I’m not entirely sure I even know what sharding means. So we’re not gonna talk about this. We’re not talking about what is a database to build production applications. When we talk about databases, again, we’re talking about the way that Jonah Hill would run queries against it, in this sort of beautiful Photoshop thing that I’ve done. The other thing that I don’t care that much about is, what is the technical definition of a database? I’m sure there is some consortium of people in Geneva that come up with this and say, a database is these very technical specifications. I don’t really care about that either.

We can all have fun arguing about that on Hacker News and have 168-comment threads about whether or not Kafka is a database. I don’t really wanna do that here. I care more about it colloquially: what do we think of as a database? So this is Frankenstein to me. I get that technically it is Frankenstein’s monster. I don’t care. When we say Frankenstein, we mean this thing. So my question is, when we say a database, what do we mean? Not what do the people in Switzerland say that we should think?

And to me that matters because as an analyst, my boss might come to me and say, hey, I think we should invest in looking into a new database. We should potentially buy a new database, or think about rearchitecting the way that we do things with a new database. My first question on that is going to be why. My second question is gonna be, okay, what does he mean? What specifically is a database? What are the specifications I care about? I have to come up with some matrix of requirements to think about, okay, what do we need to go buy and what does it need to do? And that is not something I’m gonna get, again, from some technical definition that I find online. It’s a thing I’m gonna get from my friends. I’m gonna ask people around about the kinds of experiences they have.

And so that’s how I’m gonna kind of put together the requirements for what a database is to me. And so really the title of the talk probably should be “What will data teams and other analytical database buyers expect from the products they buy in 2023?” But that does not make a terribly compelling talk, so I went with “What is a database?” But this is more of the question that I’m trying to answer here. To answer this, though, I think it’s useful to go back in time, not to where we started, but to where we were a decade or so ago. So I wanna start with trying to answer this question, which is, what did data teams and other analytical database buyers expect from the products they bought in 2013? And kind of walk through a little bit of how we went from where we were then to where we are now.

And for people who have particular flavors of databases they really like and wanna see products or vendors: this is not a comprehensive walkthrough of all of the history that goes into this. There’s obviously lots of things that have happened in the database and database-adjacent space over the last decade. This is more illustrative of some of the changes that have happened. It is not meant to be, these are all the things that have happened. So the way this kind of worked back in 2013 or even before that, this is probably more like back in the early or late 2000 aughts or whatever we call that decade, is as an analyst, I would say, okay, I’m gonna ask a question of data. I’m gonna do that by asking for data with a SQL query.

Me as an analyst, again, obviously as Jonah Hill, the way I would do that is, say I’m using Postgres, which I think most people would say yes, Postgres is a database. That’s one that doesn’t seem terribly controversial. If I’m using Postgres, the way this would look would be, all right, I’m gonna ask for some data. I will write a query. That query will get sent off to Postgres. Postgres will compile it from SQL into some lower-level language that I do not understand. It will send those commands off to a giant calculator. That calculator will then run against some tables that live inside of Postgres. Once the calculations are applied to those tables, Postgres will produce another table, it will spit it back out, and me as Jonah Hill will be pumped up about it.
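The flow described above (write a query, have the engine turn it into a lower-level plan, run it against stored tables, get a table back) can be sketched in a few lines. This is only an illustration: it uses Python's built-in SQLite module as a stand-in for Postgres so that it runs without a server, and the `orders` table and its rows are made up.

```python
import sqlite3

# Assumption: SQLite stands in for Postgres so the sketch needs no server.
conn = sqlite3.connect(":memory:")

# The "tables that live inside" the database (hypothetical sample data).
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ada", 10.0), ("ada", 5.0), ("bob", 7.5)],
)

query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""

# The engine turns the SQL into a lower-level plan before running it.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Running the query produces another table, handed back as rows.
result = conn.execute(query).fetchall()
print(result)
```

The `plan` rows are the closest visible analogue to the "lower-level language" step; their exact text depends on the engine and version, so nothing here relies on it.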

So this is kinda what we had. This is the simplest definition probably of what a database was to me as an analyst, I guess still today, but also, you know, 10, 15 years ago. So some things have changed since then. Obviously there’s a handful of stuff that has changed. One of the changes, for instance, was things like Vertica and other column store databases came along and said, instead of storing these things as tables, we’re gonna store them as columns, roughly. Essentially new file formats and new ways of thinking about how we store this data. Okay, it’s all very exciting and good, makes your queries fast. Stuff like Redshift then said, okay, actually what if we put the whole thing in the cloud? So instead of running this database on a machine that you own, we’ll just run it all in some public cloud.
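The column store idea mentioned above can be made a little more concrete. The sketch below is a toy, not how Vertica or any real engine lays out data; the sample rows and function names are hypothetical. It only shows why a query that needs one field can read less when values are grouped by column.

```python
# Hypothetical sample data: three rows of (user_id, country, revenue).
rows = [
    (1, "US", 20.0),
    (2, "DE", 35.0),
    (3, "US", 12.5),
]

# Row-oriented layout: each record is stored together.
row_store = list(rows)

# Column-oriented layout: each column is stored together.
column_store = {
    "user_id": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}

def total_revenue_rows(store):
    # Walks every full record even though only one field is needed.
    return sum(record[2] for record in store)

def total_revenue_columns(store):
    # Reads only the one column the query asks about.
    return sum(store["revenue"])

# Both layouts hold the same data, so the answers agree.
assert total_revenue_rows(row_store) == total_revenue_columns(column_store)
```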

So Redshift did this. This, you know, bubbly thing is to show it’s a cloud, but it’s kind of obnoxious, so we’ll change it back to a square. But you can imagine everything over there being in the cloud. And then some other stuff started happening. So companies like Snowflake (and other companies did this before, but Snowflake obviously has had a lot of success with it) split out the concept of the calculator and the data into being separate entities, so that you can scale these things independently. So if I want a giant calculator applied to smaller tables, I can do that and it’ll run really fast. Or I can have some giant, giant tables and store things in some really big files without having to make my calculator super huge too. So now we have a system that kind of looks like this in Snowflake. And then Snowflake and other folks as well also said, hey, what if, instead of just querying the files inside of Snowflake, we start pointing it to other things too?

And so now we can say, okay, let’s run our giant calculator on stuff in S3 or file stores in other various places. Dremio, as we’ve seen earlier today and I’m sure a lot of you know, kind of removed the data in the database altogether. And now Dremio essentially applies to just the files in S3. So now that data doesn’t actually live in the database-shaped thing. The data actually all is external, or can be. And so we have just a calculator that runs on top of external files. But some other things have also happened. Some other things have come along here too, where there’s other pieces of this picture we can kind of fill in as well. So if y’all are familiar with what dbt is, dbt is a transformation tool.
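The "calculator that runs on top of external files" picture above can be sketched as a function that owns no data and only reads whatever files it is pointed at. This is a toy under stated assumptions: a temporary local directory stands in for S3, the CSV files and the `total_by_customer` function are hypothetical, and nothing here reflects how Dremio or Snowflake is actually implemented.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

def write_sample_files(directory):
    # Assumption: a local directory stands in for an object store such as S3.
    parts = {
        "part-1.csv": [("ada", "10.0"), ("bob", "7.5")],
        "part-2.csv": [("ada", "5.0")],
    }
    for name, records in parts.items():
        with open(directory / name, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["customer", "amount"])
            writer.writerows(records)

def total_by_customer(paths):
    # The "calculator": it stores nothing and only reads the files it is given.
    totals = defaultdict(float)
    for path in paths:
        with open(path, newline="") as handle:
            for record in csv.DictReader(handle):
                totals[record["customer"]] += float(record["amount"])
    return dict(totals)

with tempfile.TemporaryDirectory() as tmp:
    external = Path(tmp)
    write_sample_files(external)
    result = total_by_customer(sorted(external.glob("*.csv")))

print(result)
```

Because the function takes paths rather than holding tables, the same compute could be pointed at a different set of files without changing anything else, which is the separation the talk is describing.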

A couple years ago they said, hey, in addition to us applying basically batch jobs where we transform data in your warehouse, dbt is actually gonna insert itself in between your BI tools and the database, where it actually compiles queries for you. And then it sends those queries back to the database underneath it. And they introduced this through this concept of metrics. So now we have an architecture that looks more like this, where I write a query, dbt actually compiles it into a different query, and then it gets handed to Dremio and Dremio compiles it into something else. And then we have BI tools like Mode, like this. So Mode is actually built on top of a caching layer that data gets passed to, and additional computation happens there. All these visualizations and stuff are computed on top of that layer, a similar architecture to Tableau’s Hyper, if you’re familiar.
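The two compile steps described above (a metrics layer turns a request into SQL, then the engine underneath turns that SQL into something lower level) can be sketched roughly like this. It is not dbt's actual syntax or API: the `METRICS` dictionary and `compile_metric` function are hypothetical, and SQLite stands in for the engine underneath.

```python
import sqlite3

# Hypothetical metric definitions; real tools have their own formats.
METRICS = {
    "revenue": {"table": "orders", "expression": "SUM(amount)"},
}

def compile_metric(metric, group_by):
    # First compile step: a metric request becomes a SQL string.
    spec = METRICS[metric]
    return (
        f"SELECT {group_by}, {spec['expression']} AS {metric} "
        f"FROM {spec['table']} GROUP BY {group_by} ORDER BY {group_by}"
    )

# Assumption: SQLite stands in for the query engine underneath.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 4.0), ("west", 6.0), ("east", 1.0)],
)

sql = compile_metric("revenue", "region")
# Second compile step happens inside the engine when it plans and runs the SQL.
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

The analyst only names a metric and a grouping; where the business logic lives (the dictionary) and where the SQL runs (the engine) are separate pieces, which is what makes "where is the database?" harder to answer.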

So now we have a diagram that adds this piece too. So now we have some more complication here. And so now if we’re looking at this, the question is, where’s the database? What’s the database in this picture? You could say it’s here, ’cause this is the thing that compiles the SQL queries or the things from the analyst the first time. You could say it’s here, ’cause this is the thing that compiles from queries to some lower-level language. It could be the calculator. It could be the piece that maps the calculator to the files, that understands how to run that computation even if it’s not the thing that’s actually executing it. It could be the place the results get delivered and the people consume data. So all these things could be places we can conceivably call the database. Or it could be the files themselves, or it could be these things here.

So the question is, again, where’s the database? If you ask me this, if I look at this picture, my kind of guess here is, it’s probably Dremio. I don’t have a great reason for answering that way other than to kind of squint at it and say, that seems like the right thing, seems the most database-y of all these things. But it’s not really a great answer. In addition, there have been other things added here that have complicated this picture even further. And we’ve added more things to, again, the database-shaped object in this diagram. So for instance, last year, Snowflake introduced the idea of running apps directly inside of Snowflake. You can now run a lot of this stuff inside of Snowflake’s own kind of infrastructure.

There’s obviously tools now (this is from Databricks, but other warehouses and those types of things as well) where you can actually execute things that aren’t just SQL. So Databricks, for instance, can run Python and R directly inside of their kind of giant calculator. So databases could potentially do that. Tools like BigQuery have programmatic APIs, so you can query them through interfaces like REST. And Snowflake as well also introduced the idea of combining transactional data and analytical warehouses. So actually we can do stuff like this now. And so with all these questions, again, the question really is, okay, what then is a database in 2023? And so the best answer I can come up with here is actually stolen from this guy. This is a Supreme Court justice from the 1960s.

His name is Potter Stewart. He had the issue of ruling on basically banning things that were obscene. He was essentially trying to say, what is porn? The answer he came up with was, he doesn’t have a definition for it, but I’ll know it when I see it. And that’s basically the best answer I can come up with for what is a database now. Someone says, hey, we need to evaluate a new database. My answer is, I don’t know what that is, but I’ll kind of know it when I see it. And so that’s not a pretty answer. And so for the second half of this, I wanna walk through a few different ideas of how we could potentially come to define it in slightly better ways, and whether they’ll work. There’ll be some little polls where y’all can tell me what y’all think.

Okay, so the first option here is it could be where it sits. It could be a spot in this diagram; a database could be defined by its location in this picture here. And so we could make this picture a little bit more generic. We could remove dbt and say, instead of dbt, it’s just a thing that kind of compiles business logic, computes metrics, does kind of relational modeling, that kind of stuff. Then in that case the Dremio thing compiles SQL. We could get rid of Dremio and say actually all of these brown boxes are different tools or different applications. And again, if we ask the question here of what is the database, we have a bunch of different options. It could be the place where this business logic lives. It could be the place where the SQL actually compiles, again, into some lower-level language that most humans don’t understand.

It could be the giant calculator. It could be the thing that maps the calculator to the files underneath it, wherever those files may live. So it’s kind of this metadata layer of sorts. It could be where the data’s actually physically stored, so S3 or Postgres or whatever. And it could actually be where the results are delivered. All these things are potential answers: a database has to do one of these six things. So if we do this, we can run a whole thing. Ashley, is it possible to do the online fanciness? I don’t know what’ll happen. I don’t know what they see. Anyway, so while we’re waiting for the internet to catch up. Oh, there’s stuff happening here. Cool. Folks here, who has a vote for number one? For two? For three? Four? Nobody’s got any votes? For five? Oh, okay. We’ll go to six. Nobody thinks the BI tool. Great. Okay, so basically it’s maybe four or maybe five, where data is physically stored, which is kind of interesting. Is it four or five?

Ah, okay. That’s kind of interesting then, ’cause that means in the Dremio world it’s actually potentially just S3 that is the database and not Dremio itself. I don’t know. Or if you have Snowflake and you’re querying external tables, Snowflake is not the database. It is your external tables that are the database. That’s kinda not what I expected. Okay. Anyway, so this is one way you can define it. A second way you can define it is a required set of features: databases are things that need to have certain features. It’s gotta check this box. If it doesn’t check that box, it ain’t a database. I don’t have any slides really for this other than this one, which is because I’ve already shown some of these things. There could be things where we’re saying, these are what we expect the database to be.

I’m not gonna buy a database unless it checks one of these boxes. It could be something where it’s in the cloud. It has to separate storage from compute in this fancy way. It could be that it has to be able to read from external storage, has to run multiple languages, has to have programmatic APIs, be this kind of analytical-transactional hybrid thing. It can run apps, or it contains business logic. None of these things, by the way, are the obvious things, like that it’s a thing that runs queries. These are more of the things that would get tacked on. And so my question with this is, for the databases we buy in the future, what are the things we definitely expect any future database we would consider to do? Does anybody think it will be 1? 2? 3? 4?
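The "check the box" way of defining a database above can be sketched as a small requirements check. The feature names below paraphrase the list from the talk; the candidate engines, their feature sets, and the choice of which features are required are all hypothetical, chosen only to show the shape of the evaluation.

```python
# Features named in the talk; the candidate products below are hypothetical.
FEATURES = [
    "in_the_cloud",
    "separates_storage_and_compute",
    "reads_external_storage",
    "runs_multiple_languages",
    "programmatic_api",
    "transactional_analytical_hybrid",
    "runs_apps",
    "contains_business_logic",
]

def missing_features(candidate, required):
    # Returns the required boxes the candidate fails to check, in order.
    return [feature for feature in required if feature not in candidate]

# Assumption: a buyer who only insists on two of the eight boxes.
required = ["in_the_cloud", "reads_external_storage"]

candidates = {
    "engine_a": {"in_the_cloud", "reads_external_storage", "programmatic_api"},
    "engine_b": {"in_the_cloud", "runs_apps"},
}

for name, features in candidates.items():
    gaps = missing_features(features, required)
    print(name, "qualifies" if not gaps else f"missing {gaps}")
```

Changing `required` changes which candidates count as "a database" under this definition, which is the point of the poll: the answer depends on which boxes the buyer treats as mandatory.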

Got a little bit. 3? 5? 6? All right. Seven? Eight? What do we got? Six. Wow. All right. This is good. Completely different than I expected. Wow. That doesn’t even exist today. So, okay, so that’s cool. So I guess we’ll expect databases to be this, like, all-in-one thing. That’s kind of cool. And that in some ways may answer the next question, which is the third way potentially you could classify a database: is it actually just the full experience? It’s not a single list of things. It’s not an individual feature checkbox. It’s more of the combined experience and how it feels to use. And so the way I would define this is, suppose that you have a bratty elementary school person. I don’t know how old she was supposed to be.

And instead of wanting a new Oompa Loompa, she wants a new TV, and you’re like, okay, great, I am going to give you a TV. You could say, okay, here’s a TV. This is a Samsung smart TV. It’s wifi-enabled and has all these apps downloaded on it and it can do a bunch of fancy stuff. And I guess you can browse the internet, and it’s a computer essentially that’s shaped like a TV. If you gave that girl this thing, would she be like, it’s a TV? She’d be like, yes, that’s probably a TV. She’s happy with that. But you could also probably say, okay, what about this? This is a Samsung dumb TV. It doesn’t connect to the internet. It’s just a screen with some plugs in the back. But you can stick things into it, like video games or Rokus or whatever else.

Would she say that’s a TV? Yeah, probably. That’s still probably a TV. The thing that makes a TV is it’s a screen that you can show stuff on. However, if she said, I want a new phone, you can be like, well, here’s a new phone. No, that’s not a phone. You can be like, here’s a camera. That’s not a phone either. You can be like, here’s a computer, you can use stuff on the internet. That’s not a phone. You can be like, here’s all these things together. Still probably not a phone. You want this. This is a phone. It’s none of these individual features. Steve Jobs has made us discover that we have to have phones like this and they have to do a million different things. And if it doesn’t do a million different things, it’s not a phone.

So the question is, if she was real weird and she’s like, I want a new database (I don’t know what her deal is), the question is, does this count? Can I say, great, here is Postgres, plain old Postgres running on the internet? Do I still think that’s a database? Am I happy with that, or do I want something like this? Snowflake, which I don’t think markets itself as a database. I think they now call themselves a data cloud, because there’s all this stuff, and there’s a database kind of inside of it, but it does a million other things. Do we expect databases to be the first one or the second one? Do we just want the really powerful engine, or do we want the thing that can do a million different things? So that is my last question, which is, will we expect our databases to be this super fast piece of core technology that we can layer a bunch of other things on top of? Or do we think it will be an entire platform with lots of bells and whistles? This is a simple poll that apparently has options one and one. I messed that up. Who has option top? What about bottom? Ah, alright, so it’s just a fancy engine, and all these other bells and whistles are not necessary. We don’t want an iPhone. Ashley, what do we got?

Ooh, okay. All of these things have gone weird. Which is useful, because my last slide is this. Basically, in trying to put this together, I was like, okay, great, I’m gonna have this clean definition of what a database is. And this is essentially where I landed: I don’t actually think I know anymore. And so, to wrap this up with a little bit of a way of trying to answer this question, I think the takeaway is basically that there’s probably not a definition. We probably don’t actually have a clean way to talk about this anymore. It’s sort of odd, given how central databases are to everything that we do. It’s probably now become a term that’s a little bit hard to pin down. Everybody’s gonna have slightly different definitions of it. It’s kind of a hand-wavy thing. And so that may be what a database is in 2023: not a particularly useful term, but something that we all kind of know when we see it but don’t have any way to define. So with that I will stop there. Thanks, y’all.