Subsurface LIVE Winter 2021
Introducing InfluxDB IOx, a Federated In-Memory Columnar Store Backed by Object Storage
This talk introduces InfluxDB IOx, the future open source core of the InfluxDB time series database. Written in Rust with Apache Arrow, InfluxDB IOx is an in-memory columnar store that uses Parquet files in object storage for persistence. It includes built-in functionality for defining replication and flexible cluster topologies with total operator control over partitioning and replication scheduling. This talk will introduce the architecture and concepts behind InfluxDB IOx. We'll talk about the technology choices, Arrow and Parquet and the API of the project.
Paul Dix, Chief Technical Officer, InfluxData
Paul Dix is the creator of InfluxDB and the CTO and founder of InfluxData. He has helped build software for startups, large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters and Air Force Space Command. He is the series editor for Addison Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 13,000 members. Paul holds a degree in computer science from Columbia University.
All right, everyone. Thank you so much for being here with us. I hope that you have been enjoying the conference and all the sessions so far. Before we begin, there are a couple of housekeeping items that I want to run by the audience. We are going to have a live Q&A session at the end of this presentation. So if you have any questions and you want to participate on the live Q&A go ahead and… I will prompt [00:00:30] you, and then you will have to enable your microphone and your video so we can interact. Also, if you have any questions, feel free to additionally post them on the chat window that you will have on your right-hand side on the presentation. And in addition to that, don’t forget to go to the Slido tab on your screen to provide feedback about the presentation and the conference in general. Without further ado, I want to introduce Paul, the CTO of InfluxData, and his presentation. [00:01:00] Paul, the time is yours.
Great. Thank you. Let me just do the screen share really quick. All right. That’s it. I just moved this browser window out of the way. All right. I think we’re ready to go. So thanks for joining me today. [00:01:30] Today, I’m going to talk about InfluxDB IOx, which is a new columnar database that we’re building at InfluxData. So as I mentioned, it’s columnar database for historical reference. This is Michael Stonebraker, who as an encore to creating PostgreSQL, went on to create something called C-store in academia, which he then spun out as Vertica, which was one of the first columnar [00:02:00] databases that became commercially popular.
IOx is written in Rust, which we’ll talk about in a little bit using Apache Arrow. It is dual licensed under MIT and Apache 2. And really what IOx is it’s the future core of InfluxDB, which is an open-source MIT licensed time series database. For those that don’t know, I created InfluxDB in [00:02:30] 2013. It’s written in Go. Is a time series database that supports multiple data types and can store raw events in addition to summary data. So even though InfluxDB has a goal of being useful for events in irregular time series, most users use it for metrics. This irregular time series is going to be system metrics, sensor metrics, any of that kind of stuff.
I’ll jump into the InfluxDB data model really quickly, [00:03:00] just in case you haven’t seen it before. But the bulk of this talk is really about IOx in the new system. So we have this line protocol, which is a text-based protocol that we invented for feeding data into the database. It consists of a measurement name, which is a string, tags, which are key value pairs where the values are strings, this kind of metadata that describes the time series. There are fields which are key value pairs where the fields can be in 64, float 64, [00:03:30] boolean or string. And then lastly, there’s nano second epic. So all of this data is organized in the database into individual time series where a series can be represented by keyed off of the measurement name, the tag set and the field name. So basically you have that as a series and it’s essentially just value time temples ordered by time.
So InfluxDB as a database, as a design is essentially two [00:04:00] databases in one. There’s the raw time series data store and then there’s an inverted index. Now what the inverted index does is it maps metadata, like a measurement name or a tag key value pair to an underlying like series ID, which you can then query across. So usually you use an inverted index for mapping document terms to documents and search. We’re using it for mapping metadata to individual time series. [00:04:30] And the truth is this structure, the structure of measurement, name tags and all this other stuff, really, that’s just kind of a collaborative pack to try and get users to organize their data in such a way that we can do automatic indexing for it and retrieve results quickly.
But despite all that, there are a number of pain points that we’ve experienced over the years. So the first thing is it’s highly indexed, right? We index all the tags and their individual unique [00:05:00] values. We index all the data by specific time series. And what this means is it can be expensive to write data in because we’re indexing a lot on ingest. It also means that [inaudible 00:05:11] data where you have say containers spinning up and down, you have a tracing data point-to-point networking data or anything where you have metadata that describes things that doesn’t live for a long time, It makes the index larger and larger over time. It means we’re spending a lot of time [00:05:30] and effort to build and store these indexes and not necessarily a lot of time using them.
The other thing those means is that because of the way data is structured in the system queries across many individual unique series are quite expensive. They can blow up memory, they can be really slow to execute. And then this is just kind of an artifact of how the database was built was we use memory map files extensively, which was great to get started, but [00:06:00] in modern containerized environments, we’re finding that there’s some pain associated with that because we need more fine grain control over how the system is using memory and paging stuff in and out.
So basically with the inverted index structure and the memory map and those things we want to get rid of all this is pointing towards a new core of the database that we kind of had to start from scratch on.
When I thought about [00:06:30] rebuilding the core of the database, I still thought about the requirements we need, because we wanted obviously to be useful for our existing use cases, which is largely metrics, but we also want to be great for analytics use cases because [inaudible 00:06:45] many of the operations that you want to do on time series is basically analytic style operations. We need support for unbounded cardinality, which allows us to have great support for tracing use cases or network [00:07:00] monitoring, or really just giving users the ability to write data in, without having to worry about whether or not something is a tag or a field.
As a system, we wanted to be able to separate compute from storage. And more specifically, we wanted to be able to separate ingest from indexing, from replication and subscribing to data, actually sorting data and persisting it from the actual raw [00:07:30] underlying storage layer. We want to be able to import data and export data in bulk. We want to be able to support real-time subscriptions. Operators should be able to have fine grain control over how and when data is replicated and how it is partitioned across a number of servers. We need embeddable scripting so that in addition to basic query support, we can have users of the database inject more and more complex business logic, both for data [00:08:00] as it arrives and periodically.
And What we really wanted to do was wanted to get a broader ecosystem compatibility. We wanted Influx to be useful, this project to be useful with a larger set of open-source tools than previous versions of InfluxDB was. So there are some insights we’ve had along the way with users of InfluxDB and also as we started the research [00:08:30] to build up to this project. So the first is that the vast majority of the queries that our users send to influx are on recent data. Stuff that’s written in within the last 24 hours. And then all the other data is dormant almost all of the time.
The other thing we learned is that partitioning the data into different chunks and blocks plus brute force querying of columnar [00:09:00] performance is actually good enough for our use case. So that was one of the key things is that we learned that the columnar database techniques would probably be useful for this. And sometimes depending on what the data looks like, columnar database compression is actually better than what we can achieve with the custom storage engine that we actually wrote for InfluxDB years ago. So this was basically the most important insight with this whole thing was that columnar database [00:09:30] would work well for time series data given that it has the right enhancements.
So next we had to go to, okay, what are the build choices? Well, should we pick up an existing columnar store? Obviously there are other open-source columnar databases that are quite mature, they’ve been around for awhile. And as a company and as a product, we’re about a lot more than just the database. We’re trying to help users solve problems in time series data. So maybe they’d be better served [00:10:00] if we just picked up a columnar database off the shelf. And what we found when we started looking at other systems was one, they didn’t quite have the support that we wanted and needed for dictionary types. In our use case, dictionary types and having first-class support for that is absolutely critical. We need far better windowed aggregate support than many of them have.
And the other thing is we need to be able to operate on compressed in-memory data and use techniques like late [00:10:30] materialization and stuff like that so that we can have a bigger block of our data in memory without having to pay a super high price. And then one of the things we knew we wanted was want a system that was designed to operate in an environment where storage is [inaudible 00:10:46] and object storage is your long-term persistence layer. And none of the systems that we found were actually designed around that. They were all built in previous times where you have long lived [inaudible 00:10:58] or long lived machines [00:11:00] or whatever.
And then the last bit was most of these projects that we found were written in either C++ or Java, which ultimately… I’m not a purist when it comes to an application that I’m adopting [inaudible 00:11:15] language it’s written in. But if our development team is going to have to write a significant amount of code and take ownership of it over time to be able to get the functionality we need, then that’s a bit of a detractor [00:11:30] for me that it’s in C++ or Java. I wanted something that was written Rust.
So I’ve talked a bit about my excitement for the Rust language before. I’ve written about this. I really do think that it’s a future of system software. For us, it gives us the control we need. Like fine-grain control over memory that we’re looking for, but with the safety of a higher-level language. I found that the model that uses for concurrent applications, which is basically almost all server software eliminates data [00:12:00] races, which eliminates a whole class of bugs. In our Go code base, data races frequently caused bugs that were very, very hard to track down and only surfaced in high contention, high load production environments.
Another nice thing is it’s embeddable in other languages and systems, which means we can take this and embed it into InfluxDB or into other parts of our stack. We could even compile it down into web assembly if we wanted to. So this talk isn’t about Rust [00:12:30] obviously, but I think as a project, IOx aims to be something that’s long lived, that’s going to be the basis of InfluxDB for a very long time. And so I wanted something that was designed to be long lived. And I think Rust as a language choice speaks to that. I think it is a solid foundation upon which to build projects that you intend to maintain for years.
So that’s why the project is named IOx, which [00:13:00] is short for iron oxide. So the other big thing, the other big choice was that we based it on Apache Arrow. So Apache Arrow is an in-memory columnar data specification, but it’s also a persistence format via Apache Parquet, which is used widely both inside and outside the Arrow ecosystem. Most data warehouses can actually read and write Parquet data. But Arrow is also Arrow Flight, [00:13:30] which is an RPC specification and high performance cluster server framework for transferring large data sets over the network, which is exactly one of the things we want to be able to do.
So Arrow has a bunch of different native language implementations. Within the Rust section of the Arrow project is another project called data fusion, which is a columnar SQL execution engine. So basically we’re building on top of Arrow, but we’re also contributing to it. And we’re [00:14:00] contributing heavily to data fusion because that is the execution engine for IOx. Using all these tools together basically means that this version of the InfluxDB database is something that’s being created by people all over the world in projects outside of just projects that we’re doing, which is very nice. We’ve already seen other people contributing to Rust’s implementation of the Arrow through data fusion. That was donated by somebody a few years ago. [00:14:30] And thankfully we’re able to just add to it and help move it forward.
So, the other thing about IOx is really it’s not just a columnar data store. It’s actually not really a store at all. Object storage is the actual store. IOx is really just a system for managing the data life cycle, which in time series, I think is a very, very important because you have real-time ingests that happens all the time, you have to buffer it up and [00:15:00] determine when you need to ship it off to object storage, when to index it, when to sort it. It’s also a system for being able to specify when and how data should be replicated or sent to other systems for subscriptions or replication. And ultimately it’s a system, not just for querying, be a SQL and our other languages, Flux and InfluxQL, but also for processing the data.
So [00:15:30] it’s columnar database. So one part of the structure should be really familiar, which is basically you have a database name, you have tables, those tables have columns, and those columns have data types.
So, here’s how data is organized within it, right? So at the top you have a database, and then below that data split into partitions. So here I have a partition for, it’s supposed to be two separate hours. It’s typo on the slides. And then [00:16:00] once data is broken out into a partition, it is then split up into chunks. So chunks are basically immutable blocks of data. And within each chunk are individual Parquet files where a Parquet file obviously stores an individual table. So you can have tables exist in multiple chunks, and you have a shared kind of schema that exists that you can roll up to the database level.
So from a high level, when you’re thinking about ingesting [00:16:30] data, it looks like this. You have an inbound write and that write can be sent to one or both of the write ahead log buffer, the mutable buffer. And then from the mutable buffer data is gathered up into larger blocks. And then it can be sent out to the read buffer, which is essentially a read optimized compressed in memory database, or it can be sent out to object storage in the form of Parquet [00:17:00] files and basically metadata and a catalog that describes those Parquet files. So the important thing is the mutable buffer is essentially like an area that’s [inaudible 00:17:11] write. So it’s not optimized for query efficiency and speed, but the goal is to have data evicted from the mutable buffer out to the read buffer frequently enough so that the mutable buffer can be queried quickly because there’s just not that much data there. And then the read buffer [00:17:30] is where the real performance happens.
So the wall buffer is actually just an in-memory cue for collecting up writes. I’ll talk about that in a second. So here’s a more detailed look at the life cycle. With markers for which operations are synchronous and which are asynchronous. So by synchronous, I mean, when a write request comes in, those things can happen before a response is sent back to the writer. So input is [00:18:00] taken in and each of the top three boxes are optional. So that input right now is complexity to be aligned protocol. We’ll be adding support for JSON and probably per [inaudible 00:18:10] That’s converted into flat buffers, which is the structure that IOx uses to talk to other IOx servers to replicate data or to store write ahead logs in object storage. So the top three boxes replicate out to other servers while buffering [00:18:30] or receiving writes into the mutable buffer. Any one of those is optional. So what that means is an IOx server can fill all three roles or any one of those roles individually.
When you’re querying data, that happens here in the mutable buffer or in the read buffer, which is again in memory or against Parquet files that are cashed in the local file system from object storage. So currently the only way where we [00:19:00] intend to use the local file system is basically as a cache for files and object store. IOx is meant to be run or it’s designed to be run without any locally attached storage. So you can run it exclusively off of object storage.
So the wall buffer exists on a per database level. It puts writes into memory into what’s called a segment, and then periodically either by time or by the size of the segment. Those can be closed [00:19:30] and then optionally persisted into object storage. And because that’s optional, what that means is an IOx server can have rights buffered in memory strictly for the purpose of sending those rights out to subscribers. And because those segments are actually sent to object storage, another way of doing a replication or subscribing to the data is simply to spin up an IOx server and pointed at object storage and pull those segment files directly out of it. So you kind of have this out of band [00:20:00] mechanism that doesn’t even touch your ingest pipeline.
All right. So I’ll talk really quickly about the wall object store location, which is probably a bit too much detail, but I think it calls out a couple of things there we’re pointing out. So here’s what the objects path looks like. At the top, you have a writer ID, which is a unique [inaudible 00:20:26] 32, the database name, wall, then [00:20:30] sub numbers three to different layout levels. So here’s an example, writer ID one, my database and the first segment. So what that means is we can only have a billion minus two segments. We might increase that. We might change that.
But the structure we’ve chosen, so one, we can do one list operation to get the databases and three list operations to get to whatever the last segment is. And the key thing about the way [00:21:00] this data is organized in object storage is that the writer ID is at the top. The semantics we have is that we wanted a single writer, many reader. So any reader can point to any spot in object storage and pull data out, but only one server can write to be given a writer ID. So for our data that eventually land in object storage, in Parquet files in a more indexed form, this will pass from the mutable buffer into chunks, [00:21:30] which then get persisted to objects storage.
So the structure there is that data is partitioned. As I mentioned before, the user can define how this partitioning works. Our default is obviously going to be based on time, say every two hours. Rows can only exist within a single partition and each partition has chunks of data. So those chunks are used to persist blocks of data within a partition. And they’re also meant to be [00:22:00] lots of data that are immutable. So closing a chunk and persisting, it can be triggered on size, time or by an explicit API call.
Chunks can have overlapping data. So the important property of partitions is that the data is not overlapping. So you don’t have to do things like de-duplication and all sorts of other stuff. While chunks can have that. And that’s because in some cases we need partitions to be able to accept writes long after we would have expected them to be closed. So [00:22:30] the idea is that a chunk is, sorry, immutable. But then if you accept new data, you can create a new chunk. And then over time, if you want, you can compact chunks together into new ones still.
So what that looks like in the object store, we have Parquet files, one file per table or per measurement. Within a chunk, we have the metadata file, which describes the tables, the columns and their types, and the summary [00:23:00] statistics for the data in those columns. And then we have the writer’s statistics, which keep information about where these writes came from in terms of what other IOx servers.
And then lastly, there’s a catalog file, which stores what partitions exist, what chunks. And this catalog file gets updated when chunks have persisted. So the catalog data, as I mentioned, consists of partitions [00:23:30] chunks, miss summary statistics, and the writer’s statistics for open chunks. So this catalog data is basically what the query planner uses to determine which partitions to query and within those which chunks to query. So the idea is we want to at query time eliminate as much of the data as possible, so we’re only brute forcing the smallest possible amount of data that we need to to answer the query.
So the catalog has some interesting properties [00:24:00] that gives us. One, we get schema information instantly, which was a problem in InfluxDB. We can do schema renames, which we couldn’t do previously in InfluxDB, not easily. And we can actually do point in time recovery. Any other server can point to the object store, we’ll look at the catalog for a specific writer and be able to recover the state from a previous catalog version, assuming you’re not removing the data from object storage.
So replication and federation, [00:24:30] basically what we wanted is total operator control over it. But one something that was basically shared nothing, individual IOx servers should be able to operate by themselves and only know what they’re told. So you don’t need a fully connected cluster. You can have IOx servers that are loosely connected. The write and the reads can be asymmetric. So you don’t have to write and read from the same servers. Obviously, their partitioning is something that the operator [00:25:00] can define. And there’s an API for doing all of these things. All this should be controllable over the network.
So there are flexible replication rules that allow the operator to find synchronous or asynchronous replication, push and pull, request by request in batch or bulk. You can partition to use servers or groups of servers. So this is an example of one possible cluster configuration. You have a server at the top that requests come into, [00:25:30] you have wall buffer. The query server is read from the wall buffer server, and then separately, you can have servers that are responsible for taking the data from object storage and rewriting it in a sorted indexed format in Parquet.
So one thing [inaudible 00:25:50] the example I just gave is a data center centric view. And it assumes that all your data is getting pushed up to some central location. But I think the future is [00:26:00] federated and operates at the edges as single nodes, as groups of nodes. And you’ll likely with time series have high precision data that doesn’t make it all the way up, that you want a process in place or push down towards the edge. With time series data, I think there’s really no limit to the amount of data that will be generated. So I wanted to have the flexibility to change the typologies. And this is an example of that at the edge
[00:26:30] IOx is meant to be a building block for more complex architectures. It’s an open-source data plane that can manage the data life cycle of data in and out of itself or to other servers and object storage. It’ll be able to query and combine data from many IOx servers based on configuration or per request. But how you control many IOx servers is actually up to you. Those are deliberate choice, which is [00:27:00] we wanted the control plane to be separate from the data plane. I think the control plane is highly dependent on your operating environment and what kind of problem you’re trying to solve. So we chose to separate it both for technical reasons, as well as business reasons.
So this gets back to the licensing choices. Given all the changes in open-source recently, I think it’s actually important to highlight how we plan to commercialize all of this. So we want to keep IOx permissively licensed [00:27:30] open-source, which means we need a plausible plan for how to create a business around it. So our business is basically an operating and management of many IOx servers. We’re building a control plane for our own cloud environment, which is obviously a closed source. So our cloud is where we’re going to be launching IOx. But our intention is to run the open-source bits exactly as is in our production environment, because we want any improvements in our production environment to immediately [00:28:00] result in improvements in the opensource.
So the other side of that is that the commercial offering is complimentary to the open-source offering. It’s not fork of it. It’s not like an open core kind of situation, which I think is a better way to create the whole thing. So I’ll close with a quick project status and then we’re good.
So it’s still very early on. We’re not producing builds or documentation yet. [00:28:30] The repo is up on GitHub. It’s totally open. People can look at it. Look and play with the code, file issues. The mutable buffer, wall buffer, and read buffer all exist. Though as individual pieces they’re currently being tied together. So the next bit is a bunch of plumbing to pull those pieces together, pull the database catalog and Parquet and recovery together. We’re going to be dogfooding this for our own usage and our own cloud environment where we monitor our production [00:29:00] systems with our own stuff. And then we’re going to be working on some more of the end-to-end functionality to pull all these pieces together.
My hope is that we start producing builds in late February. It’s not going to have everything that we’re trying to get to, but those builds will actually be enough for people to start testing it out, playing with it. At least the people who don’t want to actually dig in the code directly. And then we intend to launch an invite only alpha of it as [00:29:30] a product ourselves in our own cloud environment in Q2 of this year. That’s all I’ve got. Thank you.
Excellent, Paul, thank you so much for such a great talk. So we still have about two minutes for questions. So I’m going to go ahead and queue one of the participants here. I have Brian in the queue. Let’s see if that works. It didn’t. Let’s try Nelson, [00:30:00] Dominic, [inaudible 00:30:04] All right. So, okay. So yeah, no live questions from the Q&A feature. But I don’t know if you can see Paul there, some of the questions that came in through the chat at this point. So we really don’t have the many. There is one that came [00:30:30] very early in the presentation. I’ll read it to you just in case we want to address it. He says, what happens or what can be done in case of clock skew on some of the data sources? Is this possible to correct or easy or fast to correct?
Paul: What can be done in case of what?
Clock skew on some of the data sources. Again, it came very early in the talk. So I don’t know if it was related to something that you may have shown at the beginning.
Paul: [00:31:00] I’m not quite sure I get the question [inaudible 00:31:06]
Okay. We can skip it. There’s another one that came also early in the presentation. It reads, how do you do the compression?
So the compression is basically… I mean, so obviously on the file, its Parquet compression. On [inaudible 00:31:24] memory, it’s a bunch of dictionary coding with RLE. We’re not compressing [00:31:30] the actual, like if you’re going to enter flow values at the moment, because that just isn’t the biggest source of data in our use cases. So we’re looking at… We may be doing other things like… We can do RLE on timestamps. We can do RLE on the other ones. We’re trying to get as far as we can with those things. But basically, the thing is with Apache Arrow, there’s a requirement that access [00:32:00] to any individual element you need constant time lookup, which basically means a lot of compression techniques don’t work with it. We don’t have that requirement for our read buffer section. We actually want to get good compression, but there’s a trade off between achieving good compression and achieving query performance, so.
Excellent. And I’m going to put one more in before we go. And those IOx scale horizontally for both writes and reads.
Yeah. So [00:32:30] basically you can partition the data any way you like. So your partitioning scheme determines how it scales writes and for reads again, they’re partitioning, but also if you just want to spin up read replicas, if the design is such that you’ll be able to do so just by pointing it at object storage and really being able to answer queries.
Excellent. Well, Paul I think that is all the time we have today. Thank you so much for such a great presentation and all the information that you shared. [00:33:00] To the rest of the audience, thank you for being here today. I hope everybody’s having a wonderful day and continue enjoying Subsurface 2021. And we’ll continue to see you out there. Thank you so much. Bye-Bye.
Paul: Thank you.