March 2, 2023

10:10 am - 10:40 am PST

Cell Encryption with Apache Parquet

While Parquet column encryption is broadly adopted by the industry, there are use cases that require finer grained control than column encryption, including varying privacy, access control, and retention policies at the cell level. Cell encryption for Apache Parquet is designed to give organizations more granular access control.

In this talk, we will share the challenges, the design, the involvement of the open source community, and the progress towards adding a powerful tool for applying arbitrary policies on cells. We will deep dive into this new feature and how it works under the hood. We will also present the performance and space overhead, and how we implemented masking semantics to enable crypto-shredding of cell-encrypted data.

Topics Covered

Open Source

Sign up to watch all Subsurface 2023 sessions


Note: This transcript was created using speech recognition software. It may contain errors.

Xinli Shang:

Hello, everybody. My name is Xinli. My colleague Pavi will join me today. We will talk about cell-level encryption in Apache Parquet. Pavi is a software engineer at Uber Data. Previously he worked on table- and column-level access control in Parquet, and he recently moved to work on cell encryption. I also work in Uber data infra, and I’m leading the Apache Parquet community. Next slide, please.

In today’s talk we’ll cover a couple of things. First, we’ll briefly talk about Apache Parquet. Then we’ll have an introduction to cell-level encryption: why we do it, and what the design challenges are. Then we’ll talk about several potential solutions to these challenges and their design approaches. After that we’ll cover the current status in open source, and the benchmarking and performance results. And then we will open up for Q&A. Next slide, please.

Let’s start with big data storage file formats. In the big data world, there are basically two categories of file format: column-oriented and row-oriented. Apache Parquet and Apache ORC are the two major column-oriented file formats. On the row-oriented side, there are a few, like Apache Avro, JSON, and CSV. Today we are focusing on Apache Parquet.

Okay, let’s go to this directly. This just gives you a very rough idea of how column-oriented storage compares with row-oriented. Suppose you have a sample table. Very, very simple: only three columns, A, B, and C, and three rows. The row-oriented format stores the data on disk, or sends it over the network, by serializing it row by row. In this case, we see A1, B1, C1, A2, B2, C2, A3, B3, C3. In the column-oriented format, the data is serialized column by column, in this case A1, A2, A3, B1, B2, B3, C1, C2, C3. Next slide, please.

This is the Apache Parquet file structure. Each file has one or more row groups and a file footer. Each row group is divided into column chunks. In the previous example we saw three columns, so in this case we have three column chunks. Each column chunk is further divided into pages, which are the smaller unit. The page is the unit for encoding, compression, and encryption. Next slide, please.
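As a rough illustration of the two layouts described above, here is a minimal Python sketch. It uses the three-column, three-row sample table from the talk; the function names are made up for this example and are not part of any Parquet API.

```python
# Sample table from the talk: three columns (A, B, C) and three rows.
rows = [
    ("A1", "B1", "C1"),
    ("A2", "B2", "C2"),
    ("A3", "B3", "C3"),
]


def serialize_row_oriented(table):
    """Emit values one row at a time (row-oriented layout)."""
    return [value for row in table for value in row]


def serialize_column_oriented(table):
    """Emit values one column at a time (column-oriented layout)."""
    # zip(*table) transposes the rows into columns.
    return [value for column in zip(*table) for value in column]


row_layout = serialize_row_oriented(rows)
column_layout = serialize_column_oriented(rows)
print(row_layout)
print(column_layout)
```

In the column-oriented layout all values of one column sit next to each other, which is why a column chunk (and the pages inside it) is a natural unit for encoding, compression, and encryption.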

Yeah. Now I will hand over to Pavi to talk about cell-level encryption.

Pavi Subenderan:

Yeah. So for this talk we’re mainly talking about cell encryption, but before we get there, I want to give everyone some motivation on how we got to cell-level encryption. The thing that we’ve been working on for the past few years, and that has slowly gained more and more adoption, is column encryption in Apache Parquet. This was a feature released in Parquet 1.12, and it basically allowed us to encrypt columns, and what we sometimes call modules, independently with separate keys. Essentially this lets us control access to columns through access to the encryption keys, and it also lets us do other things like data retention, encryption at rest, and deletion, to finely control the columns in our data set. We talked a lot about this previous work in our “One Stone, Three Birds” blog post on the Uber Engineering blog.

But yeah, that’s where we came from, and now I’m going to go more into cell encryption. We came from column encryption, but we found that sometimes we wanted even finer-grained encryption than just being able to encrypt and control access to the columns of your data set. This is a somewhat contrived example to provide the motivation. What we found is that you might have tables with mixed data from different regions, and each region can have different requirements in terms of PII, non-PII, retention, et cetera. Take this example table down here: we have some rows that are from country A, some from country B, country C, country D, and then we have different kinds of data.

We can have location data like latitudes and longitudes, email data, and of course a bunch of non-PII data as well. We’ve often had requirements where we had to satisfy the policies of different countries individually. So we might have a situation where country A requires that location data be encrypted and deleted after X days, but country B requires that the data be deleted after Y days. Some countries might have policies that email data should be restricted to a very specific subgroup of users, and other countries might have other sets of users who should be able to read the same kind of data. These kinds of requirements needed us to be more fine-grained than the blunt tool that was column encryption. We can’t just say that all email addresses should be readable by X group of users. We have to say that emails in these rows, or these records, can be read by these users, and emails in other records can be read by other users. Deletion is similar: we can permanently remove access to the data, essentially deleting it through crypto deletion, by deleting the appropriate keys.
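The per-record policy and the crypto deletion idea described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the `POLICY` table, the key ids, the in-memory `key_store`, and the choice to return `None` for a cell whose key is gone. The cipher is a toy keystream built from SHA-256 so the sketch runs with the standard library only; it is not secure and only stands in for a real cipher such as AES.

```python
import hashlib
import os


def _keystream(key, nonce, length):
    # Toy keystream (SHA-256 in counter mode). NOT secure; a stand-in for AES.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key, plaintext):
    nonce = os.urandom(12)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def toy_decrypt(key, blob):
    nonce, body = blob[:12], blob[12:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


# Hypothetical policy: (country, data class) -> key id.
POLICY = {("A", "email"): "key-A-email", ("B", "email"): "key-B-email"}
key_store = {key_id: os.urandom(32) for key_id in POLICY.values()}


def encrypt_cell(country, data_class, value):
    """Return (key_id, stored_bytes); key_id is None for unencrypted cells."""
    key_id = POLICY.get((country, data_class))
    if key_id is None:
        return None, value.encode()
    return key_id, toy_encrypt(key_store[key_id], value.encode())


def read_cell(key_id, stored):
    """Return the plaintext, or None (masked) when the key no longer exists."""
    if key_id is None:
        return stored.decode()
    key = key_store.get(key_id)
    if key is None:
        return None
    return toy_decrypt(key, stored).decode()


cell_a = encrypt_cell("A", "email", "a@example.com")
cell_b = encrypt_cell("B", "email", "b@example.com")
assert read_cell(*cell_a) == "a@example.com"

# Crypto deletion: dropping country A's key makes only its cells unreadable.
del key_store["key-A-email"]
```

After the key for country A is deleted, the stored bytes are still present but can no longer be decrypted, while country B’s cells in the same column remain readable.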

Yeah, so hopefully that gave you some motivation for why we want something finer-grained than column encryption. Now I’ll talk a little bit about the technical challenges in implementing cell encryption and the approaches we went through. First of all, as Xinli mentioned, Parquet is a columnar storage format. That means column encryption is natural to Parquet: Parquet is oriented towards columns, so encrypting columns is congruent with the design. Field- and record-level encryption, where we go to the individual values or records within columns and across column chunks, is more incongruent with the design of a column-oriented storage format. Another challenge is that key metadata and algorithm info need to be stored somewhere, and if we are encrypting at the record level, there’s a lot more key metadata and algorithm info that could potentially need to be stored. There are also challenges from the encryption side. Encryption works best as a block operation, when we have huge chunks of values that we want to encrypt, so encrypting individual values for cell encryption has its own challenges. Encryption also might not apply as well to some data types, for example the Boolean data type. These were also challenges we had to consider.
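A small back-of-the-envelope calculation shows why encrypting individual values is costly compared with encrypting large blocks. The talk does not give exact figures; this sketch assumes an AES-GCM style scheme with a 12-byte nonce and a 16-byte authentication tag per ciphertext, and a hypothetical 1 MiB page.

```python
# Assumed per-ciphertext overhead for an AES-GCM style scheme (illustrative).
GCM_NONCE_BYTES = 12
GCM_TAG_BYTES = 16


def gcm_overhead_ratio(plaintext_bytes):
    """Fixed per-ciphertext overhead as a fraction of the plaintext size."""
    return (GCM_NONCE_BYTES + GCM_TAG_BYTES) / plaintext_bytes


per_double = gcm_overhead_ratio(8)          # one 8-byte double per ciphertext
per_page = gcm_overhead_ratio(1024 * 1024)  # one hypothetical 1 MiB page

print(f"per 8-byte value: {per_double:.1%} overhead")
print(f"per 1 MiB page:   {per_page:.4%} overhead")
```

Under these assumptions the fixed cost is several times the size of a single small value, but negligible when amortized over a whole page.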

Yep. So basically there are three technical approaches we considered for the design and implementation of cell encryption. The first approach is FPE, format-preserving encryption, which is in-place encryption. The second is column splitting and then reusing column encryption. The third is adding a string column and then doing record-level encryption. I’ll go over each of these approaches, the pros and cons of each, and what we ended up going with.

Okay, so solution one, or option one, is format-preserving in-place encryption. FPE, which is format-preserving encryption, lets you encrypt data while preserving the original data type. You can think of it like this: you encrypt a double, and the encrypted double can still be stored as a double. Similarly string to string, which is an easier case, and potentially Boolean to Boolean, which is more difficult. Basically every data type needs to encrypt into its own data type. The reason we might want this is that it allows encryption to be done in place, where the plaintext data and the encrypted data can be stored in the same cell. So within a column, the original values and the encrypted values can be stored within the same column. This is the idea behind this approach.

Here we have some pros and cons. The main advantage of in-place encryption is that we don’t need extra overhead to store the encrypted data. We don’t need some extra place, some hidden column, anything like that. But there are some significant cons. One is that we need some way to keep track of which cells within the column are encrypted and which are not, because if encrypted cells just look like doubles, we might not be able to tell the difference between the encrypted ones and the original ones. This kind of tracking requires a specification change to Parquet, which can impact multiple versions, and we saw that as a pretty big con. There were also some ongoing concerns we found with popular FPE algorithms. For example, FF2 and FF3, which are some common ones, are not considered cryptographically secure.
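To make the format-preserving idea concrete, here is a toy sketch that maps a 64-bit integer to another 64-bit integer with a small Feistel network. This is not FF1, FF2, or FF3 and is not secure; the function names and the round count are invented for this example. It only illustrates the two points above: the encrypted value has the same type as the original, and for exactly that reason nothing in the value itself says whether it is encrypted.

```python
import hashlib
import hmac

MASK32 = 0xFFFFFFFF


def _round(key, round_no, half):
    # Round function: keyed hash of the round number and one 32-bit half.
    msg = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def toy_fpe_encrypt(key, value, rounds=8):
    """Map an unsigned 64-bit int to another unsigned 64-bit int."""
    left, right = (value >> 32) & MASK32, value & MASK32
    for i in range(rounds):
        left, right = right, left ^ _round(key, i, right)
    return (left << 32) | right


def toy_fpe_decrypt(key, value, rounds=8):
    """Invert toy_fpe_encrypt by running the rounds backwards."""
    left, right = (value >> 32) & MASK32, value & MASK32
    for i in reversed(range(rounds)):
        left, right = right ^ _round(key, i, left), left
    return (left << 32) | right


key = b"demo-key-for-illustration-only"
encrypted = toy_fpe_encrypt(key, 42)
# The ciphertext is still a plain 64-bit integer, so it can live in the
# same column as unencrypted values and looks like any other value there.
print(encrypted, toy_fpe_decrypt(key, encrypted))
```

Because the output is just another integer, a reader needs some separate record of which cells were encrypted, which is the tracking problem mentioned above.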

Yeah. So solution two is basically using column splitting and then reusing column encryption, which we talked about earlier, to implement cell encryption. The idea behind solution two is that we create hidden columns, which is a concept from column encryption, to store the cell-encrypted data. In this example here we have a cell-encrypted table where the original column is column name, which has some values that aren’t cell encrypted, like three and a hundred. We create a hidden column for some of the values, like five and two, so we can move the values we want to cell encrypt with key one into this hidden column. Then, if we have another key that we want to use to encrypt only the value eight, we can move the data from the last record into another hidden column and encrypt it with key two.

So the idea is that for each encryption key we want to use for cell encryption, you create a hidden column and store those cells in that hidden column. Of course there is some overhead on the writing and reading side. On the writer’s side, we have to split the writes into these hidden columns appropriately, based on the policies for which records we want to encrypt. Then we have to do the merging back inside Parquet as well: when you read, you should read the original column as one unified view, where all of these hidden columns are merged back together into one logical column of sorts. For the encryption itself we reuse column encryption, where the entire hidden column is encrypted using column encryption. So we are able to achieve cell encryption while reusing the concepts of column encryption.
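The split-on-write and merge-on-read steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, `None` is used as the placeholder for "no value in this column at this row" (so genuine nulls are not handled), the row order of the example values is made up, and the actual encryption of each hidden column, which the talk says reuses Parquet column encryption, is left out.

```python
def split_column(values, key_ids):
    """Split one logical column into a base column plus one hidden column per key.

    key_ids[i] is the key id for row i, or None if the cell stays unencrypted.
    """
    n = len(values)
    base = [None] * n
    hidden = {}
    for i, (value, key_id) in enumerate(zip(values, key_ids)):
        if key_id is None:
            base[i] = value
        else:
            hidden.setdefault(key_id, [None] * n)[i] = value
    return base, hidden


def merge_columns(base, hidden, available_keys):
    """Rebuild the unified view; cells whose key is unavailable stay None."""
    merged = list(base)
    for key_id, column in hidden.items():
        if key_id not in available_keys:
            continue  # the reader cannot decrypt this hidden column
        for i, value in enumerate(column):
            if value is not None:
                merged[i] = value
    return merged


# Values from the talk's example: 3 and 100 unencrypted, 5 and 2 under
# key one, and 8 in the last record under key two.
values = [3, 5, 2, 100, 8]
key_ids = [None, "key1", "key1", None, "key2"]

base, hidden = split_column(values, key_ids)
print(base)
print(hidden)
print(merge_columns(base, hidden, {"key1", "key2"}))
print(merge_columns(base, hidden, {"key1"}))
```

A reader holding both keys sees the original column, while a reader holding only key one sees the same column with the key-two cell missing.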

Yeah. So the pro of this one is that we get to reuse what we already have in column encryption, which already uses AES, and this is a time-tested encryption library. We already know that column encryption is adopted: at Uber it’s widely adopted, and it’s adopted across the industry. We consider it more stable and mature now. And this wouldn’t really require a spec change for Parquet, unlike option one. On the con side, there’s an overhead of splitting and merging back, which we didn’t have in option one. It’s also challenging to track filters, like dictionary and statistics, across columns. This is a con, but it’s still doable. Okay.

And then option three is another thing we considered, which is very similar to column splitting, except instead of adding N hidden columns, one for each key, we add one physical string column. This string column will be used to store the encrypted cells. The encrypted string can also contain the key metadata, algorithm info, et cetera, as well as the ciphertext. When you write, you split into this string column, and when you read, you merge this string column back. So at most you only have one extra column to add. This column has a relationship with the original column, where cell-encrypted data moves in there. And no matter how many cell encryption keys you want to use, the data can all get moved into this one column.

So the big pro of this is that we don’t scale the number of columns with the number of cell encryption keys, as opposed to approach two. But we found a bunch of cons. The string column has more overhead, because we’re basically doing data type conversion to store things in the string column. This new string column needs to be present in the schema, unlike option two, where we’re using hidden columns. And each record needs to carry the key metadata and encryption algorithm. In option two that can be stored in the column metadata, but here each cell needs to carry its own key metadata and encryption info, because each cell might be encrypted with a different key. This adds a lot of space overhead. Lastly, it hurts our compression ratio and encoding efficiency, because we’re encrypting individual values, which makes them harder to encode and compress well afterwards.

So our current status is that we are recommending column splitting, approach two, which we went over earlier, to the community. We have a ticket open and a design doc that we’ve written. Internally we’ve rolled this out to production, and we’re planning to open a PR shortly. We’re pretty happy with the results so far.
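To make the per-cell overhead of option three concrete, here is a rough sketch of what a self-describing encrypted cell in a string column might look like. The envelope layout (JSON with `kid`, `alg`, `nonce`, and `ct` fields, base64-encoded) is purely an assumption for illustration, not the format considered in the design, and the ciphertext is a zero-filled placeholder of realistic length rather than real encrypted data.

```python
import base64
import json
import struct


def pack_cell(key_id, algorithm, nonce, ciphertext):
    """Serialize key metadata, algorithm info, and ciphertext into one string."""
    envelope = {
        "kid": key_id,
        "alg": algorithm,
        "nonce": base64.b64encode(nonce).decode("ascii"),
        "ct": base64.b64encode(ciphertext).decode("ascii"),
    }
    return json.dumps(envelope, separators=(",", ":"))


def unpack_cell(cell):
    """Parse a packed cell back into (key_id, algorithm, nonce, ciphertext)."""
    envelope = json.loads(cell)
    return (
        envelope["kid"],
        envelope["alg"],
        base64.b64decode(envelope["nonce"]),
        base64.b64decode(envelope["ct"]),
    )


plain = struct.pack("<d", 37.7749)        # an 8-byte double, e.g. a latitude
fake_ciphertext = bytes(len(plain) + 16)  # placeholder: payload plus a 16-byte tag
cell = pack_cell("key1", "AES_GCM_V1", bytes(12), fake_ciphertext)

print(len(plain), "bytes of plaintext ->", len(cell), "characters stored")
print(cell)
```

Every cell repeats its own key id and algorithm name, and the type conversion to a string adds further bytes, which is the space overhead described above. In option two the same information can live once in the column metadata.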

Yeah. So we also have a bunch of benchmarks from our implementation. I’ll go over the actual graphs in the next few slides, but essentially we found that some space overhead and time overhead of course come from cell encryption. The hidden columns add some size, and adding these hidden columns sometimes causes the order of the data in the original column to change, which can result in some size increase. In terms of time overhead, there’s added time to split the writes: when you’re writing cell-encrypted data, you’re splitting across the different hidden columns and the original column. Merging them back during reads also adds time overhead. And the increased space overhead from worse encoding and compression can also add to the processing time.

Okay, so here’s some raw data we have on the space overhead. We did some tests using these specifications: one long column and four string columns, with one cell-encrypted column. We increased the percentage of the data that is cell encrypted and tracked the file size overhead. Here are the results we found with different compression algorithms. We find that with a low percentage or a high percentage of cell-encrypted data, we have pretty low overhead. But when there’s a big mix and the columns are getting split a lot, there can be quite substantial overhead. This is with sorted data. On random data, the size overhead is pretty uniform and not too bad.

Yeah. And very similarly with time overhead, in the same exact scenario with the same specs, one long column and four string columns, we find that the time overhead scales up with the percentage of cells that are cell encrypted. But as the percentage gets higher and higher, it trends back down again, very similar to space overhead. It tracks with space overhead, actually: as space overhead increases, time overhead increases, and as space overhead decreases, time overhead comes back down. And with random data, the results are much better and more uniform as well.

Okay. One thing we quickly wanted to go over is that when we release these kinds of features, like cell encryption and column encryption, it’s important to be mindful of backward and forward compatibility. For backward compatibility, the newer version, which will have this implementation, is still able to read data written by older versions. So of course we maintain backward compatibility, and no change in behavior is expected. Forward compatibility means that an older version of Parquet can read data written by newer versions. There’s no spec change, so the older version can still read files from the newer version, but it will not be able to interpret the cell-encrypted data or the column-encrypted data. This is very similar to column encryption. So if an old version of Parquet is reading the new data and there’s no cell-encrypted column involved, there’s no change in behavior and you’ll read as normal. But if you attempt to read cell-encrypted data, you’ll get an exception, similar to trying to read column-encrypted data in an older version, and the encrypted data is not readable.

And then, yeah, to summarize, what we want you to take away from this is an introduction to cell-level encryption: why we want to do it, what the motivation behind it was, and the evolution from column encryption; some of the approaches we considered, and the approach that we’re recommending to the community; and then the benchmarking results, compatibility results, and the current status.