Subsurface LIVE Summer 2020
The Future of Intelligent Storage in Big Data
Learn about the challenges and motivations behind building Apache Iceberg (incubating) as the next generation of big data analytical storage. Then share in the current state and roadmap for production deployments. Finally, explore the future of automated storage optimization and compute-based enhancements built on machine learning algorithms at Netflix.
Daniel Weeks, Big Data Compute Team Lead, Netflix
Daniel Weeks leads the Big Data Compute team at Netflix which is focused on enhancing the state of open source distributed data processing.
Welcome everybody. My name is Dan Weeks and I lead the big data compute team at Netflix. Today, I'm here to talk to you about the future of intelligent storage in big data. Netflix has actually been operating a cloud-based data platform for around a decade, and what I want to do today is share some of the evolution of that platform and the insights that we've gained along with some of the recent decisions and changes we've made in the direction we're going.
First, I want to take a moment to talk about data at Netflix. Netflix is actually a very data-driven company. We use data to influence decisions around the business, around the product, content, increasingly studio and productions, as well as many internal efforts, including A/B testing and experimentation, as well as the actual infrastructure that supports the platform. So a quick overview of what we're going to be covering today is the evolution of our data warehouse at Netflix, specifically targeting things at the storage layer that impacted some of the decisions that we've recently made about the direction we're taking the platform, which will take us into how we leverage Iceberg at Netflix. Finally, I have a few topics that I think individuals would be interested in that we're working on currently, and then I'll touch on just a few key takeaways from this presentation.
So around a decade ago, when Netflix started migrating from the data center to the cloud, we started to formulate some architectural principles around how we want to approach operating in the cloud. These are very specific to the data platform in this case, but they really drive some of the thinking and approach that we took towards migrating to a cloud-based data platform. The first of these is actually separating compute and storage. Now, 10 years ago, this was kind of a radical idea because big data at the time was all about bringing your compute to your data, gaining efficiency through being able to scale out horizontally and reducing how much data is passed around.
So I fielded a lot of questions about how we can possibly do this. Doesn't this just kill our performance? But ultimately, it came down to a series of trade-offs versus benefits that we get from operating in the cloud, and ultimately, we really built our data platform around this concept. Today, this really is a well accepted practice, not just in the cloud, but also in data centers where companies are disaggregating their storage and compute because the life cycles of those products change independently. You may want to refresh your compute resources much faster than you refresh your storage. So this is well accepted now, but it actually was fundamental to how we started out our platform.
Another principle is building our data warehouse as a single source of truth. Now we do this on top of AWS S3, but ultimately this solves a lot of problems around data being siloed within disparate systems. We wanted to bring everything together so we could leverage data across lots of different areas, and this really became fundamental to how we structured our entire platform. So rather than having to worry about consistency when moving data around to where it needed to be, we have a single place where it lives.
Finally, we have this concept of cloud native. It would be really easy to consider taking a data center type deployment and just moving it into a cloud-based offering and deploying it exactly as you would in a data center. But that really doesn't take advantage of many of the features in the cloud, and what we did is we really wanted to think about our approach in terms of what we can leverage from the cloud, expanding upon the elasticity and many of the services that are provided as part of those offerings to enhance our data platform.
So what I want to do now is kind of walk through some of the evolution of how we did data warehousing, and this will really kind of set us up for some of the decisions that we made recently about approaches that we're going with in the future. So the first thing of course is when you move to the cloud, you need a way to operate with your underlying data, and in the Hadoop ecosystem, there's a number of ways to do this on top of S3. They started very early on with the S3n file system implementation, which was later replaced by the S3a file system. EMR had a file system that we used for quite a while. We helped build out the Presto S3 file system, and ultimately we ended up building a file system that took features of all of these things to strike the right balance between the contract behavior of the Hadoop and Spark ecosystem and the efficiency that we need to operate a really large data warehouse on top of S3.
With that, you can kind of start to operate in a cloud-based environment with a modern data platform. At the same time, there's lots of things that underpin that that make this a little bit difficult because we're relying in many cases on file system contract behavior like rename operations in order to achieve functionality. So if we need to be able to commit data by doing a file system rename and S3 doesn't have a rename operation, it's actually a copy and delete or something like that, this impacts both the atomicity of those operations as well as the performance. So what we did is we actually looked into how we can modify the Hive metastore in order to achieve the same functionality, and we were able to create atomic operations by swapping partitions.
So we always write data out to new locations in S3, and then we swap the location in an atomic operation. So this gives us some ability to actually operate in the cloud with an S3 data warehouse and have similar functionality to what you would get out of HDFS. But then there's always the question about the consistency. S3 is an eventually consistent file system, and while they do provide read after write consistency, that's pretty narrow. In fact, you can negatively cache those objects and so they may not show up in either a listing or a get operation, and listing is actually not consistent. So in the read path, when you're doing any sort of split calculation or planning a job, you're susceptible to issues with consistency.
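As an illustration of the write-then-swap pattern described above, here is a minimal sketch in Python. It is not the Netflix implementation: `PartitionCatalog`, `commit_partition`, the in-memory `store` dictionary, and the `s3://warehouse/...` paths are all hypothetical stand-ins for a metastore and an object store, used only to show that data lands in a fresh location first and becomes visible through a single pointer swap.

```python
import threading
import uuid


class PartitionCatalog:
    """Hypothetical in-memory stand-in for a metastore: partition -> location."""

    def __init__(self):
        self._locations = {}
        self._lock = threading.Lock()

    def location(self, partition):
        with self._lock:
            return self._locations.get(partition)

    def swap(self, partition, expected, new_location):
        """Point the partition at new_location only if it still points at expected."""
        with self._lock:
            if self._locations.get(partition) != expected:
                return False  # someone else committed first
            self._locations[partition] = new_location
            return True


def commit_partition(catalog, store, partition, rows):
    """Write rows to a brand-new location, then swap the partition pointer."""
    current = catalog.location(partition)
    # Hypothetical path layout; existing objects are never modified in place.
    new_location = f"s3://warehouse/{partition}/batch-{uuid.uuid4().hex}"
    store[new_location] = list(rows)
    return catalog.swap(partition, current, new_location)
```

Readers that resolve the partition through the catalog see either the old location or the new one, never a half-written directory. The sketch also shows the cost of this design: each commit replaces the whole partition rather than adding to it.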
So of course, we saw this as an engineering challenge and we said, "Hey, we can build something to provide a consistent index on top of that inconsistent file system." This is where we built a system called S3mper. It was kind of a precursor to the EMR consistent file system, as well as more recently S3Guard. So with that kind of handled, we were able to move on to other challenges.
Another one is actual commit behaviors. So when we made those choices around how we were going to atomically update our datasets, we made some trade-offs, one of those actually kind of surfaced to the users in a strange way in that now you can't insert data into a dataset. You can't actually append to an existing partition because we've just said we're going to atomically swap those. So insert operations actually become overwrite operations, and we have to kind of train our users to figure out how to effectively use this within our data warehouse.
But what we were able to do is we were actually able to build new S3 based committers that leveraged some more cloud specific functionality like S3 multi-part upload, and this allows us to use the multi-part upload commit to kind of behave as an INSERT INTO operation. Unfortunately, there are still some challenges there because if you have lots of multi-part uploads that you have to do, there's still a window where things could go wrong and you could get partial data. So users weren't particularly confident with that approach.
Next, we have what is a very fundamental problem with a lot of the ways that we treat datasets and that is the DDL semantics or schema evolution. This directly ties to the underlying file format that's being used. So for example, if you have a CSV file, it's fine to rename columns, but you can't reorder columns, you can't insert columns into the middle of them, you can't drop columns. There's lots of kind of behaviors that are directly associated with how the data is stored. When you get to more advanced file formats like Parquet or ORC, this gets even more complicated because it really depends on a combination of how the engine is configured in many cases.
So you can do name-based resolution and that's going to behave differently than ordinal or index-based resolution, but engines often didn't have a toggle to change between ordinal resolution of columns and name-based. So you can get inconsistencies across engines and across file types, and worst case, you have mixed tables that have partitions in different file formats and then it's really unpredictable what those behaviors are going to be when you perform some sort of schema evolution operation. So within our platform, we said, "Okay, let's consolidate down to a set that works consistently across all of these variants and then kind of restrict what users can actually do." So we allow for only ordinal-based resolution and users just kind of get used to the operations that they can perform and they can't perform.
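To make the difference between the two resolution modes concrete, here is a small sketch in Python. The function names and the sample columns are hypothetical and not taken from any engine; the point is only that the same file row can come back differently depending on whether columns are matched by position or by name.

```python
def resolve_by_position(table_columns, row):
    """Ordinal resolution: the i-th table column reads the i-th value in the file."""
    return {
        name: (row[i] if i < len(row) else None)
        for i, name in enumerate(table_columns)
    }


def resolve_by_name(table_columns, file_columns, row):
    """Name-based resolution: each table column reads the file column with the same name."""
    by_name = dict(zip(file_columns, row))
    return {name: by_name.get(name) for name in table_columns}


# A file written when the table schema was (id, title).
file_columns = ["id", "title"]
row = (1, "Dark")

# Rename title -> name: safe with ordinal resolution, loses data with name-based.
renamed = ["id", "name"]
# Reorder to (title, id): safe with name-based resolution, wrong with ordinal.
reordered = ["title", "id"]
```

With these inputs, `resolve_by_position(renamed, row)` still finds the value under `name`, while `resolve_by_name(renamed, file_columns, row)` returns `None` for it. The reorder case flips: name-based resolution is correct and ordinal resolution silently swaps the two values.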
The next thing we had to do is scale our metadata. All this while, as we're building out the platform, our data warehouse is growing from petabytes to tens of petabytes, to hundreds of petabytes, and the tables within the warehouse are also growing across multiple dimensions. So you can actually shard the data warehouse if you have lots of tables, but within a table, if you have many, many partitions and some of our largest datasets were getting into the millions and tens of millions of partitions, this has a real impact on how people can actually use those datasets. You need to make sure that you craft your filter statements in a way that it can be effectively pushed down.
If not, you may not be able to handle all the partitions on a client and that ultimately restricts how users can interact with data and it becomes somewhat complicated and you need to know these things when using these datasets. So we went through and we optimized a number of areas of our metadata service. We did some amount of sharding to separate some of the concerns and offload some of the impact to different areas. So this was part of growing the data warehouse, but it still became a key point because in some dimensions it's really difficult to scale.
Next, over time, we started to see that data is becoming much more applicable across lots of different areas. It's not just analytics within a big data system. You want to use it in production systems, you're using it for all sorts of reporting and machine learning and areas that may not have the traditional Hadoop Java stack. So figuring out how to interact with these datasets, taking the metadata information corresponding with the actual physical data, getting access to it, and then reconstructing something in a way that's going to look like it does in an actual processing engine is quite difficult and error prone, and it means that you might end up with different variants that produce different results in how people actually interact with the data.
Finally, maintenance. Because of all these changes, all of the specialization, this creates an environment that is really hard to maintain and keep up to date. Anytime that we want to integrate a new system, a new processing engine, a new file format, there's a lot of complexity involved, and what this does is it really slows down our team. We want to be working on new features and new innovation, but if we have to re-base these changes and make sure everything is behaving consistently and scaling across lots of different systems, it gets really complicated.
So at some point we really started to question whether we were headed down the right road, like is this the right approach? Should we be continuing to kind of take the existing infrastructure and overlay features in order to get the behavior that we want out of it? It really highlighted some things that we were doing for a long time. We, as an infrastructure group, continue to expose complexity and implementation all the way up to the user. The DDL example that I gave, where renaming or adding or dropping columns, any sort of schema evolution operation, is a good example. Another one is actually how we do physical partitioning. We actually expose the data layout all the way up through the schema.
If you have a fact-based dataset or an event based dataset, typically you'll have columns like a date and an hour and those are usually integers, and then you have a timestamp field, and this is really difficult for users to work with because, one, you need to know if that timestamp actually correlates to those partition fields, and when you craft queries and filter statements, you need to set up the proper ranges to cover the date and hour partitioning, as well as your timestamp values, and this gets really complicated because it's just a manifestation of the actual physical layout all the way up to the user.
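The bookkeeping described above can be sketched as follows. This is a hypothetical helper, not code from the talk: it assumes integer partition columns shaped like a `dateint` (YYYYMMDD) and an `hour`, and shows the set of partition values a user has to cover by hand for a given timestamp range.

```python
from datetime import datetime, timedelta, timezone


def partitions_for_range(start, end):
    """Return the (dateint, hour) partitions that cover the half-open range [start, end)."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    partitions = []
    while cursor < end:
        partitions.append((int(cursor.strftime("%Y%m%d")), cursor.hour))
        cursor += timedelta(hours=1)
    return partitions


# A range that crosses midnight touches partitions on two different dates.
start = datetime(2020, 7, 1, 22, 30, tzinfo=timezone.utc)
end = datetime(2020, 7, 2, 1, 0, tzinfo=timezone.utc)
covered = partitions_for_range(start, end)
```

A query that filters only on the timestamp column gets no partition pruning in this layout, and a query that filters only on date and hour can return rows outside the intended range, so the user ends up writing both predicates and keeping them in sync.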
Another thing that we do is we continue to force the adaptation of technologies, and a good example of this is what we've been doing with the file system. We continue to try and push the file system to behave like HDFS, because that's what a lot of these systems were built around initially. But fundamentally, it's a very different system, and so by adding all of this complexity on top of it, we make it much harder and we're not really leveraging it in that native way that we originally intended to do as we move to the cloud.
The next thing is of course incurring these high maintenance costs. We're adding all of this functionality and it's still coming back to our team in terms of keeping it up to date and maintaining it across new versions of our processing engines, new versions of our platform, as well as now integrating all of these other systems around it. So at this point we had a very serious discussion around whether we should invest in rethinking this entire storage layer and how we interact with it. It was a very hard decision. I think around three years ago, we were at a pivotal point where we said, "Hey, should we just invest heavily in making a product that'll actually fix these issues? Or should we continue doing what we're already doing?"
Ultimately, what we came out with was we felt it was necessary and a strategic investment for us to actually come up with a solution to address these issues, and that's where Iceberg came from. So Iceberg is an Apache project, and I'm not going to go into a lot of detail about the specific features. There are a number of talks later today that are going to go into much more detail about how these things actually work, and there's great documentation on the Apache Iceberg website. So if you're interested in those things, please go take a look at them.
But what is Iceberg? Iceberg is actually an open table format for huge analytic datasets, and what we're really trying to do fundamentally is address a lot of these concerns that have been coming up over time. Another thing is that it's an open community standard with a specification to ensure compatibility across languages and implementations. So this is something that really addresses some of those concerns about unexpected behaviors, where you're not quite clear how you build the data out of the metadata and the physical data. This is really important.
So as we had architectural principles when we went to the cloud, I want to extract a few storage principles. This isn't everything, but it does touch on some of the key points. One of them is we want to separate user and infrastructure concerns. Users are going to be most effective when they're actually working with the data and not fighting the infrastructure. So a data science engineer or a data engineer, they don't want to spend their time fighting infrastructure problems. They want to be extracting value out of data and that's what we want them to be doing as well. At the same time, the infrastructure teams want to be optimizing and improving the actual infrastructure and not worrying about how that might impact users. So we want to draw a nice line between these two.
Next, we want to have strong contracts for data and behaviors. So this really means a strict understanding of what schema means, how you can evolve it, the actual data types that are supported. Once these things are clear, you can apply them across many different systems in many different environments consistently. Lastly, we don't want surprises or side effects. We don't want things to bubble up to users in the way that we've seen with the existing systems. We want things to behave as you would expect them to, and this is really key in order for people to have confidence in your data warehouse and trust the results they're getting out of it.
So now what I'd like to do is take you through Netflix's infrastructure and how we're rethinking our platform and building it around this storage layer. So the first thing I want to touch on is compute. So we spend a lot of time integrating processing engines like Spark. Spark 3.0 actually has pretty much drop-in functionality for Iceberg, and we have support as well for 2.4.4. Take a look on the website for information about that. Presto has a native connector and there's ongoing work in many other areas, including a talk later today about integration with Hive. There's open discussion about integration within Impala. So these are really fundamental in order to expose all of this information in a consistent way to processing engines.
Another thing that we've actually done is exposed a lot of the metadata about these tables in ways that you can interact with it with queries. So your DML can be used to join metadata against physical data, and this is particularly helpful in a lot of cases and we'll touch on a couple of those in a minute. The next thing is integrating this with the streaming path. At Netflix, we have data pipelines bringing in event data and we're transitioning all of those over to writing through Flink into Iceberg directly. This makes data immediately available to any sort of processing engine, and it also allows for creating sources that can read that data back into those streaming engines for stream processing or backfill replay kinds of activities that you might want to do. It's also a great way to have longterm durable storage in a streaming ecosystem.
Next, I want to move on to access. So this is something that has kind of a combination of open source features as well as Netflix internal things that we're working on. So first of all, I mentioned that lots of systems now want access to data. A good example of that internally at Netflix is we have caching based systems like EVCache or Memcached, and they need to be able to pre-populate some of these caches and warm them up, and we can do that now directly from Iceberg through things like our Java client. This means that systems that don't want to pull in the entire Hadoop, Spark, whatever stack have access to data in a way that is consistent with what everything else is getting.
We have a Python client to support use cases that are in machine learning, as well as data science in the Python world. There's a huge growing contingent of people that want access to it for those kinds of use cases and they get the same kind of guarantees that you would in the Java world. Lastly, at Netflix, we actually built a data access service building on these things to provide a gRPC protocol so that users can actually request data in the format that they want to consume it in. So JSON data, Avro data, or even Arrow memory buffers, and this makes it really convenient to get access to data, which is really critical as it becomes more prevalent across lots of different areas of your business.
The next thing is actually janitors. So this is infrastructure that we built around keeping the data warehouse clean, and this leverages many of those metadata features because you can look at the metadata to TTL data in your data warehouse for regulatory or compliance reasons, to manage snapshots and keep track of how many versions you want for time travel type operations, as well as bucket janitors that we have to make sure that we are cleaning up the actual S3 buckets and that we don't have dangling files. It works kind of like a garbage collector for our entire warehouse. But these are all built on top of the functionality we provide for the metadata, which allows us to actually scale this very well.
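The garbage collector analogy can be sketched in a few lines of Python. This is an illustrative model only, not the Netflix janitors or the Iceberg API: snapshots are represented as hypothetical `(snapshot_id, files)` pairs, and a file counts as dangling when no retained snapshot references it.

```python
def retain_snapshots(snapshots, keep_last):
    """Keep only the newest keep_last snapshots (input is ordered oldest to newest)."""
    return snapshots[-keep_last:] if keep_last > 0 else []


def dangling_files(bucket_listing, snapshots):
    """Files present in the bucket that no retained snapshot references."""
    referenced = set()
    for _, files in snapshots:
        referenced |= set(files)
    return sorted(set(bucket_listing) - referenced)


# Hypothetical table history: each snapshot lists the data files it references.
history = [
    (1, {"a.parquet"}),
    (2, {"a.parquet", "b.parquet"}),
    (3, {"b.parquet", "c.parquet"}),
]
listing = ["a.parquet", "b.parquet", "c.parquet", "tmp-upload.parquet"]
```

Keeping two snapshots leaves only `tmp-upload.parquet` unreferenced, while keeping one also frees `a.parquet`. The retention count is the knob that trades time travel depth against storage.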
The next thing is actually where we get into that first title slide, the intelligent part of our data warehouse, so tuning these datasets. This is something that we've really started to invest in. First is just relocating datasets. We actually put data into multiple regions as we collect it, and we want to relocate that back to our data warehouse, and we have a system that will actually move this data and replace it within the dataset transparently to users. So they don't need to know whether they're reaching across regions to access the most recent data, or if it's been copied back already, and we can do that very intelligently and eventually we can do it based off of cost profiles and performance.
Another thing that we can do is we can compact tables. So we have a service that does this as well. It can look at entropy within partitions, target file sizes and determine when is the best time for us to actually compact the datasets. We can even use excess compute resources to do these kinds of things and eventually we get to the point where this is being automated and handled and you don't rely on users to have to go back and fix datasets that are possibly incorrectly structured. Lastly, we want to get to a point where we can actually take some changes and we can say, "Hey, we want to go back and restate all of this data in time to correct for things within the datasets."
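A compaction decision of this kind can be sketched with a simple file-size heuristic. This is an assumption for illustration, not the entropy measure the service uses: `plan_compaction`, the 512 MB target, and the "smaller than half the target" rule are all hypothetical.

```python
import math


def plan_compaction(file_sizes_mb, target_mb=512, min_small_files=2):
    """Return a compaction plan for a partition, or None if it is not worth compacting.

    A file counts as small when it is under half the target size. Small files
    are rewritten together into as few target-sized files as possible.
    """
    small = [size for size in file_sizes_mb if size < target_mb / 2]
    if len(small) < min_small_files:
        return None
    outputs = max(1, math.ceil(sum(small) / target_mb))
    return {"inputs": len(small), "outputs": outputs}
```

For example, a partition with files of 10, 20, 30, and 600 MB would have its three small files rewritten into one, while a partition holding only 500 MB and 600 MB files is left alone. A scheduler could run plans like this on spare compute capacity, as described above.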
The final piece of this is where we get the real intelligence. We've started to build analysis around the actual data, and this is the thing that really makes the data warehouse optimized and performant, and it's all built on top of Iceberg. So we want to be able to look at the actual physical layout of the data and apply different kinds of treatments to it. So we want to be able to change parameters around the compression, around how we sort the datasets, even internal to the file formats, changing parameters to see what is optimal. Just as an anecdote, we have one of our largest datasets that was very heavily managed by an engineering team and they had already sorted the dataset based on usage patterns for optimal stats based pruning.
Ultimately, we did some analysis on it and we found that by changing the compression a little bit and by introducing a secondary sort, we could reduce the size of the entire dataset by 60%, and that really translates to some serious savings in cost. So there's a lot of value that we think we can get out of this, and anecdotally, we've had other cases that have exceeded that. So it's really hard for a data engineer to both understand how a dataset is going to be used as well as how it's going to be stored in order to come up with an optimal solution and make that persist over time because data changes and use cases change.
So what we want to do is build this system to really automate this entire process so that we can look at how data sets are being accessed, as well as how we're tuning them and apply machine learning and various other approaches to come up with an optimal storage. The last thing is actually to make changes to the tables themselves, table properties themselves, and have that feed back into the compute engines. So if you decide that a table needs to be sorted a little bit differently, you can add that to the spec for the table, and the processing engine can pick that up and introduce that operation without a data engineer needing to be involved. So all of these things together kind of make up how we're rethinking how we build out our data warehouse and the services around our storage layer.
Next, I'd like to talk about just a few projects that we're actively working on and people may be interested in. So the first one of these just went into Iceberg a little while ago. We have the Parquet vectorized read path, as well as the ORC vectorized read path. At Netflix, we're going to be integrating this into our platform very soon. There's still some work and so we're going to kind of evaluate what performance benefit we get out of it and see if we want to invest in more advanced things like the complex type support within vectorization.
The next one is a really interesting project that we've been working on with the Presto SQL foundation. So materialized views is something that's under review right now, and it looks like it's pretty close to going in. We're building an underlying implementation that uses Iceberg as the storage layer for those materialized views. Iceberg makes it really nice to go and determine whether your data is fresh or not, and we can even make those determinations if the dataset has been modified. So there may be cases where you can actually still use materialized data, even though people have appended to the table or changed data around it, or even the infrastructure has changed.
Then the last item is row-level deletes. This is something that's gotten a lot of traction and that we're currently investing in for Iceberg. Many companies are interested in it for regulatory and legal purposes, but it is under active development. There is a spec out there and there's work going on to support deletes within Iceberg.
So with all of that, I just wanted to add two key takeaways. If you take away nothing else from this presentation, remember these. So the first one is that data is really the foundation of a warehouse or a data platform. It's what we build everything around, and so if you don't have a really good foundation for that, it's going to leak into all of the systems and services that revolve around it and that is going to be hard to maintain and manage and scale going forward.
The second one is you want strong contracts and open standards. For integrating internally as well as with vendor products and any other system, you want to have something that is very clear and concise, where you can have strong guarantees about the correctness of your data and how it is actually accessed. So with these things, you can build a data warehouse that is going to scale into the future. A lot of the technologies around it may change. They have in our platform; we've pretty much switched over every processing engine and technology at some point or another. So we expect that'll happen as well. We want to make sure that we have a strong foundation to go forward. With that, I want to thank everybody for joining and I will be around in a Slack channel later to answer any questions anybody has. Have a great day and enjoy the sessions.