Subsurface LIVE Winter 2021
Migrating to Parquet - The Veraset Story
Veraset is a data-as-a-service (DaaS) company that delivers petabytes of geospatial data to customers across a variety of industries. We build and manage a central data lake, housing years of data, and operationalize that data to solve our customers’ problems. I recently gave a talk about the specifics of file formats at Spark+AI Summit 2020 that generated a lot of questions about my company’s migration from CSV to Apache Parquet. As CTO of a DaaS company, I saw firsthand how this migration had a drastic effect on all of our customers. This session will drill into the operational burden of transforming the storage format in an ecosystem and its impact on the business.
Vinoo Ganesh, Chief Technical Officer, Veraset
Vinoo Ganesh is Chief Technology Officer at Veraset, a data-as-a-service startup focused on understanding the world from a geospatial perspective. Vinoo previously managed the compute team at Palantir Technologies, tasked with managing Spark and its interaction with HDFS, S3, Parquet, YARN and Kubernetes across the company. Most recently, this team was closely involved in pushing forward a number of open source Spark initiatives, including a DataSource V2 implementation and the External shuffle service.
Welcome everyone. Thank you for joining this morning session. My name is Emily and I’m from Dremio. I’ll be moderating this presentation today, and I’m glad to welcome Vinoo, who will present “Migrating to Parquet: The Veraset Story.” If you have any questions during the presentation, please type them into the chat at any time. We will address the questions live during the Q&A session at the end of today’s presentation. And now I’ll turn it over to our speaker, Vinoo.
[00:00:30] Thanks Emily. Hi everyone. And welcome to the first session on day two of Subsurface. My name is Vinoo Ganesh and I’m the Chief Technology Officer at Veraset. Today I’ll be talking about the process of migrating our central lake from CSV to Parquet and some of the lessons we learned along the way.
Before I say anything else, I want to define a file format migration. A file format migration is simply [00:01:00] changing the underlying format that your data is stored in on disk. I say simply, but there isn’t really anything simple about it. For that reason, I call this the most complex one-line code change you’ll ever make.
This entire presentation is really just about how to go from that first line there to the line right below it, with over three petabytes of data and all of the implications that it had to our pipelines, costs, and business as a whole.
[00:01:30] Specifically, we’ll cover a bit about why this talk is particularly interesting and what I hope you’ll get out of it. I’ll talk through some of the data primitives that we hold near and dear at Veraset, and give a quick overview of the various open source file formats available for use in the ecosystem.
Finally, I’ll introduce the framework that we developed for performing these types of migrations, as well as some lessons we learned every step of the way. I’ll then, of course, leave some time at the end for questions that you may have, [00:02:00] both live and on Slack.
Let’s start with some background. Again, my name is Vinoo and I run engineering at Veraset. Previously, I led the Spark and compute work at Palantir Technologies. Veraset is a data-as-a-service or DaaS startup in the geospatial data space. Our goal is to create the highest quality anonymized geospatial data available to power innovation across a number of verticals.
Most recently, our data [00:02:30] sets have been heavily used during the COVID-19 investigations to measure the effectiveness of non-pharmaceutical interventions and social distancing policies. Each year we process, optimize, cleanse, and deliver over two petabytes of data. As a DaaS company, data itself is our product. We don’t build analytical tools or any kinds of fancy visualizations. We’re just data. Meaning every day we think about the bottlenecks associated with operationalizing our data and how we can alleviate [00:03:00] them. To achieve our mission, it is pivotal that our data is optimized from a workflow, storage, retrieval, and processing perspective.
So why this topic? I chose to focus on this topic after experiencing a number of follow-ups from a talk that I gave at Spark Summit 2020. In that presentation, I described various open source formats, like the ones you see below, some of which we’ll cover again today, and did a deep dive into each one of these formats. In the [00:03:30] days and weeks that followed that session, I got over 20 follow-up emails asking various versions of: so what format should I migrate to?
That taught me two things. First, it’s generally not a great idea to put your personal email in public presentations. And second, there seems to be a distinct lack of knowledge in this space. There are many great talks and presentations, including ones here at Subsurface that cover the creation of a data lake, but few focus on the actual format and its effect on the consumers [00:04:00] of that data lake.
So here’s what I hope you get out of this session. First, I’d like to leave you with a high-level understanding of the dominant file formats in this ecosystem and their unique features. I’d also like to encourage you and your data teams to think critically and concretely about your selection of file format, specifically in terms of optimizing your data lake. Finally, I’d like to leave you with an understanding of a potential framework for running migrations, coupled with real- [00:04:30] world stories of our own migration. But to accomplish any of these goals, we have to start with the first principles.
If you’re at this conference, you probably work with data. Likely tons of it, but unless you’re in the data-as-a-service space, you probably don’t think of it as your product. The data-as-a-service space represents a substantial departure from the SaaS space on a number of dimensions, but with the most important dimension, being that data [00:05:00] is only valuable when compounded with historical data.
Let me say that again: data is only valuable when it can be interpreted with historical context. For the SaaS folks out there, imagine if every new version of your product depended on every previous historical release of your product, including that [inaudible 00:05:22] you had to gather that one time as a proof of concept for that VC.
That’s our life. For that reason, we developed some primitives about [00:05:30] data that govern our business and how we think about data as a whole. First and foremost, data is our API.
As a company delivering data to over a hundred customers, we have to follow a strict and disciplined release process, or we risk breaking downstream customer pipelines. For us, schema changes, adding or removing a column must be treated as a major API break. Along these same lines, changing the file format that the data is stored in must also be treated as a [00:06:00] major version bump.
Second, data is inherently opinionated. I don’t just mean the contents of the data. I mean the data presentation layer itself. Choosing a format says a lot about you as a company and as a provider. And it says a lot about the types of customers that you intend on working with. You can choose open or proprietary formats, restrict your data to be used by certain tools by choosing a proprietary technology specific format, or even restrict the workflows possible [00:06:30] with your data. All with just the choice of your file format. In short, your choice of data presentation reflects your ethos as a company.
Third, data is only as useful as it is easy to use. If it’s difficult to use your data, it simply won’t be used. That may sound obvious, but it’s often overlooked. There is such a thing as a graveyard for data, and it’s a really sad place to be. For that reason, we found it better to be used and expensive, [00:07:00] than be unused and cheap. These primitives underlie our product, govern our business and reflect our values as a whole.
I’m not going to spend too much time on this slide, largely because a lot of this information was covered in my Spark Summit 2020 talk, but starting off at the top, there are two types of predominant work streams in the distributed computation ecosystem. First is OLTP, or online transaction [00:07:30] processing. OLTP workflows are generally row-based, heavy processing workflows with frequent updates or deletes.
The other type of predominant workflow is OLAP, or online analytics processing. These workflows are column-based analytics workflows. Tabular data is generally saved in one of a few dominant file formats. These include CSV and JSON, both of which are human-readable, flexible formats, or Avro, ORC and [00:08:00] Parquet, which are more complex binary formats, each of which has a unique set of characteristics for optimal data storage, processing, and retrieval. Just on this topic, there are a few things I want to highlight.
First, you’ll notice that JSON, Avro, ORC and Parquet are all self-describing formats. That means the schema of the file is encapsulated within the file itself. Which means it’s easy to treat each file as a standalone encapsulated entity.
Second, you’ll [00:08:30] notice some of these are row-based and some are what I call hybrid based. Row-based formats lay data out contiguously in rows on disc, and hybrid formats lay data out as contiguous columns grouped into chunks of rows.
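To make those two layouts concrete, here is a toy sketch in plain Python of how the same four-row table might be laid out row-wise versus as column chunks grouped into row groups (the hybrid layout Parquet and ORC use). Real formats add encodings, statistics, and metadata on top; this is purely illustrative.

```python
# Toy illustration of on-disk layouts; real formats are far more involved.
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]

# Row-based layout (CSV, Avro): each record's values stored contiguously
row_layout = [value for record in rows for value in record]
# -> ['a', 1, 'b', 2, 'c', 3, 'd', 4]

def hybrid_layout(table, group_size):
    """Hybrid layout (Parquet, ORC): rows grouped into row groups,
    then each column stored as one contiguous chunk within the group."""
    layout = []
    for start in range(0, len(table), group_size):
        group = table[start:start + group_size]
        for col in range(len(table[0])):
            layout.extend(record[col] for record in group)
    return layout

print(hybrid_layout(rows, 2))  # ['a', 'b', 1, 2, 'c', 'd', 3, 4]
```

The hybrid form is why columnar scans are cheap: a reader that only needs the second column can skip over each group's first-column chunk entirely.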
As you can see, there’s a lot going on here. And as I mentioned earlier, your choice of file format can speak volumes about your business. The conclusion of that presentation at Spark Summit 2020 highlighted how the attributes of your dataset and workflows [00:09:00] can inform your choice of file format. I discussed the importance of testing and monitoring your format selection as well as keeping libraries up to date. But that wasn’t the full picture. Actually running the migration can be complex, tedious, error prone, and overall difficult, but it can have an immense reward.
To illustrate, let’s kick off the story of Veraset’s migration. How it started: rewind about a year ago. Veraset’s pipeline [00:09:30] was cumbersome, highly error prone, and failed with truly alarming frequency. Our daily three-terabyte geospatial pipeline ran for approximately seven hours and required one i3.16xlarge box and over 100 i3.8xlarge boxes just to run successfully.
We had over 800 terabytes of data in S3 with over 80 million objects. Our pipeline then cost $650,000 a year at best [00:10:00] to run. At the time we used CSV as our internal data format, for reasons that actually predated me. Now, that fun little graph on the right-hand side is a screenshot of our PagerDuty alert frequency. At our peak, we had about four production incidents a week for an engineering team of one: me. It was not a fun period for us.
How it’s going: today, post our migration to Parquet, our pipeline runs in about three and a half hours, half the time it previously [00:10:30] took to run. Our resource requirements dropped to one i3.8xlarge box and 80 i3.4xlarge boxes, representing a $127,000-a-year pipeline run cost and roughly half a million dollars a year in savings.
Our object count and storage have understandably gone up; data is our service. But the most exciting part of this is the PagerDuty graph on the right. We dropped to about one incident per month, if even [00:11:00] that, just from the selection of file format. We ended up choosing, as I mentioned before, Parquet, but not just Parquet: Snappy-compressed Parquet.
Interestingly enough, the compression codec itself was actually a hot topic of conversation internally before we converged on Snappy. The migration was a huge one for our business, but it wasn’t easy. So without further ado here was our process and the lessons that we learned along the way.
As anyone who’s ever [00:11:30] managed a project will tell you, the first step to any process is exhaustively gathering requirements. I’ve listed a partial set of our requirements here and highlighted the key phrases. For this particular step, I want to focus on only internal requirements. And I always like telling a narrative around how we focus on these requirements.
So we ingest data that has strict compliance and strict modeling requirements. We then cleanse the data, hoping to minimize costs while maximizing uptime. We deliver this [00:12:00] to customers in both daily and historical deliveries and need to support the future expansion and evolution of our dataset. There simply isn’t a one-size-fits-all format, so it’s very likely that you’re going to have to stack-rank your requirements and acknowledge that you may not be able to meet all of them.
The biggest lesson that we learned here is differentiating between hard and soft requirements. It’s so easy in this process for nice-to-haves to become must-haves, and it’s pivotal to differentiate between [00:12:30] the two at the beginning. An example from our experience: one of the requirements that we thought would be great was full support for schema evolution, meaning exploring something like Avro. In our case, the feature actually operated in contradiction with a bunch of other requirements, so we unfortunately had to de-scope it. It’s these trade-offs that really make the internal requirement-gathering process so important.
Second, before speculating about which format you want (and there are biases out there; [00:13:00] I will admit I was very biased) or committing to any format, it’s pivotal that you have a deep understanding of both your data and your workflows. Everything from column-level cardinality to data distribution will inform your file format decision. Gathering this information is not trivial and can be extraordinarily expensive, both computationally and monetarily.
Once you have that information, think about how your data will be operationalized and what the [00:13:30] workflows will be. This is an area where we learned the importance of data profiling firsthand a little too late. Our initial beliefs were largely driven by unstated assumptions about our data. Spark had been processing the data seamlessly thus far. So there was no reason to believe anything would materially change just by changing the format the data was saved in.
During one of our test runs we underestimated the skew present in one of our columns and ended up writing an unintentionally, heavily skewed dataset. Meaning 99% of [00:14:00] the Spark tasks had finished, but the job ended up hanging due to one still-computing task. Once we realized this, we went back and did the full suite of due diligence I’ve listed here on the data.
If you’re curious what skew looks like in practice, I have a notional example at the bottom of the slide right there. Unfortunately, I don’t have our exact example because it’s been purged from Databricks, but you can see in this example we have 199 tasks successfully completed, the job running for about 47 hours, and one just forever [00:14:30] hanging task.
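A cheap first-pass check of the kind that could have caught this: count records per partition-key value and flag any key whose count dwarfs the median. This is an illustrative sketch; the 10x threshold is arbitrary and would need tuning against a real dataset.

```python
from collections import Counter
from statistics import median

def skew_report(keys, threshold=10.0):
    """Flag partition-key values whose record count dwarfs the median count."""
    counts = Counter(keys)
    med = median(counts.values())
    hot = {k: c for k, c in counts.items() if c > threshold * med}
    return med, hot

# 99 well-behaved keys with 1 record each, plus one key with 5,000 records
keys = [f"key{i}" for i in range(99)] + ["hot"] * 5000
med, hot = skew_report(keys)
print(med, hot)  # 1 {'hot': 5000}
```

In a real pipeline the counts would come from a cheap aggregation over a sample rather than materializing every key in memory, but the ratio test is the same.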
Step three, is this worth it? There comes a time in every large scale feature development process where someone asks, is this really worth doing? In this case more so than any other, I would stress how important this question is to ask at this point. The screenshot on the right shows Amazon’s attempt to calculate the number of objects in our bucket.
This is our real data. It is really just a cheeky way of saying we have huge amounts of data, just like your firms do. As I’ve [00:15:00] said before, migrations are complicated and error prone and can be extraordinarily expensive. Moving petabytes and terabytes of data is no easy task. Before pursuing this as a path, ask yourself, is this really worth it? And what features or workflows am I unblocking by pursuing this migration? Some would argue that this is a step that should come earlier in the process, but to truly assess whether the migration is worth doing you’ll want the statistics and information that you would’ve gotten from the previous steps.
[00:15:30] In our case, we were halfway through what I’ll describe as step four. When we almost gave up. The number of moving pieces and potential pain made the migration less and less appealing, but luckily we pushed through and we’re happy we did. Step four is customer due diligence. As I mentioned before, the file format you present your data in speaks volumes about you as a business. The distributed computation ecosystem is a vast ecosystem with a number of players. A format change is a major breaking [00:16:00] change, and you’re going to need support from your customers throughout the process.
I found the following questions to be helpful. First, who will consume the data? Is it data scientists, engineers, devs, who will consume it? Second, what will consume the data? Is it Databricks, Snowflake, Dremio, Dataproc, some proprietary tool they built internally or something else? Three, where will the data be stored? It could be in S3, in HDFS, potentially on-prem, potentially on a hard drive. [00:16:30] I’ve seen that before. It’s really important to consider. And four, when is the data accessed? In the world of batch versus real-time computation, it really depends on when and how frequently your data is accessed to make a sane choice here. In Veraset’s case, our data is consumed by a number of external parties, all of whom use different tools to analyze the data.
It would have been impossible for us to pick a format that seamlessly works for everyone. At a minimum, for most systems, there will need to be a production [00:17:00] code change on our customers’ side. For that reason, it’s important to support your customer throughout the process, whether they’re an internal or external customer, and to not be surprised that there’s friction in the process. In our case, we’ve had customers that had never heard of Parquet or Avro. We had customers that complained Excel couldn’t open Parquet files.
One of my favorite stories was a customer that had a manual QA process for each file in a dataset that we delivered every day. Someone would come into work [00:17:30] and look through each of the 300-plus files that we delivered in CSV on a daily basis, opening each of them up to make sure they all looked right. After we moved to Parquet, that person could no longer do that check, which represented a huge break in the customer’s process, which meant they had to do enhanced change management and push their transition back a few times.
You never know what’s going on in your customer ecosystem, and it’s important to understand their requirements as you scope your [00:18:00] migration. The last thing you’ll notice: I put step three, “Is this worth it?”, before this step. There’s a simple reason for that. As I’ve described, this step is going to be ultra painful and may actually kill your motivation to move forward. It almost killed ours. To the people in that scenario, I would say change is always hard, but if you’ve decided to move forward, don’t let the pain of a migration deter you.
Step five is to select the format and commit to it. That means once you’ve committed [00:18:30] to getting all the features, you’re committed to getting all the limitations of that format. You’re likely going to be stuck with it for a non-trivial amount of time as well. For that reason, I’d recommend generating data samples of your data, playing around with those samples and actually seeing if they work for you before rolling them out to your customers or proposing them as changes.
You’ll notice unquestionable changes in storage, compute, and even network utilization. This is probably the biggest lesson we learned: migrations [00:19:00] need to be atomic. Either the entire dataset, current and historical, should be migrated, or none of it should be. For DaaS companies that are delivering data daily, it may be tempting to make this a forward-looking change that only takes effect from one point in time, ignoring all the historical data.
However, I’d strongly advise against this. Trying to merge datasets of different formats is non-trivial and cumbersome. The trade-off here is really just the processing time and cost. Reading [00:19:30] in and writing out data, potentially petabytes of it, is expensive from a time, resource, and cost perspective. In our case, we tried to be clever. Rather, I tried to be clever, and decided to model our migration like a Cassandra read-repair operation.
When a segment of data was asked for, we did the rewrite on an on-demand basis and persisted those results. The driving factor here is we’re a small startup and we had to reduce costs. That strategy sounded great in theory, but in practice, we ended up with a [00:20:00] fractured dataset that now exists in two different formats. At some point, we’re going to have to bite the bullet and migrate all of the remaining data segments over, but that isn’t going to be trivial to do. In short, learn from us and just migrate the entire data set at the beginning.
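A simple audit of the kind that would surface a fractured dataset like ours: group object keys by file extension per dataset prefix and flag any dataset that mixes formats. The paths and naming scheme here are hypothetical.

```python
from collections import defaultdict

def formats_by_dataset(object_keys):
    """Map each top-level dataset prefix to the set of file extensions it contains."""
    found = defaultdict(set)
    for key in object_keys:
        dataset, _, filename = key.partition("/")
        if "." in filename:
            found[dataset].add(filename.rsplit(".", 1)[1])
    return dict(found)

# Hypothetical S3-style keys; one stray unmigrated CSV segment remains
keys = [
    "movement/part-0000.snappy.parquet",
    "movement/part-0001.csv",
    "visits/part-0000.snappy.parquet",
]
mixed = {d: exts for d, exts in formats_by_dataset(keys).items() if len(exts) > 1}
print(mixed)  # {'movement': {'parquet', 'csv'}}
```

Run as a scheduled check, failing loudly on a non-empty result, this turns "we'll migrate the rest later" from a silent debt into a visible one.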
Six. The last step here is to retrospect your decision, and it’s probably the most important step. Did the migration go the way you expected? If so, why? If not, why not? In [00:20:30] Veraset’s case we had an unexpected failure when writing historical data in the new format. A fixed-data-type column all of a sudden had a number format exception that we didn’t expect. It’s kind of scary if you think about it. Is there something we could have done with our data to catch this before our production run, or is [inaudible 00:20:47] the only place to catch it?
Migrating your file format is not a static process either. File formats are code and need to be kept up to date for performance and stability reasons. Finally, make sure you get feedback from your customers [00:21:00] or consumers about how the format is working in their ecosystem. Many may actually give you surprising answers.
In conclusion, undertaking a file format migration can be a [inaudible 00:21:12] to your organization, but it isn’t without its challenges. Throughout the process, keep a focus on whoever will be consuming the data. It’s more important to create usable expensive data sets, than cheap unused data sets. It’s important to create fixed checkpoints to reevaluate your decisions and ensure they’re still working for you, especially [00:21:30] from a compliance perspective. Make sure to retrospect and learn from your decisions. These migrations are great opportunities to gain customer insight as well.
Finally, this ecosystem is changing rapidly. Many of you have heard of Apache Iceberg and may have seen Ryan Blue’s talk; I think he actually just joined this session from Netflix. Iceberg transitions the abstraction from the data file level to the table level. It represents a pretty interesting development in this community, and I’d recommend checking out [00:22:00] those talks for more information.
Thank you so much for the time today. I’m happy to answer any questions you have both live here or on Slack afterwards.
All right. Thank you so much. So we’ve got a few questions coming in, just a reminder to the audience that if you have any questions, please type them in the chat. As Vinoo mentioned, he will also be available on the Slack community afterwards. So if you have any questions, you can just look up his name and find him there. [00:22:30] Also, before I get into the questions before you leave today, please go to the Slido tab at the top of your chat panel. There’s a three question survey there. So if you could complete that before leaving, that would be fantastic. All right, into the questions. So first one, “Was data mutability an issue in your storyline? If yes. How did you manage that?”
That’s a phenomenal question. The historical world has thought of data as these locked in units that don’t change. That’s just not the world we operate in anymore. But one of the things that [00:23:00] we decided early on in Veraset is that we don’t want mutable data. Our goal is to catalog history and the way it happened. And so one of the big assumptions that we were able to make in this format is that our data is not mutable.
Anytime that there is an update, we can actually deliver that in the form of new files or new additions, but it will forever be a growing dataset, where the tip of that dataset is the most recent information that we get. Now that sounds like a cop-out [00:23:30] answer, and I almost think of it as a cop-out answer too. The reality is that the technologies are really still developing for datasets of this size to be mutable without creating a proprietary B-tree-like indexing system to access individual rows. Trying to index into a dataset that’s three or four petabytes just isn’t super feasible right now. Great question.
Great. Next question. “Any tips on blending joins from both SQL DB tables [00:24:00] and Parquet files? What tools are best, if not pandas?”
So this is a great question that’s illustrating another rapidly developing aspect of this ecosystem. Reading from a database and joining to Parquet right now represent two distinct operations. I need to create my JDBC, ODBC connection, pull that data in, actually pick a join key and do that join…
There isn’t a great story for the in-memory aspect of these joins until recently. So if you’ve seen the talks around Apache Arrow [00:24:30] or Arrow Flight, these are huge developments in the ecosystem that actually ask: how do I have a columnar format in memory optimized for these types of operations? Now, the reality is right now you have to get both of these data sets in memory to do any kind of operation on them.
So there isn’t really a strategy or a great tool that I’d recommend. It’s going to be expensive to load depending on the size of your data set all of this in memory. But I would almost say like pick one, if you can. [00:25:00] Parquet is actually great at storing small data in addition to big data. So it may be easy to transition your entire data to Parquet and join based off of that.
Great. Next question, “Any tips on how to properly partition data to optimize the storage Parquet file size and retrieval access files?”
Great question, and also one of the big questions in the community. So back in the olden days of Hadoop (everyone was using Hadoop before S3) there was a notion of a block size. [00:25:30] I think it was 64 megs and later became 128 megs. And the goal was to pick a Parquet file size that would exactly match the Hadoop block size.
So you wouldn’t have these partially filled Hadoop blocks and have to do all these random reads, which would be unnecessary. With the migration to S3 and proprietary blob storage tools, we actually don’t have a lot of great insight into what the block sizes are. So I think it really depends on how your organization uses this data. The previous [00:26:00] question asked about pandas as well. Pandas is… So there’s Koalas now, which Databricks has introduced, but a lot of this computation has to happen in memory.
And when you’re working in pandas, it’s going to be all on one driver, and trying to load all of this data into that one driver is going to be really hard. So I’m going to give you another kind of cop-out answer, which is: it really comes down to testing. Depending on the number of files in your data set, or even the cardinality, the distribution, or the partition key, your Parquet [00:26:30] files can actually vary drastically in size and shape.
So I would say if you’re in an on-prem world or in a Hadoop world, your answer is a lot easier. You have that block size metric to optimize on. So 128 megs or multiples of that. If you’re in an S3 or proprietary world, it’s going to be a lot of testing just because we don’t have a lot of great insight into the blob storage mechanisms for those technologies.
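The block-size heuristic described above reduces to simple arithmetic: pick a target file size and derive the output file count from it. In Spark this number might feed a repartition call; the 128 MB default below is just the classic HDFS-era starting point, not a rule, and S3-backed lakes should tune it empirically.

```python
import math

def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files so that each one lands near the target size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 3 TB daily output aimed at ~128 MB files
n = target_file_count(3 * 1024**4)
print(n)  # 24576
```

In a Spark pipeline this would typically be applied as something like `df.repartition(n).write.parquet(path)`, then adjusted after measuring the actual file sizes, since compression and encoding make the on-disk size hard to predict exactly.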
Great, “Any suggestions on handling schema changes?”
[00:27:00] So, yeah, these are phenomenal questions. I’ve heard everything from just pre-seeding fake columns into your dataset to make sure that, if you need to populate them down the line, those columns exist. I would advise against this. There’s no reason to just bloat a bunch of [inaudible 00:27:19] into your data set, unless you truly have to.
One of the primitives that we’ve thought of is, data is our API, as I’ve mentioned, and data needs to be version controlled. For the longest [00:27:30] time you tag major releases with Git, everything is [inaudible 00:27:34]. We don’t really think about that for data, but with technology projects like Nessie, we’re actually asking: what does change management look like from a data perspective?
And that’s actually a really complicated topic. And so what I would say is version your data as well. Treat schema changes as major version breaks, or at least different versions of your product. As you migrate from Spark 2 to Spark 3, it’s a major, major version change. You know there are going [00:28:00] to be challenges along the way. So I would say version your data and different customers may want to pick the tip of develop, depending on their risk tolerance. Some may want to just forever stay on one version rather than rewriting their whole systems.
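One way to make “data is our API” mechanical is to classify a schema diff the way semver classifies an API diff. This sketch treats removals and type changes as major and additions as minor, though, as mentioned earlier, Veraset treats even additions as major breaks; the column names and type strings are illustrative.

```python
def classify_schema_change(old, new):
    """old/new: dicts of column name -> type string.
    Returns 'major', 'minor', or 'patch' in semver fashion."""
    removed = old.keys() - new.keys()
    retyped = {c for c in old.keys() & new.keys() if old[c] != new[c]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"   # breaks existing readers
    if added:
        return "minor"   # additive; some providers treat this as major too
    return "patch"

v1 = {"device_id": "string", "lat": "double", "lon": "double"}
v2 = {"device_id": "string", "lat": "double", "lon": "double", "accuracy": "double"}
print(classify_schema_change(v1, v2))  # minor
print(classify_schema_change(v2, v1))  # major
```

Running a check like this in CI against the last released schema is a cheap way to force every schema change through an explicit version-bump decision.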
“What is the best way to create Parquet files from streaming data, like Kafka, while making sure to have large block sizes/row groups and avoiding small files?”
Great question. So the small-files problem is a pretty well-known problem in Hadoop. [00:28:30] It’s kind of the inverse of the problem of filling the block size in Hadoop. Now, there are a lot of mechanisms you can use. One that I’ve seen be particularly useful, with Amazon Kinesis, is creating an actual cache where you can cache these files with a write-ahead log, making sure they’re actually persisted in some way, and running these compaction operations in memory before actually dumping them into your core dataset.
Now, I think the better answer I’m going to give you is, as [00:29:00] the technology ecosystem develops, this notion of ACID transactions on top of your dataset will become more of a real thing. I think Iceberg actually gets at that as well. So the advice I would give is, you can home-build your own solution where, as these files come in, whether they’re in Avro or something else, you coalesce them together and do these atomic batch writes to your ending system.
But right now you’re actually highlighting a more difficult part of this problem. And [00:29:30] it’s still an area that you need to test and actually make sure it will work well for you. One thing I’ll say also is, we worked so hard to avoid the small files problem in Hadoop. I totally get it. Again, we just don’t see the mechanisms in S3 or in any kind of blob storage from Google or Azure. So we don’t really know how it works behind the scenes. And a lot of the speculation that we have, you can try and bother your AWS S3 rep, but it’s not going to be easy to get those answers. So testing is actually going to [00:30:00] be really important.
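The caching-plus-compaction idea can be sketched as a buffer that accumulates small incoming batches and flushes only once it holds roughly a file’s worth of data. A real implementation (for example, on Kinesis) would add a write-ahead log for durability; this toy version shows only the batching logic, and the threshold is arbitrary.

```python
class CompactingWriter:
    """Buffer small incoming records; flush one large file's worth at a time."""

    def __init__(self, flush_threshold_bytes):
        self.threshold = flush_threshold_bytes
        self.buffer = []
        self.buffered_bytes = 0
        self.flushed = []            # stand-in for files written to the lake

    def write(self, record: bytes):
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.threshold:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(b"".join(self.buffer))  # one compacted "file"
            self.buffer, self.buffered_bytes = [], 0

w = CompactingWriter(flush_threshold_bytes=1024)
for _ in range(300):
    w.write(b"x" * 10)   # 300 tiny records arrive from the stream
w.flush()                # drain the tail
print(len(w.flushed))    # 3 compacted files instead of 300 tiny ones
```

Table formats like Iceberg and Delta move this compaction concern out of application code, which is why the ecosystem is heading that way.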
Rob asks, “Why parquet instead of Avro, ORC, et cetera? What features pushed you in that direction?”
Great question. I actually really loved the summary metadata feature of Parquet. For those who don’t know what that is, you can actually write a Parquet summary file that exposes all of the indexing systems and mechanisms that exist on a per-file basis, and across the Parquet files as a whole. To [00:30:30] me, that feature was magical, because we had so much data that having our Spark jobs read the footers of each of our Parquet files was just super expensive.
Now, the summary metadata file is actually largely deprecated, but it helped push our decision forward a lot. In terms of Avro or ORC, this is really just a religious battle. I brought in the customer focus, and it just so happened that, first, there was the familiarity I had internally with Parquet, [00:31:00] but also the ecosystem and the tools available leaned toward Parquet. There were just more Parquet readers out there, making our customers’ transitions a lot easier than if we were to go with something like Avro or ORC.
So it was a kind of an interesting decision that really came down to, well, what would be the most seamless migration experience for our customers and this informed that pretty heavily.
Great we’re at time so this is the last question, but just a reminder, you can go over to the Slack community [00:31:30] and ask your questions to Vinoo there. So, “Can one expect a spike in CPU from dictionary construction during write, and deconstruction during read? If yes, is this significant enough to be considered?”
Yes. So the answer is 100% yes. It will be computationally expensive to actually build these indexing systems out. The question really comes down to: who are you optimizing [00:32:00] for? So often, I think, data engineers, including us, optimize for our own sake. We’re like: we want the job to be done, we don’t want to be paged, we just want the data to be persisted. That’s not the best thing to optimize for.
I would say optimizing for your customers, or for read time, is actually a lot more important. So there is a CPU spike. There’s actually also a network spike, and there’s a storage spike if you’re introducing summary metadata files. I’ve actually deprioritized that and almost [00:32:30] taken the hit in pipeline execution to make things easier for our customers to use.
And that’s a decision where it remains to be seen if it was the right trade-off. I personally think it was, because it means that our customers actually also save hundreds of thousands of dollars in compute costs. We’ve actually had customers save that much. But the spike is expected to be there, and I would say it really depends on whether you want to optimize on the write time of the job or the read time.
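Dictionary encoding, one source of that write-time CPU spike, is straightforward to sketch: each distinct value is stored once in a dictionary, and rows store small integer indices into it. Encoding pays CPU at write time (hashing every value); readers pay a little to decode, but for low-cardinality columns the storage and scan savings usually dominate.

```python
def dictionary_encode(values):
    """Replace each value with an index into a deduplicated dictionary."""
    dictionary, indices = [], []
    positions = {}
    for v in values:
        if v not in positions:          # the write-time CPU cost lives here
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices

def dictionary_decode(dictionary, indices):
    """Read-time cost: one lookup per row."""
    return [dictionary[i] for i in indices]

col = ["US", "US", "CA", "US", "MX", "CA"]
dictionary, indices = dictionary_encode(col)
print(dictionary, indices)  # ['US', 'CA', 'MX'] [0, 0, 1, 0, 2, 1]
assert dictionary_decode(dictionary, indices) == col
```

Parquet’s actual dictionary pages add size limits, fallback to plain encoding, and bit-packing of the indices, but the write-versus-read cost asymmetry is the same as in this toy version.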
Great. Well, thank you so much everyone for [00:33:00] joining us, for participating. Thank you, Vinoo. This was a fantastic session. So yeah. Enjoy the rest of your day. Thank you everyone.
Thank you so much, guys.