8 minute read · April 16, 2024
Deep Dive into Better Stability with the new Memory Arbiter
· Senior Product Manager
· Senior Staff Software Engineer
· Staff Software Engineer
· Senior Software Engineer
· Principal Software Engineer
· Director, Software Engineering
Introduction
Memory Arbiter is a new and important feature that was released to Dremio Cloud and is coming to Dremio 25.0. MA enables executors to better utilize their direct memory by accurately tracking its usage across certain operators and, where possible, dynamically reducing the memory footprint of those operators to allow others to consume it. This, along with Spillable Hash Join, is the first of several features in the pipeline for Dremio to react better to diminishing resources and keep running where it previously would have failed.
How Memory Arbiter changes execution
Memory Arbiter changes execution for a particular set of operators once a fragment of work has landed on an executor and processing is underway. The way queries are planned and initial memory allocation remain the same and follow this flow:
- A query is submitted, and a plan is generated
- Based on the constituent operators of the plan and metadata regarding the source, a minimum viable estimate is generated
- For operators that do not hold or expand as data is processed, this is a small static number
- For operators that do hold or expand as data is processed, formulas are leveraged to generate this initial allocation based on the stats of the queried datasets
- The initial memory allocation for a fragment of execution is made on the executors
- Further memory allocations are made throughout the execution of the fragment as needed
The key differentiator is the final point. Previously, all operators would request as much memory as needed until query completion or memory exhaustion. Existing guard rails did offer some fault tolerance but would often intervene too late and not be able to prevent query failure. For example, spillable operators had their spill trigger point defined during planning; however, if one encountered an out-of-memory exception, it would start spilling instead. This would prevent failures, but such an exception signals there is little headroom available to continue execution. With Memory Arbiter, executors are now able to react more quickly to changes in allocatable memory instead of relying on arbitrary spill points set in the plan or an out-of-memory exception.
How Memory Arbiter Works
As mentioned above, Memory Arbiter only changes the execution of four operators (as of the time of writing):
- Hash Join
- Hash Agg
- External Sort
- TopN
These operators are often the highest memory consumers and, importantly, also have a mechanism to shrink their memory usage. Hash Join, Hash Agg and External Sort can spill in mid-operation, and the TopN can discard data structures that were used while processing and are no longer needed for the output. Each executor has its own Memory Arbiter that tracks each instance of these operators, the amount of memory they’re using and, importantly, the amount of memory they’re able to give up. If memory usage on an executor crosses a threshold, Memory Arbiter will start requesting that these operators shrink their usage, starting with the one with the most to give. Memory Arbiter also acts as a middleman for new memory allocation requests for these operators. Therefore, requests to shrink can be paired with temporarily blocking new memory allocations, thus reducing existing allocations and growth rate.
Memory Arbiter only tracks Direct memory since data processing happens in Direct memory. Not all operations in Direct memory are being tracked, despite Memory Arbiter's best efforts, Direct memory exhaustion remains possible.
Testing Memory Arbiter and Spillable Hash Join
The addition of Memory Arbiter and Spillable Hash Join is a significant change that requires expensive testing. We used TPC-DS, our Regression suite, and a new suite based on Customer queries. The customer queries selected were either very complex or were prone to failures before Memory Arbiter. All of these were compared to runs of the same version of Dremio with the features disabled.
Test Suite | Data | Run style | Number of executors |
TPC-DS | 1TB partitioned and non-partitioned | In-series | 2, 4, 8 |
Concurrent 10, 20, 50 | |||
10TB partitioned and non-partitioned | In-series | 16 | |
100TB partitioned | In-series | 32 | |
Regression | Synthetic - relatively small to allow for faster cycles | In-series | 8 |
Customer Queries | Synthetic - Same size as the original data set | In-series | 8 |
We targeted 0 failures with a 5% performance delta for all executor counts.
With the inclusion of Memory Arbiter and Spillable Hash Join, Dremio could complete a full 100TB TPC-DS run without failure, which was previously impossible.
Initial Customer Testing
Multiple EE Cloud customers took part in the private beta. Some had been experiencing memory-related issues, while others were more to gauge any perceived performance impact. Comparing the same-sized time window before and after the features enablement (close to two months on either side), we saw some very positive results for customers facing memory-related issues.
The highest saw a 90% decrease in Direct memory allocation errors, with the lowest seeing a 56% decrease. This graph shows the customers that saw the 90% decrease:
MA = Memory Arbiter, SHJ = Spillable Hash Join. The two spikes were driven by an automated process, which repeatedly submitted the same query, even if it had not succeeded before and was destined to fail again.
The graph shows direct memory errors occurred with a relatively constant background rate with some spikes. After the enablement, these errors dropped immediately and lastingly.
The results for customers testing for performance degradation during day-to-day operation were also positive, with one asking if we’d forgotten to switch it on.
Recommended changes to WLM
This only concerns customers using Dremio self-managed once it launches in version 25.0. If you have set queue memory limits, we recommend removing them. The queue and job memory limits help self-managed customers avoid the noisy neighbor problem between workloads of varying priority. However, the queue and job memory limits are relatively heavy-handed and may intervene despite Memory Arbiter having the situation under control, thus diminishing its effectiveness.
Closing
With the Cloud general availability release behind us, we’ve been very happy with the results and are looking forward to the imminent 25.0 software release. We are already planning how to enhance the feature further with additional shrinkable operators, a better shrink selection algorithm, prioritized operators, integration with other monitoring systems, and others.