[YARN-11073] Avoid unnecessary preemption for tiny queues under certain corner cases - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.10.1
Fix Version/s: 3.4.0
Component/s: capacity scheduler, scheduler preemption
Labels:
- pull-request-available

Description

When running a Hive job in a low-capacity queue on an otherwise idle cluster, preemption kicked in and preempted the job's containers even though no other job was running or competing for resources.

Let's take this scenario as an example:

cluster resource : <Memory:168GB, VCores:48>
- queue_low: min_capacity 1%
- queue_mid: min_capacity 19%
- queue_high: min_capacity 80%
CapacityScheduler with DRF

During FIFO preemption candidate selection, the preemptableAmountCalculator first needs to "computeIdealAllocation", which depends on each queue's guaranteed (minimum) capacity. A queue's guaranteed capacity is currently calculated as "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed capacity of queue_low is:

queue_low: <Memory:(168*0.01)GB, VCores:(48*0.01)> = <Memory:1.68GB, VCores:0.48>. However, since the Resource object holds only long values, these double values are cast to long, and the final result becomes <Memory:1GB, VCores:0>.
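The truncation above can be reproduced with plain arithmetic. The following is a minimal sketch, not the Hadoop code: `GuaranteedCapacityTruncation` and `truncate` are hypothetical names, and plain `long` values stand in for the Resource object, using the same units (GB and VCores) as the example.

```java
// Hypothetical helper (not part of Hadoop) that imitates multiplying a
// long-valued resource by a fractional capacity and casting back to long.
final class GuaranteedCapacityTruncation {

    // The cast drops the fractional part, so any product below 1 becomes 0.
    static long truncate(long total, double absCapacity) {
        return (long) (total * absCapacity);
    }

    public static void main(String[] args) {
        long clusterMemoryGb = 168;
        long clusterVcores = 48;
        double queueLowCapacity = 0.01; // min_capacity 1%

        System.out.println("queue_low memory (GB): "
                + truncate(clusterMemoryGb, queueLowCapacity));
        System.out.println("queue_low vcores: "
                + truncate(clusterVcores, queueLowCapacity));
    }
}
```

With the values from the scenario, the memory guarantee truncates to 1 and the VCores guarantee truncates to 0, which matches the result described above.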

Because the guaranteed capacity of queue_low is 0 in the VCores dimension, its normalized guaranteed capacity across active queues is also 0 under the current algorithm in "resetCapacity". This eventually leads to continuous preemption of the job containers running in queue_low.
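To illustrate why a zero guarantee stays zero after normalization, here is a simplified sketch for a single resource dimension. It is not the actual "resetCapacity" code: `NaiveNormalization` is a hypothetical class, and the behavior when the sum is zero (leaving the share at 0) is an assumption made for the sketch.

```java
// Hypothetical, single-dimension illustration of normalizing each active
// queue's guaranteed value by the sum over all active queues.
final class NaiveNormalization {

    static float[] normalize(long[] guaranteed) {
        long sum = 0;
        for (long g : guaranteed) {
            sum += g;
        }
        float[] normalized = new float[guaranteed.length];
        for (int i = 0; i < guaranteed.length; i++) {
            // Assumption: when the sum is zero the share is left at 0.
            if (sum != 0) {
                normalized[i] = (float) guaranteed[i] / sum;
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // VCores guarantees after truncation: queue_low = 0, queue_mid = 9.
        float[] shares = normalize(new long[] {0, 9});
        System.out.println("queue_low share: " + shares[0]);
        System.out.println("queue_mid share: " + shares[1]);
    }
}
```

Whether queue_low is the only active queue or shares the cluster with queue_mid, its normalized share in this sketch is 0, so every container it holds looks like it exceeds the guarantee.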

To work around this corner case, I made a small patch (for my own use case) around "resetCapacity" to handle a few additional scenarios:

If the sum of absoluteCapacity/minCapacity of all active queues is zero, we should normalize their guaranteed capacity evenly:
```
1.0f / num_of_queues
```

If the sum of the pre-normalized guaranteed capacity values (MB or VCores) of all active queues is zero, meaning there may be several queues like queue_low whose capacity value was cast to 0, we should also normalize evenly, as in the first scenario (if they are all tiny, the difference is negligible, for example 1% vs. 1.2%).
If one of the active queues has a zero pre-normalized guaranteed capacity value but its absoluteCapacity/minCapacity is not zero, we should normalize by the weight of each queue's configured absoluteCapacity/minCapacity. This ensures queue_low gets a small but fair normalized value when queue_mid is also active:
```
minCapacity / (sum_of_min_capacity_of_active_queues)
```
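The three scenarios above can be sketched as follows. This is an illustration under stated assumptions, not the attached patch: `WeightedNormalizationSketch` and its parameters are hypothetical, and it handles one resource dimension (for example VCores) at a time.

```java
// Hypothetical sketch of the three normalization scenarios for one
// resource dimension; names and structure are not taken from the patch.
final class WeightedNormalizationSketch {

    static float[] normalize(long[] guaranteed, float[] absCapacity) {
        int n = guaranteed.length;
        float[] normalized = new float[n];
        float capacitySum = 0f;
        long guaranteedSum = 0;
        boolean hasTruncatedQueue = false;
        for (int i = 0; i < n; i++) {
            capacitySum += absCapacity[i];
            guaranteedSum += guaranteed[i];
            // A queue with configured capacity whose guarantee was cast to 0.
            if (guaranteed[i] == 0 && absCapacity[i] > 0f) {
                hasTruncatedQueue = true;
            }
        }
        for (int i = 0; i < n; i++) {
            if (capacitySum == 0f || guaranteedSum == 0) {
                // Scenarios 1 and 2: split evenly across active queues.
                normalized[i] = 1.0f / n;
            } else if (hasTruncatedQueue) {
                // Scenario 3: weight by configured capacity instead.
                normalized[i] = absCapacity[i] / capacitySum;
            } else {
                // No truncation: normalize by guaranteed values as before.
                normalized[i] = (float) guaranteed[i] / guaranteedSum;
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        // queue_low (1%) and queue_mid (19%) active; VCores guarantees 0 and 9.
        float[] shares = normalize(new long[] {0, 9}, new float[] {0.01f, 0.19f});
        System.out.println("queue_low share: " + shares[0]);
        System.out.println("queue_mid share: " + shares[1]);
    }
}
```

With queue_low and queue_mid both active, queue_low receives roughly 0.01 / (0.01 + 0.19), about 5%, rather than 0.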

This is how I currently work around the issue. It may need someone more familiar with this component to systematically review the entire preemption process and fix it properly. Possible options include always applying the weight-based approach using absoluteCapacity, rewriting the Resource code to remove the cast, or always rounding up when calculating a queue's guaranteed capacity.

Attachments

YARN-11073.tmp-1.patch
11/Feb/22 20:26
10 kB
Jian Chen

Issue Links

is depended upon by

YARN-11149 Add regression test cases for YARN-11073

Open

links to

GitHub Pull Request #4110

People

Assignee: Jian Chen

Reporter: Jian Chen

Votes: 0

Watchers: 5

Dates

Created: 11/Feb/22 11:38

Updated: 13/May/22 16:16

Resolved: 13/May/22 16:12

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

50m