Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version: 3.4.0
- Flags: Reviewed
Description
A rather serious overcommitment issue was discovered when using ABSOLUTE resources as capacities. A minimal way to reproduce the issue is the following:
- We have a cluster with 32 GiB memory and 16 VCores. Create the following hierarchy with the corresponding capacities:
  - root.capacity = [memory=54GiB, vcores=28]
  - root.a.capacity = [memory=50GiB, vcores=20]
  - root.a1.capacity = [memory=30GiB, vcores=15]
  - root.a2.capacity = [memory=20GiB, vcores=5]
- Remove a node from the cluster (this is not even an unusual event), e.g. a node with resource [memory=8GiB, vcores=4]
- Because the normalised resource ratio is calculated BEFORE the effective resource of the queue is recalculated, the stale ratio cascades down the hierarchy and results in an overcommitment in the queue hierarchy (see https://github.com/apache/hadoop/blob/5ef335da1ed49e06cc8973412952e09ed08bb9c0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java#L1294)
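The ordering problem can be sketched with a toy calculation. This is a minimal, hypothetical model (the class and method names below are invented for illustration, not the actual ParentQueue code): a queue's effective resource is capped at the cluster resource, and children are scaled by the ratio of the parent's effective resource to the sum of their configured absolute resources. If the ratio is derived from the parent's effective resource as it was BEFORE the node removal is folded in, the children's effective resources sum to more than the shrunken cluster:

```java
public class OvercommitDemo {
    // Simplified model: effective resource = min(configured, cluster).
    static double effective(double configuredGiB, double clusterGiB) {
        return Math.min(configuredGiB, clusterGiB);
    }

    public static void main(String[] args) {
        double parentConfigured = 50.0;          // root.a, GiB
        double child1 = 30.0, child2 = 20.0;     // root.a1, root.a2, GiB
        double clusterBefore = 32.0;             // full cluster memory
        double clusterAfter = 24.0;              // after an 8 GiB node is removed

        // Buggy order: the normalisation ratio is computed from the STALE
        // parent effective resource (still based on the 32 GiB cluster).
        double staleRatio = effective(parentConfigured, clusterBefore)
                / (child1 + child2);             // 32 / 50 = 0.64
        double buggySum = child1 * staleRatio + child2 * staleRatio; // ~32 GiB

        // Correct order: recalculate the parent effective resource first,
        // then derive the ratio from it.
        double freshRatio = effective(parentConfigured, clusterAfter)
                / (child1 + child2);             // 24 / 50 = 0.48
        double fixedSum = child1 * freshRatio + child2 * freshRatio;  // ~24 GiB

        System.out.println("cluster after removal = " + clusterAfter);
        System.out.println("children sum (buggy)  = " + buggySum);
        System.out.println("children sum (fixed)  = " + fixedSum);
    }
}
```

With the stale ratio the children are granted roughly 32 GiB against a 24 GiB cluster, i.e. an 8 GiB overcommitment; recalculating the effective resource before deriving the ratio keeps the sum within the cluster.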