Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
None
Description
The headroom check in `ParentQueue.canAssign` and `RegularContainerAllocator#checkHeadroom` does not account for the DRF case.
This causes many "Failed to accept allocation proposal" messages when a queue is nearly fully used.
In the log:
Headroom: memory:256, vCores:729
Request: memory:56320, vCores:5
clusterResource: memory:673966080, vCores:110494
If DRF is used, then
Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved), required);
evaluates to true, but in fact the request cannot be allocated because of the max limit (not enough memory).
2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: showRequests: application=application_1626747977559_95859 headRoom=<memory:256, vCores:729> currentConsumption=0
2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator: Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type Request: null, Node Label Expression: prod-best-effort-node}
.....
2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Try to commit allocation proposal=New org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest: ALLOCATED=[(Application=appattempt_1626747977559_95859_000001; Node=xxxx:8041; Resource=<memory:56320, vCores:5>)]
2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: userLimit is fetched. userLimit=<memory:7077376, vCores:1277>, userSpecificUserLimit=<memory:7077376, vCores:1277>, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Headroom calculation for user xxxxx: userLimit=<memory:7077376, vCores:1277> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, vCores:0> partition=prod-best-effort-node
2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the queue =<memory:7089920, vCores:1278>
2021-07-21 23:49:39,013 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
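The mismatch can be reproduced with the numbers from the log. The following is a minimal, self-contained sketch, not the actual Hadoop code: `Res`, `dominantShare`, `drfGreaterThanOrEqual` and `fitsIn` are hypothetical stand-ins written for illustration. It assumes a DRF-style comparison looks only at each side's dominant share of the cluster, and it treats `resourceCouldBeUnReserved` as zero.

```java
public class HeadroomCheckSketch {
    // Hypothetical stand-in for a YARN Resource: memory and vCores only.
    static final class Res {
        final long memory;
        final long vCores;

        Res(long memory, long vCores) {
            this.memory = memory;
            this.vCores = vCores;
        }
    }

    // Dominant share: the larger of the per-resource fractions of the cluster.
    static double dominantShare(Res r, Res cluster) {
        return Math.max((double) r.memory / cluster.memory,
                        (double) r.vCores / cluster.vCores);
    }

    // Simplified DRF-style comparison (assumption): only dominant shares are compared.
    static boolean drfGreaterThanOrEqual(Res lhs, Res rhs, Res cluster) {
        return dominantShare(lhs, cluster) >= dominantShare(rhs, cluster);
    }

    // Component-wise check: every resource type of the request must fit.
    static boolean fitsIn(Res required, Res available) {
        return required.memory <= available.memory
            && required.vCores <= available.vCores;
    }

    public static void main(String[] args) {
        // Values taken from the log in this report.
        Res cluster = new Res(673966080L, 110494L);
        Res headroom = new Res(256L, 729L);
        Res required = new Res(56320L, 5L);

        // Headroom is dominated by vCores (729/110494), the request by memory
        // (56320/673966080), so the DRF-style check passes.
        System.out.println("DRF headroom >= required: "
            + drfGreaterThanOrEqual(headroom, required, cluster));

        // Component-wise, 56320 memory does not fit into 256.
        System.out.println("request fits component-wise: "
            + fitsIn(required, headroom));
    }
}
```

Under these assumptions the DRF-style check reports that the headroom is sufficient, while the component-wise check shows that the memory part of the request cannot fit, which matches the rejected proposal in the log.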