Hadoop YARN / YARN-10903

Too many "Failed to accept allocation proposal" because of wrong Headroom check for DRF


Details

    Description

      The headroom checks in `ParentQueue.canAssign` and `RegularContainerAllocator#checkHeadroom` do not handle the DRF (Dominant Resource Fairness) case.

      This causes many "Failed to accept allocation proposal" messages when a queue is nearly fully used.
      For example, the log shows:
      Headroom: memory:256, vCores:729
      Request: memory:56320, vCores:5
      clusterResource: memory:673966080, vCores:110494
      When DRF is used,

      Resources.greaterThanOrEqual(rc, clusterResource, Resources.add(
          currentResourceLimits.getHeadroom(), resourceCouldBeUnReserved),
          required); 

      will be true, but the request cannot actually be allocated because of the max limit (not enough memory).
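To make the mismatch concrete, here is a minimal sketch that plugs the numbers from the log into two kinds of comparison. It is a simplified model, not the Hadoop implementation: `DrfHeadroomDemo`, `dominantShare`, `dominantGreaterOrEqual`, and `fitsComponentWise` are hypothetical names, the DRF-style comparison is assumed to look only at the dominant (largest) share, and `resourceCouldBeUnReserved` is assumed to be zero.

```java
// Simplified model of the comparison described above. All names here are
// hypothetical; this does not call any Hadoop classes.
class DrfHeadroomDemo {

    /** Largest fraction of the cluster used by either resource type. */
    static double dominantShare(long mem, long vcores,
                                long clusterMem, long clusterVcores) {
        return Math.max((double) mem / clusterMem,
                        (double) vcores / clusterVcores);
    }

    /** DRF-style "lhs >= rhs": assumed to compare dominant shares only. */
    static boolean dominantGreaterOrEqual(long lhsMem, long lhsVcores,
                                          long rhsMem, long rhsVcores,
                                          long clusterMem, long clusterVcores) {
        return dominantShare(lhsMem, lhsVcores, clusterMem, clusterVcores)
            >= dominantShare(rhsMem, rhsVcores, clusterMem, clusterVcores);
    }

    /** Component-wise check: every resource type of the request must fit. */
    static boolean fitsComponentWise(long reqMem, long reqVcores,
                                     long availMem, long availVcores) {
        return reqMem <= availMem && reqVcores <= availVcores;
    }

    public static void main(String[] args) {
        // Values copied from the log in this issue.
        long clusterMem = 673966080L, clusterVcores = 110494L;
        long headroomMem = 256L, headroomVcores = 729L;
        long requestMem = 56320L, requestVcores = 5L;

        // Headroom's dominant share comes from vCores (729/110494), which is
        // larger than the request's dominant share from memory (56320/673966080).
        System.out.println("dominant-share check passes: "
            + dominantGreaterOrEqual(headroomMem, headroomVcores,
                                     requestMem, requestVcores,
                                     clusterMem, clusterVcores));

        // The request needs 56320 memory, but only 256 is left.
        System.out.println("request fits component-wise: "
            + fitsComponentWise(requestMem, requestVcores,
                                headroomMem, headroomVcores));
    }
}
```

Under these assumptions the first line prints `true` and the second prints `false`: the headroom "wins" on vCores share even though its memory is far too small for the request.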

      2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: showRequests: application=application_1626747977559_95859 headRoom=<memory:256, vCores:729> currentConsumption=0
      2021-07-21 23:49:39,012 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.LocalityAppPlacementAllocator:  Request={AllocationRequestId: -1, Priority: 1, Capability: <memory:56320, vCores:5>, # Containers: 19, Location: *, Relax Locality: true, Execution Type Request: null, Node Label Expression: prod-best-effort-node}
      .....
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Try to commit allocation proposal=New org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.ResourceCommitRequest:
               ALLOCATED=[(Application=appattempt_1626747977559_95859_000001; Node=xxxx:8041; Resource=<memory:56320, vCores:5>)]
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.UsersManager: userLimit is fetched. userLimit=<memory:7077376, vCores:1277>, userSpecificUserLimit=<memory:7077376, vCores:1277>, schedulingMode=RESPECT_PARTITION_EXCLUSIVITY, partition=prod-best-effort-node
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Headroom calculation for user xxxxx:  userLimit=<memory:7077376, vCores:1277> queueMaxAvailRes=<memory:0, vCores:0> consumed=<memory:0, vCores:0> partition=prod-best-effort-node
      2021-07-21 23:49:39,013 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: Used resource=<memory:7077120, vCores:548> exceeded maxResourceLimit of the queue =<memory:7089920, vCores:1278>
      2021-07-21 23:49:39,013 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal
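The figures in the log lines above can be cross-checked with plain arithmetic. The sketch below is only one reading of the log, not scheduler code: `QueueLimitArithmetic` and its methods are hypothetical names, and it assumes the reported headroom equals the user limit minus the queue's used resource, which the numbers happen to match.

```java
// Arithmetic on the values printed in the log; no Hadoop classes are used.
class QueueLimitArithmetic {

    /** Assumed relationship: headroom = userLimit - used, per resource type. */
    static long headroom(long userLimit, long used) {
        return userLimit - used;
    }

    /** True if adding the request to the used amount would pass the maximum. */
    static boolean exceedsLimit(long used, long request, long max) {
        return used + request > max;
    }

    public static void main(String[] args) {
        // userLimit, used, and maxResourceLimit copied from the log lines.
        long userLimitMem = 7077376L, userLimitVcores = 1277L;
        long usedMem = 7077120L, usedVcores = 548L;
        long maxMem = 7089920L, maxVcores = 1278L;
        long requestMem = 56320L, requestVcores = 5L;

        // Matches headRoom=<memory:256, vCores:729> in the log.
        System.out.println("headroom memory = " + headroom(userLimitMem, usedMem));
        System.out.println("headroom vCores = " + headroom(userLimitVcores, usedVcores));

        // Memory would go past the queue maximum; vCores would not.
        System.out.println("memory exceeds max: "
            + exceedsLimit(usedMem, requestMem, maxMem));
        System.out.println("vCores exceed max: "
            + exceedsLimit(usedVcores, requestVcores, maxVcores));
    }
}
```

With the logged values, used memory plus the request is 7133440, above the queue maximum of 7089920, while vCores stay well under their limit. That is consistent with the proposal being rejected on memory alone.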
       

          People

            jackwangcs