Hadoop YARN / YARN-8771

CapacityScheduler fails to unreserve when cluster resource contains empty resource type


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0, 3.1.2
    • Component/s: capacityscheduler
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      We hit this problem when the cluster was nearly but not fully exhausted (93% used): the scheduler kept allocating containers for one app but always failed to commit them. This can block requests from other apps and leave part of the cluster's resources unusable.

      To reproduce this problem:
      (1) Use DominantResourceCalculator.
      (2) The cluster resource contains an empty resource type, for example gpu=0.
      (3) The scheduler allocates a container for app1, which has reserved containers and whose queue limit or user limit has been reached (used + required > limit).

      Relevant code in RegularContainerAllocator#assignContainer:

          // How much need to unreserve equals to:
          // max(required - headroom, amountNeedUnreserve)
          Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
          Resource resourceNeedToUnReserve =
              Resources.max(rc, clusterResource,
                  Resources.subtract(capability, headRoom),
                  currentResoureLimits.getAmountNeededUnreserve());
      
          boolean needToUnreserve =
              Resources.greaterThan(rc, clusterResource,
                  resourceNeedToUnReserve, Resources.none());
      

      For example, when headRoom=<0GB, 8 vcores, 0 gpu> and capability=<8GB, 2 vcores, 0 gpu>, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu>, and needToUnreserve, which is the result of Resources#greaterThan, will be false. This is wrong: the required resource does exceed the headroom (8GB of memory is required but 0GB is available), so unreserving is needed.
      Later in RegularContainerAllocator#assignContainer, the unreserve step is skipped whenever shouldAllocOrReserveNewContainer is true (required containers > reserved containers) and needToUnreserve has been wrongly computed as false:
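      The arithmetic in this example can be checked with a small standalone sketch. It does not use YARN classes: resources are modeled as {memoryMB, vcores, gpus} arrays, and the class and method names are hypothetical. The comparison assumes that, when the cluster resource has a zero-valued type, the calculator falls back to a per-type comparison that reports "equal" if one side is larger in some types and smaller in others; that assumption is what makes the greaterThan result false here.

```java
import java.util.Arrays;

/** Hypothetical, simplified model of the comparison in this issue; not YARN code. */
public class UnreserveCheckSketch {

    // Per-type subtraction; resources are modeled as {memoryMB, vcores, gpus}.
    static long[] subtract(long[] lhs, long[] rhs) {
        long[] out = new long[lhs.length];
        for (int i = 0; i < lhs.length; i++) {
            out[i] = lhs[i] - rhs[i];
        }
        return out;
    }

    // Assumed fallback comparison: 1 if lhs is larger in at least one type and
    // smaller in none, -1 for the reverse, 0 if the result is mixed or equal.
    static int compareComponentWise(long[] lhs, long[] rhs) {
        boolean lhsGreater = false;
        boolean rhsGreater = false;
        for (int i = 0; i < lhs.length; i++) {
            if (lhs[i] > rhs[i]) {
                lhsGreater = true;
            } else if (lhs[i] < rhs[i]) {
                rhsGreater = true;
            }
        }
        if (lhsGreater && rhsGreater) {
            return 0;
        }
        if (lhsGreater) {
            return 1;
        }
        return rhsGreater ? -1 : 0;
    }

    static boolean greaterThan(long[] lhs, long[] rhs) {
        return compareComponentWise(lhs, rhs) > 0;
    }

    public static void main(String[] args) {
        long[] capability = {8192, 2, 0}; // <8GB, 2 vcores, 0 gpu>
        long[] headRoom = {0, 8, 0};      // <0GB, 8 vcores, 0 gpu>
        long[] none = {0, 0, 0};

        long[] need = subtract(capability, headRoom);
        System.out.println(Arrays.toString(need));    // [8192, -6, 0]
        System.out.println(greaterThan(need, none));  // false
    }
}
```

      Memory is 8GB over the headroom, but the negative vcores entry makes the overall comparison inconclusive, so the "greater than none" test fails.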

          if (availableContainers > 0) {
            if (rmContainer == null && reservationsContinueLooking
                && node.getLabels().isEmpty()) {
              // The unreserve process can be wrongly skipped when
              // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
              // even though the required resource did exceed the headroom.
              if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
                ...
              }
            }
          }
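
      The effect of the guard can be sketched in isolation. The class below is hypothetical and uses no YARN classes; entersUnreserve mirrors the boolean condition quoted above, and anyTypeAboveZero is only an illustrative per-type check, not necessarily what the attached patches do.

```java
/** Hypothetical illustration of the unreserve guard; not YARN code. */
public class UnreserveGuardSketch {

    // Mirrors the condition quoted from RegularContainerAllocator#assignContainer.
    static boolean entersUnreserve(boolean shouldAllocOrReserveNewContainer,
                                   boolean needToUnreserve) {
        return !shouldAllocOrReserveNewContainer || needToUnreserve;
    }

    // Illustrative alternative: true if any single resource type is above zero.
    // Resources are modeled as {memoryMB, vcores, gpus}.
    static boolean anyTypeAboveZero(long[] resource) {
        for (long value : resource) {
            if (value > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] resourceNeedToUnReserve = {8192, -6, 0}; // <8GB, -6 vcores, 0 gpu>

        // With needToUnreserve wrongly computed as false, the branch is skipped.
        System.out.println(entersUnreserve(true, false)); // false

        // With a per-type check, the memory shortfall alone enters the branch.
        boolean perTypeNeed = anyTypeAboveZero(resourceNeedToUnReserve);
        System.out.println(entersUnreserve(true, perTypeNeed)); // true
    }
}
```

      When shouldAllocOrReserveNewContainer is true, the branch depends entirely on needToUnreserve, so any false negative in that value blocks unreserving.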
      

        Attachments

        1. YARN-8771.001.patch
          7 kB
          Tao Yang
        2. YARN-8771.002.patch
          7 kB
          Tao Yang
        3. YARN-8771.003.patch
          7 kB
          Tao Yang
        4. YARN-8771.004.patch
          7 kB
          Tao Yang


            People

            • Assignee:
              Tao Yang
            • Reporter:
              Tao Yang
