Hadoop YARN / YARN-9992

Max allocation per queue is zero for custom resource types on RM startup

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate

    Description

      Found an issue where GPU requests cannot be scheduled on a newly started RM. The request fails with the exception thrown in SchedulerUtils#throwInvalidResourceException:

      throw new InvalidResourceRequestException(
          "Invalid resource request, requested resource type=[" + reqResourceName
              + "] < 0 or greater than maximum allowed allocation. Requested "
              + "resource=" + reqResource + ", maximum allowed allocation="
              + availableResource
              + ", please note that maximum allowed allocation is calculated "
              + "by scheduler based on maximum resource of registered "
              + "NodeManagers, which might be less than configured "
              + "maximum allocation="
              + ResourceUtils.getResourceTypesMaximumAllocation());

      After refreshing the scheduler (e.g. via refreshQueues), GPU scheduling works again.

      I think the root cause is as follows. On scheduler refresh, resource-types.xml is loaded into CapacitySchedulerConfiguration (as part of YARN-7738), so when CapacitySchedulerConfiguration#getMaximumAllocationPerQueue calls ResourceUtils#fetchMaximumAllocationFromConfig, it can read the yarn.resource-types config. In CapacityScheduler#initScheduler, however, resource-types.xml is not loaded into the conf, so the custom resource is not found when max allocations are computed, and its max allocation ends up as 0.
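
      The suspected mechanism can be illustrated with a minimal, self-contained sketch. This is not Hadoop code: the class MaxAllocationSketch and its methods are hypothetical stand-ins, a plain Map replaces the real Configuration object, the value 8 is made up, and the per-type key yarn.resource-types.<name>.maximum-allocation is assumed here for illustration. It only models the behavior described above: a conf that never saw resource-types.xml yields a max allocation of 0 for the custom resource, so any positive GPU request fails the range check.

```java
import java.util.HashMap;
import java.util.Map;

public class MaxAllocationSketch {

    // Hypothetical stand-in for the max-allocation lookup (not Hadoop's implementation).
    // Returns 0 when the resource type is not listed in this conf at all.
    static long maxAllocation(Map<String, String> conf, String resource) {
        String types = conf.getOrDefault("yarn.resource-types", "");
        for (String type : types.split(",")) {
            if (type.trim().equals(resource)) {
                // Assumed key layout; the unset default is a sketch assumption.
                String key = "yarn.resource-types." + resource + ".maximum-allocation";
                return Long.parseLong(
                    conf.getOrDefault(key, String.valueOf(Long.MAX_VALUE)));
            }
        }
        return 0L; // custom resource unknown to this conf
    }

    // Mirrors the condition in the exception message:
    // a request is invalid if it is < 0 or greater than the maximum allowed allocation.
    static boolean isValidRequest(long requested, long max) {
        return requested >= 0 && requested <= max;
    }

    // Models the conf at RM startup: resource-types.xml was never loaded.
    static Map<String, String> startupConf() {
        return new HashMap<>();
    }

    // Models the conf after a scheduler refresh: resource-types.xml is loaded.
    static Map<String, String> refreshedConf() {
        Map<String, String> conf = new HashMap<>();
        conf.put("yarn.resource-types", "yarn.io/gpu");
        conf.put("yarn.resource-types.yarn.io/gpu.maximum-allocation", "8"); // made-up value
        return conf;
    }

    public static void main(String[] args) {
        String gpu = "yarn.io/gpu";
        long requested = 1;

        long atStartup = maxAllocation(startupConf(), gpu);
        System.out.println("startup max=" + atStartup
            + " valid=" + isValidRequest(requested, atStartup));

        long afterRefresh = maxAllocation(refreshedConf(), gpu);
        System.out.println("refreshed max=" + afterRefresh
            + " valid=" + isValidRequest(requested, afterRefresh));
    }
}
```

      In this model the startup case rejects a request for a single GPU because the computed maximum is 0, while the refreshed case accepts it, which matches the observed behavior before and after refreshQueues.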

      Attachments

        1. YARN-9992.001.patch
          1 kB
          Jonathan Hung

            People

              jhung Jonathan Hung
              Votes: 0
              Watchers: 2