Hadoop YARN / YARN-11067

Resource overcommitment due to incorrect resource normalisation logical order


Details

    • Reviewed

    Description

      A rather serious overcommitment issue was discovered when using ABSOLUTE resources as capacities. A minimal way to reproduce the issue is the following:

      1. We have a cluster with 32 GB memory and 16 VCores. Create the following hierarchy with the corresponding capacities:
        1. root.capacity = [memory=54GiB, vcores=28]
        2. root.a.capacity = [memory=50GiB, vcores=20]
        3. root.a1.capacity = [memory=30GiB, vcores=15]
        4. root.a2.capacity = [memory=20GiB, vcores=5]
      2. Remove a node from the cluster (this is not even an unusual event), e.g. a node with resource [memory=8GiB, vcores=4].
      3. Because the normalised resource ratio is calculated BEFORE the effective resource of the queue is recalculated, a stale ratio cascades down the hierarchy and results in overcommitment (see https://github.com/apache/hadoop/blob/5ef335da1ed49e06cc8973412952e09ed08bb9c0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java#L1294)
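      The ordering problem above can be sketched with a simplified numeric model. The helper functions and per-resource maps below are illustrative assumptions for this write-up, not the actual CapacityScheduler/ParentQueue code:

```python
# Simplified sketch of the normalisation-order bug (illustrative only,
# not the real ParentQueue implementation).

def normalised_ratio(cluster, configured):
    # Per-resource downscaling factor, capped at 1 when the configured
    # absolute capacity already fits into the cluster.
    return {r: min(1.0, cluster[r] / configured[r]) for r in cluster}

def effective_resource(configured, ratio):
    # Effective resource = configured absolute capacity scaled by the ratio.
    return {r: configured[r] * ratio[r] for r in configured}

cluster = {"memory_gib": 32, "vcores": 16}          # initial cluster
root_configured = {"memory_gib": 54, "vcores": 28}  # root.capacity

# A node with [memory=8GiB, vcores=4] is removed from the cluster.
shrunk_cluster = {"memory_gib": 24, "vcores": 12}

# Buggy order: the ratio is taken BEFORE the effective resource is
# recalculated, so it still reflects the old 32GiB/16-vcore cluster.
stale_ratio = normalised_ratio(cluster, root_configured)
buggy = effective_resource(root_configured, stale_ratio)

# root keeps its old effective resource although less remains in the
# cluster: overcommitment that cascades down to root.a, root.a1, root.a2.
overcommit = {r: buggy[r] - shrunk_cluster[r] for r in shrunk_cluster}

# Correct order: recompute the ratio AFTER the cluster resource changed.
fresh_ratio = normalised_ratio(shrunk_cluster, root_configured)
fixed = effective_resource(root_configured, fresh_ratio)
```

      With these numbers the stale ratio leaves root at roughly 32GiB/16 vcores on a cluster that now only has 24GiB/12 vcores, while the recomputed ratio brings root's effective resource back down to the shrunk cluster size.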

       

People

Assignee: gandras Andras Gyori
Reporter: gandras Andras Gyori
Votes: 0
Watchers: 2

Dates

Created:
Updated:
Resolved:

Time Tracking

Estimated: Not Specified
Remaining: 0h
Logged: 1h 10m