Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-7560

Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Labels:
      None
    • Target Version/s:

      Description

      In our cluster, we changed the configuration, then refreshQueues, we found the resourcemanager hangs. And the Resourcemanager can't restart successfully. We got jstack information, always show like this:

      "main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
         java.lang.Thread.State: RUNNABLE
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
              - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
              - locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              - locked <0x00007f8c4a76ac48> (a java.lang.Object)
              at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              - locked <0x00007f8c49254268> (a java.lang.Object)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              - locked <0x00007f8c467495e0> (a java.lang.Object)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
      

      When we debug the cluster, we found resourceUsedWithWeightToResourceRatio return a negative value. So the loop can't return. We found in our cluster, the sum of all minRes is over int.max, so resourceUsedWithWeightToResourceRatio return a negative value.

      below is the loop. Because totalResource is long, so always postive. But resourceUsedWithWeightToResourceRatio return int type. Our cluster is so big that resourceUsedWithWeightToResourceRatio will return a overflow value, just a negative. So the loop will never break.

          while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
              < totalResource) {
            rMax *= 2.0;
          }
      

        Attachments

        1. YARN-7560.001.patch
          5 kB
          zhengchenyu
        2. YARN-7560.000.patch
          5 kB
          zhengchenyu

          Issue Links

            Activity

              People

              • Assignee:
                zhengchenyu zhengchenyu
                Reporter:
                zhengchenyu zhengchenyu
              • Votes:
                2 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: