Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- 3.0.0
- None
- None
Description
In our cluster, we changed the configuration and then ran refreshQueues, after which the ResourceManager hung. The ResourceManager also could not be restarted successfully. Every jstack dump we took showed the main thread in the same place:
"main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
	- locked <0x00007f8c4a8177a0> (a java.util.HashMap)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
	- locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	- locked <0x00007f8c4a76ac48> (a java.lang.Object)
	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	- locked <0x00007f8c49254268> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
	- locked <0x00007f8c467495e0> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop never exits. In our cluster the sum of all minRes values exceeds Integer.MAX_VALUE, which is why resourceUsedWithWeightToResourceRatio returns a negative value.
The loop is shown below. totalResource is a long and is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is so large that the int result overflows and becomes negative, so the loop condition is always true and the loop never breaks.
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
    < totalResource) {
  rMax *= 2.0;
}
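The overflow can be reproduced outside Hadoop with a small sketch. This is not the ComputeFairShares code: the class FairShareOverflowSketch, the helpers share, usedAsInt, usedAsLong and doublings, the simplified share formula, and the iteration cap are all hypothetical and exist only to illustrate the arithmetic. It assumes two queues whose minRes values each fit in an int but whose sum does not.

```java
// Hypothetical sketch of the overflow described above; not Hadoop code.
final class FairShareOverflowSketch {

    // Simplified share: the ratio, raised to at least minRes and capped at maxRes.
    static long share(double ratio, long minRes, long maxRes) {
        double s = Math.max(ratio, (double) minRes);
        return (long) Math.min(s, (double) maxRes);
    }

    // Accumulates in an int, so the total wraps once it passes Integer.MAX_VALUE.
    static int usedAsInt(double ratio, long[] minRes, long maxRes) {
        int total = 0;
        for (long m : minRes) {
            total += (int) share(ratio, m, maxRes);
        }
        return total;
    }

    // Same sum accumulated in a long, which does not wrap for these magnitudes.
    static long usedAsLong(double ratio, long[] minRes, long maxRes) {
        long total = 0L;
        for (long m : minRes) {
            total += share(ratio, m, maxRes);
        }
        return total;
    }

    // Mirrors the doubling loop, with a cap so the demo terminates.
    // Returning the cap means the loop condition never became false.
    static int doublings(long[] minRes, long maxRes, long totalResource,
                         boolean useLong, int cap) {
        double rMax = 1.0;
        int n = 0;
        while (n < cap) {
            long used = useLong
                ? usedAsLong(rMax, minRes, maxRes)
                : usedAsInt(rMax, minRes, maxRes);
            if (used >= totalResource) {
                break;
            }
            rMax *= 2.0;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        long[] minRes = {1_500_000_000L, 1_500_000_000L}; // sum exceeds Integer.MAX_VALUE
        long maxRes = Integer.MAX_VALUE;
        long totalResource = 2_500_000_000L;
        System.out.println("int sum:  " + usedAsInt(1.0, minRes, maxRes));
        System.out.println("long sum: " + usedAsLong(1.0, minRes, maxRes));
        System.out.println("doublings (int):  " + doublings(minRes, maxRes, totalResource, false, 100));
        System.out.println("doublings (long): " + doublings(minRes, maxRes, totalResource, true, 100));
    }
}
```

With the int accumulator the sum is negative for every value of rMax, so only the cap stops the loop. With the long accumulator the loop exits immediately. Widening the accumulator is one possible remedy shown here for illustration; the change actually made upstream is tracked in the linked issue.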
Attachments
Issue Links
- duplicates YARN-9173 FairShare calculation broken for large values after YARN-8833 (Resolved)