  Hadoop YARN / YARN-3251

Fix CapacityScheduler deadlock when computing absolute max avail capacity (short term fix for 2.6.1)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.6.1
    • Component/s: None
    • Labels:
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The ResourceManager can deadlock in the CapacityScheduler when computing the absolute max available capacity for user limits and headroom.

      Attachments

      1. YARN-3251.1.patch
        7 kB
        Craig Welch
      2. YARN-3251.2.patch
        4 kB
        Craig Welch
      3. YARN-3251.2-6-0.2.patch
        6 kB
        Craig Welch
      4. YARN-3251.2-6-0.3.patch
        3 kB
        Craig Welch
      5. YARN-3251.2-6-0.4.patch
        3 kB
        Craig Welch

        Issue Links

          Activity

          jlowe Jason Lowe added a comment -

          Sample stack trace:

          Found one Java-level deadlock:
          =============================
          "IPC Server handler 71 on 8032":
            waiting to lock monitor 0x00000000037f9120 (object 0x000000023b060ad8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
            which is held by "ResourceManager Event Processor"
          "ResourceManager Event Processor":
            waiting to lock monitor 0x0000000002c4b7d0 (object 0x000000023aecf620, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
            which is held by "IPC Server handler 71 on 8032"
          
          Java stack information for the threads listed above:
          ===================================================
          "IPC Server handler 71 on 8032":
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:451)
          	- waiting to lock <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
          	- locked <0x000000023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
          	- locked <0x000000023af36e70> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:214)
          	- locked <0x000000023b0d9478> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:910)
          	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:832)
          	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:259)
          	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:413)
          	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
          	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2079)
          	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2075)
          	at java.security.AccessController.doPrivileged(Native Method)
          	at javax.security.auth.Subject.doAs(Subject.java:415)
          	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
          	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2073)
          "ResourceManager Event Processor":
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getParent(AbstractCSQueue.java:185)
          	- waiting to lock <0x000000023aecf620> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:177)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:183)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1033)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.checkLimitsToReserve(LeafQueue.java:1341)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1611)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1399)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1278)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:893)
          	- locked <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:758)
          	- locked <0x000000023ceb53e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
          	- locked <0x000000023b060ad8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:992)
          	- locked <0x000000023ae2fbd0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1059)
          	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114)
          	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:680)
          	at java.lang.Thread.run(Thread.java:722)
          
          Found 1 deadlock.
          
          jlowe Jason Lowe added a comment -

          It looks like this is fallout from YARN-2008. CSQueueUtils.getAbsoluteMaxAvailCapacity is called with the lock held on the LeafQueue and walks up the tree, attempting to grab locks on parents as it goes. That's contrary to the conventional order of locking while walking down the tree, and thus we can deadlock.
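           A minimal sketch of the lock-order inversion described above (illustrative stand-in classes only, not the actual YARN code). One thread locks the parent and then the child, mirroring getQueueInfo() walking down the tree; the other holds the child's monitor and then asks for the parent's, mirroring the headroom computation walking up. Run concurrently, each can end up waiting on the monitor the other already holds.

           // Sketch only: simplified stand-ins for ParentQueue/LeafQueue to show the hazard.
           class ParentQueueSketch {
               final LeafQueueSketch child = new LeafQueueSketch(this);

               synchronized void getQueueInfo() {     // locks the parent first...
                   child.getQueueInfo();              // ...then tries to lock the child
               }

               synchronized float getAbsoluteMaxCapacity() {
                   return 1.0f;                       // placeholder value
               }
           }

           class LeafQueueSketch {
               private final ParentQueueSketch parent;

               LeafQueueSketch(ParentQueueSketch parent) { this.parent = parent; }

               synchronized void getQueueInfo() { /* build the queue report */ }

               synchronized void computeHeadroom() {  // locks the child first...
                   parent.getAbsoluteMaxCapacity();   // ...then tries to lock the parent: deadlock risk
               }
           }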

          sunilg Sunil G added a comment -

          Jason Lowe
          Recent getAbsoluteMaxAvailCapacity changes cause this.

          sunilg Sunil G added a comment -

           It's better to compute the available capacity during the call to root.assignContainers. In that scenario, a simple getter can then retrieve the precomputed available capacity.

          jlowe Jason Lowe added a comment -

          YARN-3243 could remove the need to climb up the hierarchy to compute max avail capacity.

          leftnoteasy Wangda Tan added a comment -

          Thanks for reporting this, Jason Lowe!

           Since this is a blocker for 2.7, I will first create a patch using the method described in YARN-3243 before working on other related refactorings. I have added this as a sub-task of YARN-3251.

          cwelch Craig Welch added a comment -

           It looks like this can occur when a call that walks down the queue tree (in this case, getQueueInfo()) happens at the same time as an assignContainers call that does not start from the root queue; specifically, one for a reservedContainer when scheduleAsynchronously is false. Essentially, it isn't safe to hold a lock on a queue while locking a parent queue (as I now see noted in other methods in LeafQueue :/).

           YARN-3243 is potentially a long-term fix, but it would be nice to fix this right away, as it is clearly already problematic. Also, YARN-3243 depends on a number of other sizable changes that have gone in recently, which will make it difficult to apply as a fix to older codebases, where a fix would be very welcome.

           I've attached a patch somewhat along the lines suggested by Sunil G. It simply moves the acquisition of absoluteMaxAvailCapacity outside the lock on the leaf queue: it will still lock parent queues individually as it ascends, but it never holds a parent and a child lock simultaneously, which is the unacceptable state. It follows the pattern of other methods in LeafQueue, such as recoverContainer, which access parent queues; they are all careful to make sure the parent queue access occurs outside any lock on the queue itself.

           Unfortunately, it's not possible to just do this in root.assignContainers because of the reservedContainer case, which never invokes assignContainers on the root queue. Instead, absoluteMaxAvailCapacity is determined in assignContainers, outside any lock on the leaf queue, before entering the synchronized method that continues the logic as it is today.

           This looks to me to be the way to fix the issue with the smallest code change today, pending other changes coming down the line.
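
           A rough sketch of the pattern described in this comment, under simplified and hypothetical names (this is not the actual patch): the upward walk to compute absoluteMaxAvailCapacity runs before the leaf queue's monitor is taken, and the existing synchronized logic only reads the precomputed value.

           // Hedged sketch of the "compute outside the leaf lock" pattern; names are illustrative.
           class LeafQueueFixSketch {
               interface Queue { float getAbsoluteMaxAvailCapacity(); }

               private final Queue parent;
               // Written outside the leaf lock; volatile keeps the value visible to the
               // synchronized scheduling code that reads it.
               private volatile float absoluteMaxAvailCapacity = 1.0f;

               LeafQueueFixSketch(Queue parent) { this.parent = parent; }

               // The entry point is NOT synchronized, so the upward walk never runs while this
               // queue's own monitor is held (no parent and child locks held together).
               public void assignContainers() {
                   absoluteMaxAvailCapacity = parent.getAbsoluteMaxAvailCapacity();
                   assignContainersInternal();
               }

               // The existing synchronized logic continues here and only reads the cached value.
               private synchronized void assignContainersInternal() {
                   float headroomFactor = absoluteMaxAvailCapacity;
                   // ... user-limit / headroom computation using headroomFactor ...
               }
           }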

          leftnoteasy Wangda Tan added a comment -

          Craig Welch,
          Some comments,
           1) Since the target of your patch is to make a quick fix for an old version, it's better to create the patch against branch-2.6, and the patch you create will be committed to branch-2.6 as well. I noticed some functionality and interfaces used in your patch are not part of 2.6. Also, the patch I'm working on now will remove CSQueueUtils.computeMaxAvailResource, so there's no need to add an intermediate fix in branch-2.
           2) I think CSQueueUtils.getAbsoluteMaxAvailCapacity doesn't hold a child's and parent's lock together, so maybe we don't need to change it; could you confirm?
           3) Maybe we don't need a getter/setter for absoluteMaxAvailCapacity in the queue; would a volatile float be enough?

          Thanks,

          leftnoteasy Wangda Tan added a comment -

          Attached ver.1 patch against trunk (YARN-3251.trunk.1.patch)

          leftnoteasy Wangda Tan added a comment -

           Found a few typos; removed the patch and will upload again.

          leftnoteasy Wangda Tan added a comment -

          Attached ver.1 patch against trunk

          vinodkv Vinod Kumar Vavilapalli added a comment -

           Since the target of your patch is to make a quick fix for an old version, it's better to create the patch against branch-2.6, and the patch you create will be committed to branch-2.6 as well. I noticed some functionality and interfaces used in your patch are not part of 2.6. Also, the patch I'm working on now will remove CSQueueUtils.computeMaxAvailResource, so there's no need to add an intermediate fix in branch-2.

          How about we have a separate JIRA solely focused on 2.6.1 - as we have two separate patches and two different contributors?

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12700874/YARN-3251.trunk.1.patch
          against trunk revision caa42ad.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 6 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestResourceUsage

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6743//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6743//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6743//console

          This message is automatically generated.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12700875/YARN-3251.trunk.1.patch
          against trunk revision caa42ad.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestResourceUsage

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6744//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6744//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6744//console

          This message is automatically generated.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Actually let's use this JIRA for 2.6.1 and YARN-3243 for the trunk fix.

          leftnoteasy Wangda Tan added a comment -

           +1 for the suggestion; moved it out of YARN-3243, changed the target version, and also updated the title.

          leftnoteasy Wangda Tan added a comment -

          Linked to YARN-3265.

          leftnoteasy Wangda Tan added a comment -

          Removed patch for trunk and uploaded the same one to YARN-3265, reassigned this to Craig Welch.

          cwelch Craig Welch added a comment -

          Patch against branch-2.6.0

          cwelch Craig Welch added a comment -

           1) Since the target of your patch is to make a quick fix for an old version, it's better to create the patch against branch-2.6

          done

           Also, the patch I'm working on now will remove CSQueueUtils.computeMaxAvailResource, so there's no need to add an intermediate fix in branch-2.

           I suppose that depends on whether anyone needs a trunk version of the patch before the other changes land; if someone asks for it, I could quickly update the original patch to provide it.

           2) I think CSQueueUtils.getAbsoluteMaxAvailCapacity doesn't hold a child's and parent's lock together, so maybe we don't need to change it; could you confirm?

           It doesn't. The change there was to ensure consistency for multiple values used from the queue, since previously the call occurred inside a lock and that was guaranteed, and now it isn't. However, there's no need to lock on the parent, so I removed that.

           3) Maybe we don't need a getter/setter for absoluteMaxAvailCapacity in the queue; would a volatile float be enough?

           Yes, that should be safe; done.
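
           For reference, the simplification agreed on in point 3 amounts to replacing a synchronized getter/setter pair with a single volatile field; a minimal, hypothetical illustration (not the actual patch code):

           // Before (hypothetical): visibility guaranteed by synchronizing both accessors.
           class WithSynchronizedAccessors {
               private float absoluteMaxAvailCapacity;
               synchronized void setAbsoluteMaxAvailCapacity(float v) { absoluteMaxAvailCapacity = v; }
               synchronized float getAbsoluteMaxAvailCapacity() { return absoluteMaxAvailCapacity; }
           }

           // After (hypothetical): for independent reads and writes of a single value,
           // a volatile float gives the same visibility guarantee without taking any monitor.
           class WithVolatileField {
               volatile float absoluteMaxAvailCapacity;
           }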

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12700977/YARN-3251.2-6-0.2.patch
          against trunk revision dce8b9c.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6759//console

          This message is automatically generated.

          cwelch Craig Welch added a comment -

           Removing the CSQueueUtils changes.

          cwelch Craig Welch added a comment -

           Sorry if that wasn't clear; to reduce risk, I removed the minor changes in CSQueueUtils.

          leftnoteasy Wangda Tan added a comment -

          LGTM +1, I will commit the patch to branch-2.6 this afternoon if opposite opinions. Thanks!

          leftnoteasy Wangda Tan added a comment -

          if opposite opinions -> if no opposite opinions

          cwelch Craig Welch added a comment -

           Minor: switch to "Internal", which seems to be more common in the codebase.

          cwelch Craig Welch added a comment -

           Attaching an analogue of the most recent patch against trunk. I do not believe we will be committing this at this point, as Wangda Tan is working on a more significant change that will remove the need for it, but I wanted to make it available just in case. For clarity, the patch against trunk is YARN-3251.2.patch and the patch to commit against 2.6 is YARN-3251.2-6-0.4.patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12701150/YARN-3251.2.patch
          against trunk revision dce8b9c.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6760//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6760//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6760//console

          This message is automatically generated.

          cwelch Craig Welch added a comment -

           Removing Patch Available status, as the patch is likely not going into trunk; it was only submitted to check the build.

           None of the Findbugs warnings are in code related to the change.
           TestFSRMStateStore is not failing on my box with the change after several tries; assuming it is an unrelated intermittent or build server issue.

          leftnoteasy Wangda Tan added a comment -

          Checking this into branch-2.6

          leftnoteasy Wangda Tan added a comment -

          Just compiled and ran all tests in CapacityScheduler, committed to branch-2.6.

           Thanks to Craig Welch, and also for the reviews from Jason Lowe, Sunil G and Vinod Kumar Vavilapalli.

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing old tickets that are already part of a release.


            People

            • Assignee:
              cwelch Craig Welch
              Reporter:
              jlowe Jason Lowe
            • Votes:
              0
            • Watchers:
              15

              Dates

              • Created:
                Updated:
                Resolved:
