
YARN-4344: NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations

    Details

    • Hadoop Flags:
      Reviewed

      Description

      After YARN-3802, if an NM re-connects to the RM with changed capabilities, the overall cluster resource calculation can become incorrect, leading to inconsistencies in scheduling.

        Attachments

      1. YARN-4344.001.patch
        11 kB
        Varun Vasudev
      2. YARN-4344.002.patch
        12 kB
        Varun Vasudev
      3. YARN-4344-branch-2.6.001.patch
        12 kB
        Varun Vasudev

        Issue Links

          Activity

          vvasudev Varun Vasudev added a comment -

          An example of such a situation is shown below -

          2015-11-09 10:43:51,784 INFO  resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(345)) - NodeManager from node 10.0.0.64(cmPort: 30050 httpPort: 30060) registered with capability: <memory:5632, vCores:8>, assigned nodeId 10.0.0.64:30050
          2015-11-09 10:43:51,786 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(434)) - 10.0.0.64:30050 Node Transitioned from NEW to RUNNING
          2015-11-09 10:43:51,814 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addNode(1193)) - Added node 10.0.0.64:30050 clusterResource: <memory:5632, vCores:8>
          2015-11-09 10:44:37,878 INFO  util.RackResolver (RackResolver.java:coreResolve(109)) - Resolved 10.0.0.63 to /default-rack
          2015-11-09 10:44:37,879 INFO  resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(345)) - NodeManager from node 10.0.0.63(cmPort: 30050 httpPort: 30060) registered with capability: <memory:10240, vCores:4>, assigned nodeId 10.0.0.63:30050
          2015-11-09 10:44:37,879 INFO  rmnode.RMNodeImpl (RMNodeImpl.java:handle(434)) - 10.0.0.63:30050 Node Transitioned from NEW to RUNNING
          2015-11-09 10:44:37,882 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addNode(1193)) - Added node 10.0.0.63:30050 clusterResource: <memory:15872, vCores:12>
          2015-11-09 10:44:39,307 INFO  util.RackResolver (RackResolver.java:coreResolve(109)) - Resolved 10.0.0.64 to /default-rack
          2015-11-09 10:44:39,309 INFO  resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(313)) - Reconnect from the node at: 10.0.0.64
          2015-11-09 10:44:39,312 INFO  resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(345)) - NodeManager from node 10.0.0.64(cmPort: 30050 httpPort: 30060) registered with capability: <memory:10240, vCores:4>, assigned nodeId 10.0.0.64:30050
          2015-11-09 10:44:39,314 INFO  capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1247)) - Removed node 10.0.0.64:30050 clusterResource: <memory:5632, vCores:8>
          2015-11-09 10:44:39,315 INFO  capacity.CapacityScheduler (CapacityScheduler.java:addNode(1193)) - Added node 10.0.0.64:30050 clusterResource: <memory:15872, vCores:12>
          

          In this case, NMs from 10.0.0.64 and 10.0.0.63 registered, leading to a total cluster resource of <memory:15872, vCores:12>. After that, 10.0.0.64 re-connected with changed capabilities (from <memory:5632, vCores:8> to <memory:10240, vCores:4>). This should have led to the cluster resources becoming <memory:20480, vCores:8>, but instead they are calculated to be <memory:15872, vCores:12>.
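
          For clarity, the arithmetic behind these totals (derived from the log above, assuming only these two nodes in the cluster):

          before reconnect:  <memory:5632, vCores:8>  + <memory:10240, vCores:4> = <memory:15872, vCores:12>
          expected after:    <memory:10240, vCores:4> + <memory:10240, vCores:4> = <memory:20480, vCores:8>
          observed after:    <memory:15872, vCores:12> - the removal subtracted the node's new capability
                             (<memory:10240, vCores:4>) instead of its old one (<memory:5632, vCores:8>),
                             and the re-add then added the new capability back, leaving the total unchanged.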

          The root cause is this piece of code from RMNodeImpl -

          rmNode.context.getDispatcher().getEventHandler().handle(
              new NodeRemovedSchedulerEvent(rmNode));

          if (!rmNode.getTotalCapability().equals(
              newNode.getTotalCapability())) {
            rmNode.totalCapability = newNode.getTotalCapability();
          

          If the dispatcher is delayed in processing the event, then by the time the node-removed event is handled, rmNode.totalCapability = newNode.getTotalCapability() has already executed, so the resources removed from the cluster total are the node's new capabilities rather than its old ones.
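
          To make the ordering concrete, here is a minimal, self-contained sketch of this race (stand-in classes only, not YARN code). The single-threaded executor takes the role of the RM's async dispatcher, and a latch simulates a dispatcher that is delayed in processing the removal event:

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.TimeUnit;

          // Stand-ins only, NOT the real YARN classes: nodeMemory plays the role of
          // rmNode.totalCapability and the single-threaded executor plays the role of
          // the RM's async dispatcher.
          public class ReconnectRaceSketch {
            public static void main(String[] args) throws Exception {
              ExecutorService dispatcher = Executors.newSingleThreadExecutor();
              CountDownLatch dispatcherDelayed = new CountDownLatch(1);

              final int[] nodeMemory = {5632};             // 10.0.0.64's old capability
              final int[] clusterMemory = {5632 + 10240};  // both nodes registered

              // ReconnectNodeTransition fires the "node removed" event first...
              dispatcher.submit(() -> {
                try { dispatcherDelayed.await(); } catch (InterruptedException ignored) { }
                // By the time the event is handled, the capability has already been
                // overwritten, so the NEW value is subtracted instead of the old one.
                clusterMemory[0] -= nodeMemory[0];
              });

              // ...then overwrites the capability without waiting for the scheduler
              // to process the removal (the problematic ordering).
              nodeMemory[0] = 10240;
              dispatcherDelayed.countDown();

              // The "node added" event correctly adds the new capability.
              dispatcher.submit(() -> { clusterMemory[0] += nodeMemory[0]; });

              dispatcher.shutdown();
              dispatcher.awaitTermination(1, TimeUnit.MINUTES);

              // Expected 20480 (10240 + 10240), but prints 15872, matching the log above.
              System.out.println("cluster memory = " + clusterMemory[0]);
            }
          }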

          vvasudev Varun Vasudev added a comment -

          Uploaded a patch with the fix. zhihai xu, Jason Lowe - can you please take a look?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 6s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 3m 6s trunk passed
          +1 compile 0m 21s trunk passed with JDK v1.8.0_60
          +1 compile 0m 23s trunk passed with JDK v1.7.0_79
          +1 checkstyle 0m 11s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 1m 6s trunk passed
          +1 javadoc 0m 21s trunk passed with JDK v1.8.0_60
          +1 javadoc 0m 25s trunk passed with JDK v1.7.0_79
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 21s the patch passed with JDK v1.8.0_60
          +1 javac 0m 21s the patch passed
          +1 compile 0m 23s the patch passed with JDK v1.7.0_79
          +1 javac 0m 23s the patch passed
          +1 checkstyle 0m 12s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 17s the patch passed
          +1 javadoc 0m 21s the patch passed with JDK v1.8.0_60
          +1 javadoc 0m 25s the patch passed with JDK v1.7.0_79
          -1 unit 59m 49s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60.
          -1 unit 61m 1s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79.
          +1 asflicense 0m 21s Patch does not generate ASF License warnings.
          132m 2s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
            hadoop.yarn.server.resourcemanager.TestClientRMTokens
          JDK v1.7.0_79 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.scheduler.TestAbstractYarnScheduler
            hadoop.yarn.server.resourcemanager.TestClientRMTokens



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-11
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12771722/YARN-4344.001.patch
          JIRA Issue YARN-4344
          Optional Tests asflicense javac javadoc mvninstall unit findbugs checkstyle compile
          uname Linux a9687b820f5f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-ee5baeb/precommit/personality/hadoop.sh
          git revision trunk / 23d0db5
          Default Java 1.7.0_79
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9655/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9655/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9655/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-YARN-Build/9655/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9655/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 224MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9655/console

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for the patch, Varun! I think the change will fix the reported issue, but I'm a bit skeptical of the vastly different handling of the event based on whether apps are running or not. For example, if the http port is changing when the node re-registers, why are we treating it as a node removal then addition if there aren't any apps running but not if there are apps running? Seems like that should be consistent.

          Comments on the patch itself:

          The comment about sending the node removal event at the start of the main block in the transition is no longer very accurate.

          Please don't put large sleeps (on the order of seconds) in tests. These extra sleep seconds quickly add up to a significant amount of time over many tests. If we need to sleep for polling reasons the sleep should be much shorter, like on the order of 10ms. Better than sleep-polling is flushing the event dispatcher and then checking since we can avoid polling entirely.
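
          (For illustration only, not part of the patch: a rough sketch of the "flush the dispatcher, then assert" pattern, using a hypothetical single-threaded test dispatcher. The dispatch/drain names below are made up for this sketch; in the RM tests this role is played by draining the actual async dispatcher rather than by Thread.sleep() polling.)

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;
          import java.util.concurrent.TimeUnit;

          // Hypothetical stand-in for an async event dispatcher in a unit test.
          class TestDispatcher {
            private final ExecutorService executor = Executors.newSingleThreadExecutor();

            void dispatch(Runnable event) {
              executor.submit(event);
            }

            // "Flush": block until every previously dispatched event has been handled,
            // so the test can assert immediately afterwards instead of sleep-polling.
            void drain() throws Exception {
              executor.submit(() -> { }).get(10, TimeUnit.SECONDS);
            }
          }

          // Usage in a test (pseudocode for the assertion step):
          //   dispatcher.dispatch(nodeRemovedEvent);
          //   dispatcher.dispatch(nodeAddedEvent);
          //   dispatcher.drain();              // no Thread.sleep(...)
          //   assertEquals(expectedClusterResource, scheduler.getClusterResource());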

          Nit: isCapabilityChanged init can be simplified to the following, similar to the noRunningApps boolean init above it:

                boolean isCapabilityChanged =
                    !rmNode.getTotalCapability().equals(newNode.getTotalCapability());
           

          Nit: is this conditional check even necessary? We can just update the total capability with no semantic effect if it hasn't changed. Since this is just updating a reference with another precomputed one, it's not like we're avoiding some expensive code.

                  if (isCapabilityChanged) {
                    rmNode.totalCapability = newNode.getTotalCapability();
                  }
          
          zxu zhihai xu added a comment -

          Thanks for reporting this issue, Varun Vasudev! Thanks for the review, Jason Lowe!
          Rohith Sharma K S tried to clean up the code at YARN-3286. Based on the following comment from Jian He at YARN-3286,

          I think this has changed the behavior that without any RM/NM restart features enabled, earlier restarting a node will trigger RM to kill all the containers on this node, but now it won't ?
          

          The patch may cause a compatibility issue. Maybe we can merge the rmNode.getHttpPort() == newNode.getHttpPort() case with the rmNode.getHttpPort() != newNode.getHttpPort() case when there are no running apps.
          Thoughts?

          Naganarasimha Naganarasimha G R added a comment -

          Hi zhihai xu,
          Seems like the JIRA number is wrong, as YARN-3286 is closed as Won't Fix! Are you referring to another of Rohith Sharma K S's JIRAs?

          sunilg Sunil G added a comment -

          I think it's the correct JIRA id. A discussion happened there about removing a node and adding it back again when we get a ReconnectNodeTransition, but that change has an impact on the existing behavior of killing all containers when removing a node.

          Naganarasimha Naganarasimha G R added a comment -

          Thanks for the clarification; I had earlier interpreted zhihai xu's comment wrongly.

          jlowe Jason Lowe added a comment -

          Ah yes, the non-work-preserving NM restart case. The code is assuming that an NM registering without any active apps might be a non-work-preserving NM reconnecting, so we need to explicitly remove the node and add it back in so the scheduler will release any containers that were being tracked on that node.

          At first I thought YARN-3802 had an inherent race in it, in that it assumes the node event will be processed before the capability is updated. That turns out to be the case for the CapacityScheduler, but I think that's a bug in the CapacityScheduler. Note that the node update path appears to have the same issue – RMNodeImpl updates the node's capability before sending the scheduler the node-updated event. So how can it work in that case? It works because, for node update, the CapacityScheduler isn't looking at what the resource was in the RMNode passed in the event. Instead it looks up the scheduler node based on the RMNodeId and references the total capability tracked there. It seems to me the bug here is that the scheduler relies on the RMNode in the event directly, rather than the SchedulerNode, to handle the capability calculation. We probably should have limited a lot of these scheduler events to carrying just the RMNodeId rather than the full RMNode, to avoid the temptation to directly examine the RMNode when handling the event. As seen here, the RMNode can be "moving" while the scheduler is trying to examine it.
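
          (As an aside, here is a self-contained sketch of that principle, using illustrative stub classes rather than the actual patch or YARN code: the scheduler tracks its own snapshot of each node's capability and uses that snapshot for the cluster-resource bookkeeping.)

          import java.util.HashMap;
          import java.util.Map;

          // Illustrative stubs only, NOT YARN classes: the scheduler keeps its own
          // snapshot of each node's capability (the SchedulerNode role) and uses that
          // snapshot when adding/removing the node, instead of re-reading the shared,
          // mutable RMNode carried in the event.
          public class SchedulerSnapshotSketch {
            static class RMNodeStub { volatile int capability; }   // mutable, owned by the RMNode state machine
            static class SchedulerNodeStub {                       // scheduler's immutable snapshot
              final int totalResource;
              SchedulerNodeStub(int totalResource) { this.totalResource = totalResource; }
            }

            private final Map<String, SchedulerNodeStub> nodes = new HashMap<>();
            private int clusterMemory = 0;

            void addNode(String nodeId, RMNodeStub rmNode) {
              SchedulerNodeStub node = new SchedulerNodeStub(rmNode.capability);
              nodes.put(nodeId, node);
              clusterMemory += node.totalResource;   // snapshot taken at add time
            }

            void removeNode(String nodeId) {
              SchedulerNodeStub node = nodes.remove(nodeId);
              if (node == null) {
                return;
              }
              clusterMemory -= node.totalResource;   // old capability, even if the RMNode changed since
            }

            public static void main(String[] args) {
              SchedulerSnapshotSketch scheduler = new SchedulerSnapshotSketch();
              RMNodeStub node64 = new RMNodeStub(); node64.capability = 5632;
              RMNodeStub node63 = new RMNodeStub(); node63.capability = 10240;
              scheduler.addNode("10.0.0.64:30050", node64);
              scheduler.addNode("10.0.0.63:30050", node63);

              // Reconnect with changed capability: even though the RMNode is mutated
              // before the removal is processed, the removal uses the snapshot.
              node64.capability = 10240;
              scheduler.removeNode("10.0.0.64:30050");
              scheduler.addNode("10.0.0.64:30050", node64);

              System.out.println("cluster memory = " + scheduler.clusterMemory);  // 20480
            }
          }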

          zxu zhihai xu added a comment -

          +1 for Jason Lowe's suggestion to fix the issue on the scheduler side. Using SchedulerNode.getTotalResource() instead of RMNode.getTotalCapability() inside the scheduler can better decouple the scheduler from the RMNodeImpl state machine. It may also fix some other potential issues. For example, CapacityScheduler#addNode uses nodeManager.getTotalCapability() after creating the FiCaSchedulerNode; if nodeManager.totalCapability is changed by the RMNodeImpl state machine right after the FiCaSchedulerNode was created, a similar issue may happen.

          vvasudev Varun Vasudev added a comment -

          Thanks for the feedback, everyone. I've uploaded a new patch that uses the SchedulerNode in the capacity scheduler and the fifo scheduler (which has the same issue). I've also fixed the test to not use sleeps.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 7s docker + precommit patch detected.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 3m 4s trunk passed
          +1 compile 0m 21s trunk passed with JDK v1.8.0_60
          +1 compile 0m 23s trunk passed with JDK v1.7.0_79
          +1 checkstyle 0m 12s trunk passed
          +1 mvnsite 0m 30s trunk passed
          +1 mvneclipse 0m 15s trunk passed
          +1 findbugs 1m 8s trunk passed
          +1 javadoc 0m 20s trunk passed with JDK v1.8.0_60
          +1 javadoc 0m 25s trunk passed with JDK v1.7.0_79
          +1 mvninstall 0m 27s the patch passed
          +1 compile 0m 21s the patch passed with JDK v1.8.0_60
          +1 javac 0m 21s the patch passed
          +1 compile 0m 24s the patch passed with JDK v1.7.0_79
          +1 javac 0m 24s the patch passed
          +1 checkstyle 0m 11s the patch passed
          +1 mvnsite 0m 30s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s Patch has no whitespace issues.
          +1 findbugs 1m 15s the patch passed
          +1 javadoc 0m 21s the patch passed with JDK v1.8.0_60
          +1 javadoc 0m 27s the patch passed with JDK v1.7.0_79
          -1 unit 65m 6s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60.
          -1 unit 65m 27s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79.
          +1 asflicense 0m 22s Patch does not generate ASF License warnings.
          142m 52s



          Reason Tests
          JDK v1.8.0_60 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
          JDK v1.7.0_79 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization



          Subsystem Report/Notes
          Docker Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-13
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12772165/YARN-4344.002.patch
          JIRA Issue YARN-4344
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 683bc3b1257c 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build@2/patchprocess/apache-yetus-fa12328/precommit/personality/hadoop.sh
          git revision trunk / cccf884
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9682/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/9682/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/9682/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_60.txt https://builds.apache.org/job/PreCommit-YARN-Build/9682/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_79.txt
          JDK v1.7.0_79 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9682/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Max memory used 229MB
          Powered by Apache Yetus http://yetus.apache.org
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9682/console

          This message was automatically generated.

          vvasudev Varun Vasudev added a comment -

          The test failures are unrelated to the patch.

          leftnoteasy Wangda Tan added a comment -

          Good catch, Varun Vasudev! The fix looks good to me; as zhihai xu commented, we should decouple the RMNode status from the scheduler's view.

          jlowe Jason Lowe added a comment -

          +1, lgtm. Varun Vasudev, could you also put up a patch for 2.6? It doesn't apply there.

          vvasudev Varun Vasudev added a comment -

          Uploaded a version for branch-2.6.

          jlowe Jason Lowe added a comment -

          +1 for branch-2.6 patch as well, committing this.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8864 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8864/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          jlowe Jason Lowe added a comment -

          Thanks to Varun for the contribution and to zhihai and Wangda for additional patch review! I committed this to trunk, branch-2, branch-2.7, and branch-2.6.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk #1439 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1439/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #706 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/706/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2647 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2647/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #717 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/717/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #633 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/633/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2572 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2572/)
          YARN-4344. NMs reconnecting with changed capabilities can lead to wrong (jlowe: rev d36b6e045f317c94e97cb41a163aa974d161a404)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing fix-versions to reflect the same.


            People

            • Assignee: Varun Vasudev
            • Reporter: Varun Vasudev
            • Votes: 0
            • Watchers: 17
