Hadoop YARN / YARN-5837

NPE when getting node status of a decommissioned node after an RM restart

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.3, 3.0.0-alpha1
    • Fix Version/s: 2.8.0, 2.7.4, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None

      Description

      If you decommission a node, the yarn node command shows it like this:

      >> bin/yarn node -list -all
      2016-11-04 08:54:37,169 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032
      Total Nodes:1
               Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
      192.168.1.69:57560	 DECOMMISSIONED	192.168.1.69:8042	                           0
      

      And a full report like this:

      >> bin/yarn node -status 192.168.1.69:57560
      2016-11-04 08:55:08,928 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032
      Node Report :
      	Node-Id : 192.168.1.69:57560
      	Rack : /default-rack
      	Node-State : DECOMMISSIONED
      	Node-Http-Address : 192.168.1.69:8042
      	Last-Health-Update : Fri 04/Nov/16 08:53:58:802PDT
      	Health-Report :
      	Containers : 0
      	Memory-Used : 0MB
      	Memory-Capacity : 8192MB
      	CPU-Used : 0 vcores
      	CPU-Capacity : 8 vcores
      	Node-Labels :
      	Resource Utilization by Node :
      	Resource Utilization by Containers : PMem:0 MB, VMem:0 MB, VCores:0.0
      

      If you then restart the ResourceManager, you get this report:

      >> bin/yarn node -list -all
      2016-11-04 08:57:18,512 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032
      Total Nodes:4
               Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
       192.168.1.69:-1	 DECOMMISSIONED	  192.168.1.69:-1	                           0
      

      And when you try to get the full report on the now "-1" node, you get an NPE:

      >> bin/yarn node -status 192.168.1.69:-1
      2016-11-04 08:57:57,385 INFO client.RMProxy: Connecting to ResourceManager at 0.0.0.0/0.0.0.0:8032
      Exception in thread "main" java.lang.NullPointerException
      	at org.apache.hadoop.yarn.client.cli.NodeCLI.printNodeStatus(NodeCLI.java:296)
      	at org.apache.hadoop.yarn.client.cli.NodeCLI.run(NodeCLI.java:116)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
      	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
      	at org.apache.hadoop.yarn.client.cli.NodeCLI.main(NodeCLI.java:63)
      
      1. YARN-5837.branch-2.7.001.patch
        6 kB
        Robert Kanter
      2. YARN-5837.001.patch
        6 kB
        Robert Kanter

          Activity

          rkanter Robert Kanter added a comment -

          The problem is due to this code in the NodesListManager:

            private void setDecomissionedNMs() {
              Set<String> excludeList = hostsReader.getExcludedHosts();
              for (final String host : excludeList) {
                NodeId nodeId = createUnknownNodeId(host);
                RMNodeImpl rmNode = new RMNodeImpl(nodeId,
                    rmContext, host, -1, -1, new UnknownNode(host), null, null);
                rmContext.getInactiveRMNodes().put(nodeId, rmNode);
                rmNode.handle(new RMNodeEvent(nodeId, RMNodeEventType.DECOMMISSION));
              }
            }
          

          The node's Resource capability is passed to the RMNodeImpl constructor as null. After the RM restart, we no longer know the node's resource capabilities because the node is down, but reporting 0 resources is more reasonable than null, and should fix both the CLI and the API.
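
          As a rough illustration of the failure mode described above, the sketch below uses hypothetical stand-in classes (Capability, NodeReportSketch) rather than the real Hadoop types. It only assumes that the report formatter dereferences the node's capability without a null check, which is consistent with the stack trace but is not copied from NodeCLI.

```java
// Hypothetical stand-in for YARN's Resource record; not the real Hadoop class.
final class Capability {
    final long memoryMb;
    final int vcores;

    Capability(long memoryMb, int vcores) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
    }
}

// Hypothetical stand-in for the node report that the CLI prints.
final class NodeReportSketch {
    final String nodeId;
    final Capability capability; // null models the placeholder node built after an RM restart

    NodeReportSketch(String nodeId, Capability capability) {
        this.nodeId = nodeId;
        this.capability = capability;
    }

    // Formats the capacity fields. Assumption: like the CLI, this code
    // dereferences the capability without checking it for null first.
    static String formatCapacity(NodeReportSketch report) {
        return "Memory-Capacity : " + report.capability.memoryMb + "MB, "
            + "CPU-Capacity : " + report.capability.vcores + " vcores";
    }

    public static void main(String[] args) {
        // A zero-sized capability formats without error.
        NodeReportSketch zero = new NodeReportSketch("192.168.1.69:-1", new Capability(0, 0));
        System.out.println(formatCapacity(zero));

        // A null capability reproduces the NullPointerException.
        NodeReportSketch broken = new NodeReportSketch("192.168.1.69:-1", null);
        try {
            System.out.println(formatCapacity(broken));
        } catch (NullPointerException e) {
            System.out.println("NPE while formatting report for " + broken.nodeId);
        }
    }
}
```

          Supplying a zero-sized capability keeps every consumer of the report working without each one needing its own null check.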

          rkanter Robert Kanter added a comment -

          The patch fixes the problem by passing in a Resource object with 0 memory and 0 vcores. It also sets the version to "unknown" instead of "null" so the output reads better. It updates a test as well, and I've verified the fix on a cluster.

          The trunk patch applies cleanly to trunk, branch-2, and branch-2.8 (with some fuzzing by the patch command). The branch-2.7 patch applies to branch-2.7.
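
          The shape of the fix can be sketched as follows. This is not the patch itself: PlaceholderNode and DecommissionedNodesSketch are hypothetical stand-ins for RMNodeImpl and NodesListManager, and the "host:-1" node id format is assumed from the CLI output in the description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an inactive RM node entry; not the real RMNodeImpl.
final class PlaceholderNode {
    final String nodeId;
    final long memoryMb;
    final int vcores;
    final String version;

    PlaceholderNode(String nodeId, long memoryMb, int vcores, String version) {
        this.nodeId = nodeId;
        this.memoryMb = memoryMb;
        this.vcores = vcores;
        this.version = version;
    }
}

// Hypothetical stand-in for the part of NodesListManager that rebuilds
// decommissioned nodes from the exclude list.
final class DecommissionedNodesSketch {
    // The real port is unknown after a restart, so -1 is used as a marker.
    static final int UNKNOWN_PORT = -1;

    // Builds one placeholder per excluded host. Each placeholder carries
    // 0 memory, 0 vcores and the version string "unknown" instead of nulls.
    static Map<String, PlaceholderNode> buildPlaceholders(List<String> excludedHosts) {
        Map<String, PlaceholderNode> inactive = new LinkedHashMap<>();
        for (String host : excludedHosts) {
            String nodeId = host + ":" + UNKNOWN_PORT;
            inactive.put(nodeId, new PlaceholderNode(nodeId, 0L, 0, "unknown"));
        }
        return inactive;
    }
}
```

          Calling buildPlaceholders with a list containing "192.168.1.69" yields a single entry keyed by "192.168.1.69:-1" whose fields are all safe to print.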

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 26s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
          +1    mvninstall  7m 8s    trunk passed
          +1    compile     0m 32s   trunk passed
          +1    checkstyle  0m 22s   trunk passed
          +1    mvnsite     0m 39s   trunk passed
          +1    mvneclipse  0m 17s   trunk passed
          +1    findbugs    1m 1s    trunk passed
          +1    javadoc     0m 22s   trunk passed
          +1    mvninstall  0m 33s   the patch passed
          +1    compile     0m 31s   the patch passed
          +1    javac       0m 31s   the patch passed
          +1    checkstyle  0m 20s   the patch passed
          +1    mvnsite     0m 42s   the patch passed
          +1    mvneclipse  0m 14s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 7s    the patch passed
          +1    javadoc     0m 20s   the patch passed
          +1    unit        35m 27s  hadoop-yarn-server-resourcemanager in the patch passed.
          +1    asflicense  0m 18s   The patch does not generate ASF License warnings.
                            51m 35s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Issue YARN-5837
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12837203/YARN-5837.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux aa912f2989b7 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / abfc15d
          Default Java 1.8.0_111
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13789/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13789/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          jlowe Jason Lowe added a comment -

          Thanks for the patch! My apologies for missing this when reviewing YARN-3102.

          +1, both patches look good to me. I'll commit these later today if there are no objections.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10775 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10775/)
          YARN-5837. NPE when getting node status of a decommissioned node after (jlowe: rev 6bb741ff0ef208a8628bc64d6537999d4cd67955)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
          jlowe Jason Lowe added a comment -

          Thanks, Robert Kanter! I committed this to trunk, branch-2, branch-2.8, and branch-2.7.

          rkanter Robert Kanter added a comment -

          Thanks for the quick review!

          Cyl Yeliang Cang added a comment -

          Thanks!


            People

            • Assignee:
              rkanter Robert Kanter
              Reporter:
              rkanter Robert Kanter