Details

    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      As proposed in YARN-914, a new "DECOMMISSIONING" state will be added. A node transitions into it from the "running" state when the new "decommissioning" event is triggered.
      From this state, the node transitions to "decommissioned" on a Resource_Update if there are no running apps on the NM, when the NM reconnects after a restart, or when a DECOMMISSIONED event is received (after the timeout tracked by the CLI).
      In addition, the node can go back to "running" if the user decides to cancel the pending decommission by recommissioning the same node. The reaction to other events is similar to that of the RUNNING state.
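
      For illustration, here is a minimal, self-contained sketch of the intended state flow. This is not the actual RMNodeImpl state machine (which is built with Hadoop's StateMachineFactory); the class, enum, and event names below are simplified stand-ins, and the graceful-decommission event is named after the GRACEFUL_DECOMMISSION event type adopted later in this JIRA.

          import java.util.HashMap;
          import java.util.Map;

          // Illustrative sketch only: a tiny transition table capturing the states and
          // events discussed in this JIRA, not the real RMNodeImpl implementation.
          public class DecommissioningSketch {

            enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

            enum NodeEvent { GRACEFUL_DECOMMISSION, RECOMMISSION, DECOMMISSION }

            private static final Map<String, NodeState> TRANSITIONS = new HashMap<>();
            static {
              // RUNNING -> DECOMMISSIONING on the new graceful-decommission event.
              TRANSITIONS.put(key(NodeState.RUNNING, NodeEvent.GRACEFUL_DECOMMISSION),
                  NodeState.DECOMMISSIONING);
              // DECOMMISSIONING -> RUNNING if the admin recommissions the node.
              TRANSITIONS.put(key(NodeState.DECOMMISSIONING, NodeEvent.RECOMMISSION),
                  NodeState.RUNNING);
              // DECOMMISSIONING -> DECOMMISSIONED when the CLI-tracked timeout fires
              // (in the real patch, also when no apps remain or the NM reconnects).
              TRANSITIONS.put(key(NodeState.DECOMMISSIONING, NodeEvent.DECOMMISSION),
                  NodeState.DECOMMISSIONED);
            }

            private static String key(NodeState s, NodeEvent e) {
              return s + "/" + e;
            }

            /** Returns the next state, or the current state if the event is ignored. */
            public static NodeState transition(NodeState current, NodeEvent event) {
              return TRANSITIONS.getOrDefault(key(current, event), current);
            }

            public static void main(String[] args) {
              // RUNNING + GRACEFUL_DECOMMISSION -> DECOMMISSIONING
              System.out.println(transition(NodeState.RUNNING, NodeEvent.GRACEFUL_DECOMMISSION));
            }
          }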

      Attachments

      1. YARN-3212-v6.patch
        46 kB
        Junping Du
      2. YARN-3212-v6.2.patch
        44 kB
        Junping Du
      3. YARN-3212-v6.1.patch
        44 kB
        Junping Du
      4. YARN-3212-v5.patch
        41 kB
        Junping Du
      5. YARN-3212-v5.1.patch
        41 kB
        Junping Du
      6. YARN-3212-v4.patch
        26 kB
        Junping Du
      7. YARN-3212-v4.1.patch
        26 kB
        Junping Du
      8. YARN-3212-v3.patch
        28 kB
        Junping Du
      9. YARN-3212-v2.patch
        27 kB
        Junping Du
      10. YARN-3212-v1.patch
        25 kB
        Junping Du
      11. RMNodeImpl - new.png
        127 kB
        Junping Du

        Issue Links

          Activity

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2331 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2331/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #392 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/392/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #1151 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/1151/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2357 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2357/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #418 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/418/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #410 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/410/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #8482 (See https://builds.apache.org/job/Hadoop-trunk-Commit/8482/)
          YARN-3212. RMNode State Transition Update with DECOMMISSIONING state. (Junping Du via wangda) (wangda: rev 9bc913a35c46e65d373c3ae3f01a377e16e8d0ca)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestNodesPage.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeEventType.java
          leftnoteasy Wangda Tan added a comment -

          Committed to trunk/branch-2. Thanks Junping Du for the contribution, and thanks to Jason Lowe, Sunil G, and Rohith Sharma K S for the reviews!

          leftnoteasy Wangda Tan added a comment -

          Patch looks good, thanks Junping Du. Will commit in a few days if there are no objections.

          hadoopqa Hadoop QA added a comment -



          +1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 33s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 7m 49s There were no new javac warning messages.
          +1 javadoc 10m 8s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 49s There were no new checkstyle issues.
          +1 whitespace 0m 8s The patch has no lines that end in whitespace.
          +1 install 1m 30s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 29s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          +1 yarn tests 59m 12s Tests passed in hadoop-yarn-server-resourcemanager.
              98m 39s  



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12756224/YARN-3212-v6.2.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / bf2f2b4
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9167/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9167/testReport/
          Java 1.7.0_55
          uname Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9167/console

          This message was automatically generated.

          djp Junping Du added a comment -

          The unit test failure appears to be unrelated. The v6.2 patch fixes the whitespace issue.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 17m 7s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 7m 57s There were no new javac warning messages.
          +1 javadoc 10m 14s There were no new javadoc warning messages.
          +1 release audit 0m 24s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 50s There were no new checkstyle issues.
          -1 whitespace 0m 9s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 30s mvn install still works.
          +1 eclipse:eclipse 0m 34s The patch built with eclipse:eclipse.
          +1 findbugs 1m 30s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 49m 57s Tests failed in hadoop-yarn-server-resourcemanager.
              90m 17s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
            hadoop.yarn.server.resourcemanager.TestRMRestart
            hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter
          Timed out tests org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCResponseId



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12756037/YARN-3212-v6.1.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / b2017d9
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/9146/artifact/patchprocess/whitespace.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/9146/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/9146/testReport/
          Java 1.7.0_55
          uname Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9146/console

          This message was automatically generated.

          djp Junping Du added a comment -

          YARN-313 was just committed, which caused a patch conflict. Rebased the patch on the current trunk.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          -1 patch 0m 0s The patch command could not apply the patch during dryrun.



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12756009/YARN-3212-v6.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 73e3a49
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/9144/console

          This message was automatically generated.

          djp Junping Du added a comment -

          Updated the patch (v6) with the following changes to address comments from Wangda and Sunil:
          1. Remove an unnecessary debug log.
          2. Rename RMNodeEventType.DECOMMISSION_WITH_TIMEOUT -> RMNodeEventType.GRACEFUL_DECOMMISSION.
          3. Transition from Unhealthy to Decommissioning when a GRACEFUL_DECOMMISSION event is received, and keep the node in Decommissioning when a node-unhealthy update is received (see the sketch below).
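
          For item 3, the following is a hedged, self-contained sketch of the intended behavior only; the class, enum, and method names are illustrative and are not the code in the v6 patch.

            // Sketch of v6 items 2-3; names are illustrative, not the patch's code.
            class GracefulDecommissionV6Sketch {
              enum State { RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED }
              enum Event { GRACEFUL_DECOMMISSION, UNHEALTHY_STATUS_UPDATE }

              static State onEvent(State current, Event event) {
                // An UNHEALTHY node moves to DECOMMISSIONING when graceful decommission is requested.
                if (current == State.UNHEALTHY && event == Event.GRACEFUL_DECOMMISSION) {
                  return State.DECOMMISSIONING;
                }
                // A DECOMMISSIONING node stays in DECOMMISSIONING on an unhealthy status update.
                if (current == State.DECOMMISSIONING && event == Event.UNHEALTHY_STATUS_UPDATE) {
                  return State.DECOMMISSIONING;
                }
                return current;
              }
            }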

          djp Junping Du added a comment -

          Thanks Wangda Tan for review and comments!

          1. Why shut down a "decommissioning" NM if it is still heartbeating? Should we allow it to continue heartbeating, since the RM needs to know about container finished/killed information?

          We don't shut down a "decommissioning" NM. On the contrary, we differentiate decommissioning nodes from the other nodes that get false from the nodesListManager.isValidNode() check, so a decommissioning node keeps running instead of being decommissioned.
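
          To make the intent concrete, here is a hedged, self-contained sketch of that check; the real logic lives in ResourceTrackerService and NodesListManager, and the class and method below are simplified stand-ins rather than the patch's code.

            // Illustrative sketch only - not the actual ResourceTrackerService code.
            final class HeartbeatSketch {
              /**
               * @param isValidNode       result of nodesListManager.isValidNode(host):
               *                          false once the host appears in the exclude list
               * @param isDecommissioning true while the node is in the DECOMMISSIONING state
               */
              static boolean shouldShutDown(boolean isValidNode, boolean isDecommissioning) {
                // A gracefully decommissioning node fails the isValidNode() check, but we keep
                // it heartbeating so the RM still receives container finished/killed reports.
                return !isValidNode && !isDecommissioning;
              }

              public static void main(String[] args) {
                System.out.println(shouldShutDown(false, true));  // false: keep the node running
                System.out.println(shouldShutDown(false, false)); // true: normal decommission shuts it down
              }
            }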

          2. Do we have a timeout for graceful decommission, which will update a node to "DECOMMISSIONED" after the timeout?

          There was some discussion in the umbrella JIRA (https://issues.apache.org/jira/browse/YARN-914?focusedCommentId=14314653&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14314653), and we decided to track the timeout in the CLI instead of the RM. The CLI patch (YARN-3225) also reflects that.

          3. If I understand correctly, decommissioning is another running state, except that we cannot allocate any new containers to it.

          Exactly. Another difference is that the available resource should be updated as each running container finishes.

          If the answer to question #2 is no, I suggest renaming RMNodeEventType.DECOMMISSION_WITH_TIMEOUT to GRACEFUL_DECOMMISSION, since it doesn't have a "real" timeout.

          As replied above, we do support a timeout via the CLI. DECOMMISSION_WITH_TIMEOUT sounds clearer compared with the old DECOMMISSION event. Thoughts?

          Why is this needed? .addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONING, RMNodeEventType.DECOMMISSION_WITH_TIMEOUT, new DecommissioningNodeTransition(NodeState.DECOMMISSIONING))

          Without this transition, an InvalidStateTransitionException would be thrown by our state machine, which is not right for a normal operation.

          Should we simply ignore the DECOMMISSION_WITH_TIMEOUT event?

          No. The RM should be aware of this event so that it can later make more precise updates to the available resource, etc. (YARN-3223).

          Are there specific considerations behind transferring UNHEALTHY to DECOMMISSIONED when DECOMMISSION_WITH_TIMEOUT is received? Is it better to transfer it to DECOMMISSIONING since it may have containers running on it?

          I don't have a strong preference in this case. However, my previous consideration was that the UNHEALTHY event comes from the machine monitor, which indicates the node is not really suitable for containers to keep running on, while DECOMMISSION_WITH_TIMEOUT comes from a user who prefers to decommission a batch of nodes without affecting apps/containers that are currently running normally. So making the node decommissioned directly seems the simpler way until we have more operational experience with this new feature. I have a similar view in the discussion above about the UNHEALTHY event on a decommissioning node (https://issues.apache.org/jira/browse/YARN-3212?focusedCommentId=14693360&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14693360). Maybe we can revisit this later?

          One suggestion on how to handle the node update to the scheduler: I think you can add an "isDecommissioning" field to NodeUpdateSchedulerEvent, and the scheduler can do all updates except allocating containers.

          Thanks for the good suggestion. YARN-3223 will handle balancing the NM's total resource and used resource (so the available resource is always 0), so using a new scheduler event this way could be one option to keep the NM resource balanced. There are other options too, so I think we can move that discussion to that JIRA.
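
          If that suggestion were adopted, a minimal sketch could look like the following; the event class, flag, and helper methods are hypothetical stand-ins (not NodeUpdateSchedulerEvent itself), and the actual resource-balancing design is deferred to YARN-3223.

            // Hypothetical sketch of the suggestion above; not code from this patch.
            final class NodeUpdateSketch {

              /** Simplified stand-in for a node-update event carrying a decommissioning flag. */
              static class NodeUpdateEvent {
                final String nodeId;
                final boolean decommissioning;
                NodeUpdateEvent(String nodeId, boolean decommissioning) {
                  this.nodeId = nodeId;
                  this.decommissioning = decommissioning;
                }
              }

              /** Scheduler-side handling: do the usual bookkeeping, but skip new allocations. */
              static void handleNodeUpdate(NodeUpdateEvent event) {
                updateCompletedContainers(event.nodeId);   // hypothetical: release finished containers
                if (!event.decommissioning) {
                  allocateContainersOn(event.nodeId);      // hypothetical: only non-decommissioning nodes get new work
                }
              }

              static void updateCompletedContainers(String nodeId) { /* omitted */ }
              static void allocateContainersOn(String nodeId) { /* omitted */ }
            }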

          leftnoteasy Wangda Tan added a comment -

          Hi Junping Du,

          Thanks for working on this JIRA, just took a look at it:

          1) ResourceTrackerService:
          Question:
          1. Why shut down a "decommissioning" NM if it is still heartbeating? Should we allow it to continue heartbeating, since the RM needs to know about container finished/killed information?

          2) RMNodeImpl:
          Question:
          2. Do we have a timeout for graceful decommission, which will update a node to "DECOMMISSIONED" after the timeout?
          3. If I understand correctly, decommissioning is another running state, except:

          • We cannot allocate any new containers to it.

          Comments:

          • If the answer to question #2 is no, I suggest renaming RMNodeEventType.DECOMMISSION_WITH_TIMEOUT to GRACEFUL_DECOMMISSION, since it doesn't have a "real" timeout.
          • Why is this needed?
                  .addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONING,
                      RMNodeEventType.DECOMMISSION_WITH_TIMEOUT,
                      new DecommissioningNodeTransition(NodeState.DECOMMISSIONING))
            

            Should we simply ignore the DECOMMISSION_WITH_TIMEOUT event?

          • Are there specific considerations behind transferring UNHEALTHY to DECOMMISSIONED when DECOMMISSION_WITH_TIMEOUT is received? Is it better to transfer it to DECOMMISSIONING since it may have containers running on it?
          • One suggestion on how to handle the node update to the scheduler: I think you can add an "isDecommissioning" field to NodeUpdateSchedulerEvent, and the scheduler can do all updates except allocating containers.
          djp Junping Du added a comment -

          Thanks for the review and comments, Sunil. I will remove the debug message in the next patch.

          sunilg Sunil G added a comment -

          Thanks Junping Du for the detailed thoughts. I agree with your intention of having a simpler and cleaner implementation. As you mentioned, once we study and learn from more use-case scenarios, we can consider this case and discuss it further.

          Other than that, overall the patch looks good to me. Maybe you can remove the few info logs added for debugging purposes, like the one below, if they are not needed.

          LOG.info("XX4: Deactivating Node.");
          

          I will also take a look at the test cases. Thank you.

          djp Junping Du added a comment -

          Thanks Sunil G for the comments! I agree it is not a bad idea to give a decommissioning node that is merely UNHEALTHY more chances to recover. However, it would involve more complexity: how many rounds should we wait (a heartbeat count or a time, with a separate configuration?), an additional state for a node that is both decommissioning and unhealthy, etc. We should evaluate whether it is worth it once we have hands-on experience with this new feature. In practice, I have rarely seen nodes return to a healthy state quickly (unless someone logs in and fixes them immediately) - that is, within the timeout.
          Thus, I would prefer to keep the current transition, which is slightly aggressive but a good trade-off for simplicity at this moment. I can add a TODO in a later patch (if there are other outstanding issues from the comments) to think more about this once we have more experience. Does that make sense?

          sunilg Sunil G added a comment -

          Hi Junping Du
          I have one doubt about this. In StatusUpdateWhenHealthyTransition, if the node's state is DECOMMISSIONING initially, we now move it to DECOMMISSIONED directly.
          Could we give it a chance to move to UNHEALTHY here, so that after some rounds we can mark it as DECOMMISSIONED if it cannot be revived? Your thoughts?

          djp Junping Du added a comment -

          Can someone give it a review? With this patch in, the basic flow for graceful decommission works. Thanks!

          djp Junping Du added a comment -

          The test failures are unrelated and are tracked in YARN-4035. The v5.1 patch is ready for review.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 8s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 7m 41s There were no new javac warning messages.
          +1 javadoc 9m 37s There were no new javadoc warning messages.
          +1 release audit 0m 23s The applied patch does not increase the total number of release audit warnings.
          +1 checkstyle 0m 49s There were no new checkstyle issues.
          +1 whitespace 0m 7s The patch has no lines that end in whitespace.
          +1 install 1m 20s mvn install still works.
          +1 eclipse:eclipse 0m 32s The patch built with eclipse:eclipse.
          +1 findbugs 1m 28s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 52m 43s Tests failed in hadoop-yarn-server-resourcemanager.
              90m 51s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
            hadoop.yarn.server.resourcemanager.TestRMAdminService



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12749618/YARN-3212-v5.1.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 8f73bdd
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8813/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8813/testReport/
          Java 1.7.0_55
          uname Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8813/console

          This message was automatically generated.

          djp Junping Du added a comment -

          Sorry. A typo. I meant TestRMAdminService.

          Thanks for pointing this out, Sunil G! Looking at that one now.

          sunilg Sunil G added a comment -

          Sorry. A typo. I meant TestRMAdminService.

          sunilg Sunil G added a comment -

          Hi Junping Du
          Looks like the test case failures are due to TestRMContainerAllocation. A ticket has already been raised for this: YARN-4035.

          djp Junping Du added a comment -

          Fixed the checkstyle/whitespace issues in the v5.1 patch. Checked that the test failures are unrelated; they are due to YARN-4019, which went in recently. Will file a separate JIRA to fix them.

          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 22s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 7m 49s There were no new javac warning messages.
          +1 javadoc 9m 43s There were no new javadoc warning messages.
          +1 release audit 0m 21s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 48s The applied patch generated 1 new checkstyle issues (total was 107, now 68).
          -1 whitespace 0m 4s The patch has 11 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 19s mvn install still works.
          +1 eclipse:eclipse 0m 31s The patch built with eclipse:eclipse.
          +1 findbugs 1m 26s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 52m 22s Tests failed in hadoop-yarn-server-resourcemanager.
              90m 50s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
            hadoop.yarn.server.resourcemanager.TestRMAdminService



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12749596/YARN-3212-v5.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 8f73bdd
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8811/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/8811/artifact/patchprocess/whitespace.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8811/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8811/testReport/
          Java 1.7.0_55
          uname Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8811/console

          This message was automatically generated.

          djp Junping Du added a comment -

Updated the patch to v5, fixing the following issues:

          • The unit test failure
          • The NM-RM heartbeat should not only check whether the node is valid (i.e., whether it is in the decommission file list) but also whether the node is in the decommissioning stage (see the sketch below)
          • Checkstyle and whitespace issues
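          A minimal sketch of the heartbeat check in the second bullet (the names are hypothetical, not the actual ResourceTrackerService/NodesListManager code): a heartbeat from a node that appears in the decommission list should still be served while the node is draining in DECOMMISSIONING, and only rejected once it is fully DECOMMISSIONED.

              public class HeartbeatAdmissionSketch {
                enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

                // Hypothetical helper: 'excluded' stands for membership in the
                // decommission (exclude) file list.
                static boolean shouldAcceptHeartbeat(NodeState state, boolean excluded) {
                  if (!excluded) {
                    return true; // a normal, valid node
                  }
                  // Excluded but still draining: keep serving heartbeats so running
                  // apps/containers can finish before the node becomes DECOMMISSIONED.
                  return state == NodeState.DECOMMISSIONING;
                }
              }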
          hadoopqa Hadoop QA added a comment -



          -1 overall



          Vote Subsystem Runtime Comment
          0 pre-patch 16m 13s Pre-patch trunk compilation is healthy.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 tests included 0m 0s The patch appears to include 2 new or modified test files.
          +1 javac 7m 49s There were no new javac warning messages.
          +1 javadoc 9m 43s There were no new javadoc warning messages.
          +1 release audit 0m 22s The applied patch does not increase the total number of release audit warnings.
          -1 checkstyle 0m 49s The applied patch generated 22 new checkstyle issues (total was 87, now 109).
          -1 whitespace 0m 2s The patch has 35 line(s) that end in whitespace. Use git apply --whitespace=fix.
          +1 install 1m 22s mvn install still works.
          +1 eclipse:eclipse 0m 33s The patch built with eclipse:eclipse.
          +1 findbugs 1m 26s The patch does not introduce any new Findbugs (version 3.0.0) warnings.
          -1 yarn tests 51m 46s Tests failed in hadoop-yarn-server-resourcemanager.
              90m 8s  



          Reason Tests
          Failed unit tests hadoop.yarn.server.resourcemanager.TestRMNodeTransitions



          Subsystem Report/Notes
          Patch URL http://issues.apache.org/jira/secure/attachment/12746591/YARN-3212-v4.1.patch
          Optional Tests javadoc javac unit findbugs checkstyle
          git revision trunk / 4025326
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/8618/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt
          whitespace https://builds.apache.org/job/PreCommit-YARN-Build/8618/artifact/patchprocess/whitespace.txt
          hadoop-yarn-server-resourcemanager test log https://builds.apache.org/job/PreCommit-YARN-Build/8618/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/8618/testReport/
          Java 1.7.0_55
          uname Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/8618/console

          This message was automatically generated.

          djp Junping Du added a comment -

          Devaraj K, sure. Updated the patch against the latest trunk.

          devaraj.k Devaraj K added a comment -

          Junping Du, can you update the patch for this JIRA now that YARN-3445 has been committed, so that we can see this feature working?

          djp Junping Du added a comment -

          That makes sense. I have added a TODO in the v4 patch for now to address this scenario.

          djp Junping Du added a comment -

          Updated the patch to rebase on the current trunk (YARN-3225 went in with duplicated code). It is based on YARN-3445; I will mark the patch as available when that JIRA goes in.

          jlowe Jason Lowe added a comment -

          releasing an unlaunched container is pretty cheap, which could be better than waiting for the container to execute from the beginning

          One could argue that the whole point of the graceful decommission is to avoid container failures, and this would be a container failure from the perspective of the AM. In that sense we should honor the container if we already handed it out to the AM (i.e., the RMContainerImpl instance is in the ACQUIRED state). We should be able to turn off scheduling for the node and then, after doing so, query the scheduler to see which containers are still active on that node.
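          As a rough sketch of that flow (the helper names below are hypothetical, not the real scheduler API): stop placing new containers on the node first, then treat anything already handed out, including ACQUIRED containers, as work to honor before completing the decommission.

              import java.util.List;

              public class GracefulDrainSketch {
                // Hypothetical facade over the scheduler; the real API differs.
                interface SchedulerView {
                  void stopSchedulingOn(String nodeId);                  // no new placements
                  List<String> liveOrAcquiredContainers(String nodeId);  // running + ACQUIRED
                }

                // The node can finish decommissioning only once nothing remains
                // that was already promised to an AM.
                static boolean readyToDecommission(SchedulerView scheduler, String nodeId) {
                  scheduler.stopSchedulingOn(nodeId);
                  return scheduler.liveOrAcquiredContainers(nodeId).isEmpty();
                }
              }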

          djp Junping Du added a comment -

          we also need to verify that the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet, rather than only checking application status.

          Just thought about this problem again. The other option is that we can still go ahead and mark this node as decommissioned, but keep the AM and RM in sync on the same page.
          It depends on how we understand the word "graceful" here: if it means making node decommissioning less expensive, then this case should fall into that category, as releasing an unlaunched container is pretty cheap, which could be better than waiting for the container to execute from the beginning; if we think it means a clean scheduling flow and clean log messages (at least within the timeout), we should probably wait for the container to launch.
          Thoughts?

          djp Junping Du added a comment -

          In addition, per Jason Lowe's comments in YARN-3535 (https://issues.apache.org/jira/browse/YARN-3535?focusedCommentId=14509153&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14509153), we also need to verify that the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet, rather than only checking application status.

          rohithsharma Rohith Sharma K S added a comment -

          This should be changed after this patch

          I see... I missed this state transition while looking at the patch!

          djp Junping Du added a comment -

          Thanks Rohith Sharma K S for the comments!

          I went through the design doc and the approach looks good to me. Let me know if any clarification is required.

          Sounds good. Thanks!

          Because the RECONNECTED event can be triggered only when the node state is RUNNING|UNHEALTHY.

          This should be changed after this patch, because the node (NM daemon) could be shut down and restarted while in the decommissioning stage, and reconnecting to the RM will then go through this state transition.

          BTW, I would suggest holding off on reviewing this patch for now, as it depends on YARN-3445 (NM heartbeating to the RM with running apps). Also, YARN-3225 is almost ready to go in, so a rebase may be needed.

          rohithsharma Rohith Sharma K S added a comment -

          Hi Junping Du, thanks for working on this improvement.
          I went through the design doc and the approach looks good to me. Let me know if any clarification is required.

          Apologies for the delayed review. One comment on the patch:

          1. In ReconnectNodeTransition, it is not necessary to check for the DECOMMISSIONING state and the related handling, because the RECONNECTED event can be triggered only when the node state is RUNNING|UNHEALTHY:
            +        if (rmNode.getState() == NodeState.DECOMMISSIONING) {
            +          // When node in decommissioning, and no running Apps on this node,
            +          // it will return as decommissioned state.
            +          deactivateNode(rmNode, NodeState.DECOMMISSIONED);
            +          return NodeState.DECOMMISSIONED;
            +        }
            
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12706943/YARN-3212-v3.patch
          against trunk revision af618f2.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestApplicationCleanup
          org.apache.hadoop.yarn.server.resourcemanager.TestRM
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
          org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
          org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
          org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterLauncher
          org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation
          org.apache.hadoop.yarn.server.resourcemanager.TestKillApplicationWithRMHA
          org.apache.hadoop.yarn.server.resourcemanager.TestApplicationMasterService
          org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
          org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
          org.apache.hadoop.yarn.server.resourcemanager.TestRMHA

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7122//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7122//console

          This message is automatically generated.

          djp Junping Du added a comment -

          The test failure is not related to the patch here and only happens intermittently. The same failure happened in YARN-3258 and YARN-3204 (found from the search history, but the log is not available now; I have manually kicked off the Jenkins test again).
          Will file a separate JIRA to track the test failure.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12706943/YARN-3212-v3.patch
          against trunk revision 51f1f49.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7092//testReport/
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7092//console

          This message is automatically generated.

          djp Junping Du added a comment -

          Updated the patch to address the review comments above, including:

          • Proper handling of the case where a node (in decommissioning) reconnects with a different port.
          • Some refactoring work, including merging StatusUpdateWhenHealthyTransition and StatusUpdateWhenDecommissioningTransition.
          djp Junping Du added a comment -

          Thanks Jason Lowe and Ming Ma for review and comments!

          Do we want to handle the DECOMMISSIONING_WITH_TIMEOUT event when the node is already in the DECOMMISSIONING state? Curious if we might get a duplicate decommission event somehow and want to ignore it or if we know for sure this cannot happen in practice.

          This case can happen when a user submits another decommission CLI command while the node is still decommissioning. I think we can just ignore it for now, as nothing needs to be updated if the node is already in DECOMMISSIONING. We will not have timeout tracking and updates on the RM side (we may only pass the timeout to the AM for notification), according to the discussions in YARN-914 and YARN-3225.

          Do we want to consider DECOMMISSIONING nodes as not active? There are containers actively running on them, and in that sense they are participating in the cluster (and contributing to the overall cluster resource). I think they should still be considered active, but I could be persuaded otherwise.

          I think we discussed this on YARN-914 before. The conclusion so far is to keep a node in DECOMMISSIONING as active (not doing so may break some services; I am not 100% sure on this) and to make the node's resource equal to the resources of its assigned containers at any time. Do we need to change this conclusion?
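          For illustration, a sketch of that resource accounting (memory only, with illustrative names): while a node is DECOMMISSIONING, its effective total is shrunk to what is already allocated, so the space available for new containers is zero.

              public class DecommissioningCapacitySketch {
                // While decommissioning, effective total == allocated, so
                // available = total - allocated = 0 for any new request.
                static long effectiveTotalMb(boolean decommissioning,
                                             long configuredTotalMb,
                                             long allocatedMb) {
                  return decommissioning ? allocatedMb : configuredTotalMb;
                }

                public static void main(String[] args) {
                  System.out.println(effectiveTotalMb(true, 8192, 3072));  // 3072
                  System.out.println(effectiveTotalMb(false, 8192, 3072)); // 8192
                }
              }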

          In the reconnected node transition there is a switch statement that will debug-log an unexpected state message when in fact the DECOMMISSIONING state is expected for this transition.

          That's a good point. Will fix it in v3 patch. Thanks!

          Curious why the null check is needed in handleNMContainerStatuses? What about this change allows the container statuses to be null?

          I think so. It looks like the RMNodeReconnectEvent comes from RegisterNodeManagerRequest, and the containerStatuses field (read from the proto) can be null. So there is an NPE bug here, which I found through a unit test that creates the event as "new RMNodeReconnectEvent(node.getNodeID(), node, null, null)", even before this patch. Am I missing something here?
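          A small sketch of the null guard being discussed (generic types and names, not the exact RMNodeImpl code): the container-status list carried by a reconnect event may be null, so it is checked before iterating.

              import java.util.Collections;
              import java.util.List;

              public class ReconnectStatusGuardSketch {
                // Hypothetical handler: tolerate a null list from the reconnect event.
                static List<String> handleNMContainerStatuses(List<String> containerStatuses) {
                  if (containerStatuses == null) {
                    return Collections.emptyList(); // nothing reported on reconnect
                  }
                  // ... real code would walk the statuses and clean up completed containers ...
                  return containerStatuses;
                }
              }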

          It would be nice to see some refactoring of the common code between StatusUpdateWhenHealthyTransition, StatusUpdateWhenUnhealthyTransition, and StatusUpdateWhenDecommissioningTransition.

          Yes, I should have done that earlier. Will do it in the v3 patch.

          These changes seem unnecessary?

          These are still necessary because we changed the state transition from a single final state to multiple final states (like the example below), and the interface only accepts an EnumSet; a standalone sketch of this follows the snippet.

             public static class ReconnectNodeTransition implements
          -      SingleArcTransition<RMNodeImpl, RMNodeEvent> {
          +      MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
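          A standalone illustration of that point (plain Java, not the Hadoop yarn.state API): once a transition can end in more than one state, the set of possible post-states has to be declared up front, and the transition body returns one member of that set at runtime.

              import java.util.EnumSet;

              public class MultiArcTransitionSketch {
                enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

                // All states the reconnect arc may end in; a single-arc transition
                // would declare exactly one post-state instead.
                static final EnumSet<NodeState> RECONNECT_POST_STATES =
                    EnumSet.of(NodeState.RUNNING, NodeState.DECOMMISSIONED);

                // The transition picks one of the declared post-states at runtime.
                static NodeState reconnectTransition(NodeState current, boolean hasRunningApps) {
                  NodeState next = (current == NodeState.DECOMMISSIONING && !hasRunningApps)
                      ? NodeState.DECOMMISSIONED : NodeState.RUNNING;
                  assert RECONNECT_POST_STATES.contains(next); // stay within the declared arcs
                  return next;
                }
              }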
          

          Do we need to support the scenario where the NM becomes dead while it is being decommissioned? Say the decommission timeout is 30 minutes larger than the NM liveness timeout. The node drops out of the cluster for some time and rejoins later, all within the decommission timeout. Will YARN show the status as just a dead node, or as {dead, decommissioning}?

          Right now, the node can become LOST (dead) while it is decommissioning. That is no different from a running node getting lost, except that it cannot join back unless the user puts it back through recommission. Does that make sense?

          Seems useful for admins to know about it. If we need that, we can consider two types of NodeState. One is liveness state, one is admin state. Then you will have different combinations.

          We can add the necessary logging to let admins know about it. Are you talking about a scenario like this: the admin puts some nodes into decommissioning with a timeout, an upgrade script does an OS upgrade and finishes with a restart at a random time that could be shorter than the decommissioning timeout, and the admin wants these nodes to join back automatically. But how would YARN know whether the admin wants these nodes back after a restart? An explicit move back to the whitelist (recommission) may still be necessary.

          mingma Ming Ma added a comment -

          Do we want to consider DECOMMISSIONING nodes as not active? There are containers actively running on them, and in that sense they are participating in the cluster (and contributing to the overall cluster resource). I think they should still be considered active, but I could be persuaded otherwise.

          Do we need to support the scenario where the NM becomes dead while it is being decommissioned? Say the decommission timeout is 30 minutes larger than the NM liveness timeout. The node drops out of the cluster for some time and rejoins later, all within the decommission timeout. Will YARN show the status as just a dead node, or as {dead, decommissioning}?

          Seems useful for admins to know about it. If we need that, we can consider two types of NodeState. One is liveness state, one is admin state. Then you will have different combinations.

          jlowe Jason Lowe added a comment -

          Thanks for the patch, Junping. Some comments on the patch from a quick overview of it:

          Do we want to handle the DECOMMISSIONING_WITH_TIMEOUT event when the node is already in the DECOMMISSIONING state? Curious if we might get a duplicate decommission event somehow and want to ignore it or if we know for sure this cannot happen in practice.

          Do we want to consider DECOMMISSIONING nodes as not active? There are containers actively running on them, and in that sense they are participating in the cluster (and contributing to the overall cluster resource). I think they should still be considered active, but I could be persuaded otherwise.

          In the reconnected node transition there is a switch statement that will debug-log an unexpected state message when in fact the DECOMMISSIONING state is expected for this transition.

          Curious why the null check is needed in handleNMContainerStatuses? What about this change allows the container statuses to be null?

          It would be nice to see some refactoring of the common code between StatusUpdateWhenHealthyTransition, StatusUpdateWhenUnhealthyTransition, and StatusUpdateWhenDecommissioningTransition.

          These changes seem unnecessary?

          -     .addTransition(NodeState.RUNNING, NodeState.RUNNING,
          +     .addTransition(NodeState.RUNNING, EnumSet.of(NodeState.RUNNING),
          
          [...]
          
          -     .addTransition(NodeState.UNHEALTHY, NodeState.UNHEALTHY,
          +     .addTransition(NodeState.UNHEALTHY, EnumSet.of(NodeState.UNHEALTHY),
          
          [...]
          
          -        LOG.info("Node " + rmNode.nodeId + " reported UNHEALTHY with details: "
          -            + remoteNodeHealthStatus.getHealthReport());
          +        LOG.info("Node " + rmNode.nodeId + " reported UNHEALTHY with details: " +
          +            remoteNodeHealthStatus.getHealthReport());
          
          djp Junping Du added a comment -

          The findbugs warning is not related to this patch and is tracked in YARN-3204, so the v2 patch is ready for review.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12704811/YARN-3212-v2.patch
          against trunk revision d1eebd9.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6974//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6974//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6974//console

          This message is automatically generated.

          djp Junping Du added a comment -

          Fixed two test failures in the v2 version. The findbugs warnings are not related but belong to the FairScheduler. Will file a separate JIRA to fix them.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12704518/YARN-3212-v1.patch
          against trunk revision 6fdef76.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.webapp.TestNodesPage
          org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6958//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6958//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6958//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall . Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12704518/YARN-3212-v1.patch against trunk revision 6fdef76. +1 @author . The patch does not contain any @author tags. +1 tests included . The patch appears to include 1 new or modified test files. +1 javac . The applied patch does not increase the total number of javac compiler warnings. +1 javadoc . There were no new javadoc warning messages. +1 eclipse:eclipse . The patch built with eclipse:eclipse. -1 findbugs . The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. +1 release audit . The applied patch does not increase the total number of release audit warnings. -1 core tests . The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestNodesPage org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6958//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6958//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6958//console This message is automatically generated.
          Hide
          djp Junping Du added a comment -

          Uploaded the first patch for the core state changes with decommissioning. For RMNodeEventType, I would prefer DECOMMISSION_WITH_DELAY over DECOMMISSION_WITH_TIMEOUT, as per my comments in YARN-3225. I may update this later if that comment gets adopted.

          djp Junping Du added a comment -

          Attached the new state transition diagram for RMNode.


            People

            • Assignee: djp Junping Du
            • Reporter: djp Junping Du
            • Votes: 0
            • Watchers: 15
