Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha2
    • Component/s: nodemanager
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      I was testing the client-side NM graceful decommission and noticed that it was always waiting for the timeout, even if all jobs running on that node (or even the cluster) had already finished.

      For example:

      1. JobA is running with at least one container on NodeA
      2. User runs client-side decom on NodeA at 5:00am with a timeout of 3 hours --> NodeA enters DECOMMISSIONING state
      3. JobA finishes at 6:00am and there are no other jobs running on NodeA
      4. User's client reaches the timeout at 8:00am, and forcibly decommissions NodeA

      NodeA should have decommissioned at 6:00am.
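
The expected behavior in the timeline above can be sketched as a tiny state model. This is only an illustration under stated assumptions: `GracefulDecomSketch`, `Node`, `onHeartbeat`, and the other names are hypothetical and are not the actual RMNodeImpl code. It shows a draining node leaving DECOMMISSIONING as soon as its last application finishes, without waiting for the client's timeout.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the expected behavior; not the real RMNodeImpl.
class GracefulDecomSketch {
  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

  static class Node {
    NodeState state = NodeState.RUNNING;
    final Set<String> runningApps = new HashSet<>();

    // The client requests a graceful decommission; the timeout itself
    // is tracked on the client side and is not modeled here.
    void startGracefulDecommission() {
      if (state == NodeState.RUNNING) {
        state = NodeState.DECOMMISSIONING;
      }
    }

    void appFinished(String appId) {
      runningApps.remove(appId);
    }

    // On each heartbeat, a draining node with no applications left
    // can be decommissioned immediately.
    void onHeartbeat() {
      if (state == NodeState.DECOMMISSIONING && runningApps.isEmpty()) {
        state = NodeState.DECOMMISSIONED;
      }
    }
  }

  public static void main(String[] args) {
    Node nodeA = new Node();
    nodeA.runningApps.add("JobA");
    nodeA.startGracefulDecommission();   // 5:00am in the example
    nodeA.onHeartbeat();
    System.out.println(nodeA.state);     // prints DECOMMISSIONING
    nodeA.appFinished("JobA");           // 6:00am in the example
    nodeA.onHeartbeat();
    System.out.println(nodeA.state);     // prints DECOMMISSIONED
  }
}
```

In this model the node is done at the first heartbeat after JobA finishes, which corresponds to the 6:00am expectation rather than the 8:00am timeout.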

      1. YARN-5566.004.branch-2.8.addendum.patch (3 kB, Robert Kanter)
      2. YARN-5566.004.branch-2.8.patch (14 kB, Robert Kanter)
      3. YARN-5566.004.patch (9 kB, Robert Kanter)
      4. YARN-5566.003.patch (8 kB, Robert Kanter)
      5. YARN-5566.002.patch (2 kB, Robert Kanter)
      6. YARN-5566.001.patch (1 kB, Robert Kanter)

          Activity

          rkanter Robert Kanter added a comment -

          Thanks Karthik Kambatla

          kasha Karthik Kambatla added a comment -

          Thanks for following up on this, Robert.

          +1 on the addendum patch. Checking it in..

          rkanter Robert Kanter added a comment -

          Test failures unrelated (UnknownHostException)

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 1s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 6m 49s branch-2.8 passed
          +1 compile 0m 28s branch-2.8 passed with JDK v1.8.0_101
          +1 compile 0m 32s branch-2.8 passed with JDK v1.7.0_111
          +1 checkstyle 0m 19s branch-2.8 passed
          +1 mvnsite 0m 38s branch-2.8 passed
          +1 mvneclipse 0m 17s branch-2.8 passed
          +1 findbugs 1m 13s branch-2.8 passed
          +1 javadoc 0m 20s branch-2.8 passed with JDK v1.8.0_101
          +1 javadoc 0m 23s branch-2.8 passed with JDK v1.7.0_111
          +1 mvninstall 0m 29s the patch passed
          +1 compile 0m 26s the patch passed with JDK v1.8.0_101
          +1 javac 0m 26s the patch passed
          +1 compile 0m 29s the patch passed with JDK v1.7.0_111
          +1 javac 0m 29s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 19s the patch passed
          +1 javadoc 0m 19s the patch passed with JDK v1.8.0_101
          +1 javadoc 0m 21s the patch passed with JDK v1.7.0_111
          -1 unit 69m 38s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_101.
          -1 unit 70m 56s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_111.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          157m 34s



          Reason Tests
          JDK v1.8.0_101 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens
          JDK v1.7.0_111 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5af2af1
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827669/YARN-5566.004.branch-2.8.addendum.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux a7197c4c8fb5 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / c45f1ec
          Default Java 1.7.0_111
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13054/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_101.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13054/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_111.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13054/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_101.txt https://builds.apache.org/job/PreCommit-YARN-Build/13054/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_111.txt
          JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13054/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13054/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment - - edited

          The addendum branch-2.8 patch adds the missing code to waitForState, and makes a few other trivial changes to make the test code more similar to the original tests in YARN-4676.

          rkanter Robert Kanter added a comment -

          I've discovered that the tests added to TestResourceTrackerService in the branch-2.8 version of the patch have a race condition. If a DECOMMISSIONING node receives the heartbeat to become DECOMMISSIONED, the node might do this quickly enough that by the time the test code goes to check the node's status, it's already gone from the list of nodes, and the test fails because the node is null. This can easily be reproduced by adding a sleep between sending the heartbeat and waiting for the DECOMMISSIONED state.

          I missed a small change to the waitForState method when I borrowed the tests from YARN-4676. This allows the test to also grab nodes from the inactive list of nodes, which is where DECOMMISSIONED nodes would be found.
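
A minimal sketch of the idea behind that waitForState change, assuming hypothetical names throughout: `WaitForStateSketch`, `activeNodes`, `inactiveNodes`, and `lookup` are stand-ins for illustration and are not the real MockRM or RMContext API. The point is only that the lookup falls back to the inactive map, so a node that has already moved there is still found instead of coming back as null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the RM's node bookkeeping; not the real test helper.
class WaitForStateSketch {
  enum NodeState { DECOMMISSIONING, DECOMMISSIONED }

  static final Map<String, NodeState> activeNodes = new ConcurrentHashMap<>();
  static final Map<String, NodeState> inactiveNodes = new ConcurrentHashMap<>();

  // Check the active map first, then fall back to the inactive map,
  // which is where a DECOMMISSIONED node would end up.
  static NodeState lookup(String nodeId) {
    NodeState state = activeNodes.get(nodeId);
    return state != null ? state : inactiveNodes.get(nodeId);
  }

  // Poll until the node reaches the expected state or the timeout expires.
  static boolean waitForState(String nodeId, NodeState expected, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (lookup(nodeId) == expected) {
        return true;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return lookup(nodeId) == expected;
  }

  public static void main(String[] args) {
    activeNodes.put("node1", NodeState.DECOMMISSIONING);
    // Simulate the race: the node is decommissioned and moved to the
    // inactive map before the test gets around to checking it.
    activeNodes.remove("node1");
    inactiveNodes.put("node1", NodeState.DECOMMISSIONED);
    System.out.println(waitForState("node1", NodeState.DECOMMISSIONED, 1000)); // prints true
  }
}
```

With a lookup that only consulted the active map, the same sequence would return null for node1, which matches the failure described above.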

          kasha Karthik Kambatla added a comment -

          Just committed the final patch to branch-2.8. Thanks Robert Kanter for working on this, and Junping Du for your reviews.

          kasha Karthik Kambatla added a comment -

          +1. Committing this.

          rkanter Robert Kanter added a comment -

          Uploaded the same patch but with a name that I hope Jenkins will apply to the 2.8 branch.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 20s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 29s branch-2.8 passed
          +1 compile 0m 28s branch-2.8 passed with JDK v1.8.0_101
          +1 compile 0m 31s branch-2.8 passed with JDK v1.7.0_111
          +1 checkstyle 0m 19s branch-2.8 passed
          +1 mvnsite 0m 38s branch-2.8 passed
          +1 mvneclipse 0m 18s branch-2.8 passed
          +1 findbugs 1m 12s branch-2.8 passed
          +1 javadoc 0m 21s branch-2.8 passed with JDK v1.8.0_101
          +1 javadoc 0m 22s branch-2.8 passed with JDK v1.7.0_111
          +1 mvninstall 0m 30s the patch passed
          +1 compile 0m 24s the patch passed with JDK v1.8.0_101
          +1 javac 0m 24s the patch passed
          +1 compile 0m 28s the patch passed with JDK v1.7.0_111
          +1 javac 0m 28s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 33s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 18s the patch passed
          +1 javadoc 0m 17s the patch passed with JDK v1.8.0_101
          +1 javadoc 0m 20s the patch passed with JDK v1.7.0_111
          -1 unit 69m 21s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_101.
          -1 unit 70m 37s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_111.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          158m 40s



          Reason Tests
          JDK v1.8.0_101 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens
          JDK v1.7.0_111 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5af2af1
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827248/YARN-5566.004.branch-2.8.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b4f350fb2450 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / e84b5c5
          Default Java 1.7.0_111
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_101 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_111
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13019/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_101.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/13019/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_111.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13019/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_101.txt https://builds.apache.org/job/PreCommit-YARN-Build/13019/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_111.txt
          JDK v1.7.0_111 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13019/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/13019/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          Thanks for the reviews Junping Du and Karthik Kambatla. I've just attached a version of the patch for branch-2.8. It steals a few unit tests (with some minor modifications) from YARN-4676, and it deletes the old code from StatusUpdateWhenHealthyTransition for transitioning to DECOMMISSIONED (originally done by YARN-4676) in favor of the new code added by this JIRA.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10387 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10387/)
          YARN-5566. Client-side NM graceful decom is not triggered when jobs (kasha: rev 74f4bae45597f4794e99e33309130ddff647b21f)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          kasha Karthik Kambatla added a comment -

          Thanks Junping Du for the review, and Robert Kanter for the patch. Just committed this to trunk and branch-2.

          Leaving the JIRA open for 2.8 patch.

          djp Junping Du added a comment - - edited

          From the above description, it seems the root cause is that the RM receives a container status after RMApp does the App Finish Transition (which removes the app from runningApplications), and then adds the application back to RMNode's runningApplications but never removes it again. I am not 100% sure, as the RM log is not included.
          Robert Kanter, if you check the timestamps of the call to "runningApplications.add(containerAppId);" (in RMNodeImpl) and of AppFinishedTransition (in RMAppImpl) for the same app when this issue happens, you should reach the same conclusion. The current fix is the right one, as we should always check the application's status in the context before adding it to RMNode's runningApplications.
          +1. 004 patch LGTM. Karthik Kambatla, please feel free to commit it today, or I will commit it tomorrow.
          BTW, the patch for branch-2.8 should be slightly different. Robert, can you deliver one for 2.8 also? Thx!
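
The ordering problem described in this comment can be sketched as follows. This is a hedged illustration, not the committed patch: `RunningAppsSketch`, `appsInContext`, `handleContainerStatus`, and `handleAppFinished` are hypothetical names, and skipping apps that are unknown to the context is an assumption made for the sketch. It shows why checking the application's state before adding it keeps a late container status from re-adding a finished app.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the guard; not the actual RMNodeImpl code.
class RunningAppsSketch {
  enum AppState { RUNNING, FINISHED }

  // Stand-in for the RM context's view of applications.
  static final Map<String, AppState> appsInContext = new ConcurrentHashMap<>();

  static class Node {
    final Set<String> runningApplications = new HashSet<>();

    // Track the app only if the context still considers it alive.
    void handleContainerStatus(String containerAppId) {
      AppState state = appsInContext.get(containerAppId);
      if (state == null || state == AppState.FINISHED) {
        // Late status for an unknown or finished app: do not re-add it.
        // (Treating unknown apps this way is an assumption of this sketch.)
        return;
      }
      runningApplications.add(containerAppId);
    }

    void handleAppFinished(String appId) {
      runningApplications.remove(appId);
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    appsInContext.put("app_1", AppState.RUNNING);
    node.handleContainerStatus("app_1");   // app is tracked while running
    appsInContext.put("app_1", AppState.FINISHED);
    node.handleAppFinished("app_1");       // app removed on finish
    node.handleContainerStatus("app_1");   // late container status arrives
    System.out.println(node.runningApplications.isEmpty()); // prints true
  }
}
```

Without the state check, the last call would put app_1 back into runningApplications with nothing left to remove it, so a draining node would never look idle.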

          djp Junping Du added a comment -

          I'm not exactly sure why this is happening, but from what I can tell, this issue is based on some timing of when things occur, and somehow DECOMMISSIONING makes it more likely to happen.

          Karthik Kambatla, can you hold off on the commit, given that we are not 100% sure this fix is sufficient and free of side effects? I will do more investigation and review today.

          kasha Karthik Kambatla added a comment -

          +1. Will commit this later today.

          Junping Du - could you take a quick look?

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 22s trunk passed
          +1 compile 0m 37s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 46s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 8s trunk passed
          +1 javadoc 0m 23s trunk passed
          +1 mvninstall 0m 36s the patch passed
          +1 compile 0m 31s the patch passed
          +1 javac 0m 31s the patch passed
          +1 checkstyle 0m 21s the patch passed
          +1 mvnsite 0m 36s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 14s the patch passed
          +1 javadoc 0m 22s the patch passed
          +1 unit 38m 24s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          55m 38s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12826224/YARN-5566.004.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux e73f2848e9f0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c4ee691
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12948/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12948/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment - - edited

          Thanks Karthik Kambatla for the review. The 004 patch addresses your feedback.
          (I assume by TestRMNodeTransitions#testGracefulDecommissionWithApp you meant TestRMNodeTransitions#testDecommissioningUnhealthy)

          kasha Karthik Kambatla added a comment -

          Patch makes sense to me. Minor comments:

          1. RMNodeImpl: Nit - the comments in the added code don't add much information. We should remove them or add more detail so they are informative.
                    // no running (and keeping alive) app on this node, get it
                    // decommissioned.
            
          2. TestResourceTrackerService: The second heartbeat from node1 does not need to indicate running containers.
          3. TestRMNodeTransitions#testGracefulDecommissionWithApp: When creating NodeStatus, we don't need to specify ContainerStatus when creating a new ArrayList
          djp Junping Du added a comment -

          Hi Robert Kanter, sorry for the late reply; I am traveling. The above analysis makes sense to me. However, I need a bit more time to check the code. I will review it before EOD tomorrow.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 25s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
          +1 mvninstall 8m 29s trunk passed
          +1 compile 0m 39s trunk passed
          +1 checkstyle 0m 27s trunk passed
          +1 mvnsite 0m 48s trunk passed
          +1 mvneclipse 0m 19s trunk passed
          +1 findbugs 1m 9s trunk passed
          +1 javadoc 0m 23s trunk passed
          +1 mvninstall 0m 36s the patch passed
          +1 compile 0m 34s the patch passed
          +1 javac 0m 34s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 46s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 18s the patch passed
          +1 javadoc 0m 22s the patch passed
          -1 unit 41m 5s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          58m 59s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12825936/YARN-5566.003.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 0310f7af232b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c258171
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12922/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12922/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12922/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12922/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          The 003 patch fixes the unit tests.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 21s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 7m 8s trunk passed
          +1 compile 0m 34s trunk passed
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 39s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 0m 57s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          +1 checkstyle 0m 18s the patch passed
          +1 mvnsite 0m 37s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 1s The patch has no whitespace issues.
          +1 findbugs 1m 5s the patch passed
          +1 javadoc 0m 20s the patch passed
          -1 unit 37m 51s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 15s The patch does not generate ASF License warnings.
          53m 1s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMNodeTransitions
            hadoop.yarn.server.resourcemanager.TestResourceTrackerService



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12825930/YARN-5566.002.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux d4b8891d8451 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / c258171
          Default Java 1.8.0_101
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/12920/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12920/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12920/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12920/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          My test was doing something wrong. After I fixed that, the 001 patch stopped helping (which makes more sense because that code never actually did DECOMMISSIONING --> UNHEALTHY).

          I put back the code that YARN-4676 removed that you mentioned, but tweaked it a little bit and moved it above the getIsNodeHealthy call so that it can transition to DECOMMISSIONED even if a node is UNHEALTHY now.

          I temporarily added a bunch more log statements to help investigate, and saw that sometimes handleContainerStatus (when called from StatusUpdateWhenHealthyTransition) would add an Application to runningApplications, but then nothing ever removed it. This happened way more frequently for DECOMMISSIONING nodes, but I did see it once happen to a normal node. There's a piece of code here that adds an Application to runningApplications if it sees a Container without an Application in runningApplications. I changed this code to call handleRunningAppOnNode instead of simply adding the Application, which basically makes it check that the Application still exists. I'm not exactly sure why this is happening, but from what I can tell, this issue is based on some timing of when things occur, and somehow DECOMMISSIONING makes it more likely to happen.
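
          The stale-entry behavior described above can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not the actual RMNodeImpl code: the class, the `liveApps` set, and the method names are assumptions made for illustration. It only shows why checking that the Application still exists before adding it avoids leaving an entry that nothing ever removes.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified, hypothetical model; not the actual RMNodeImpl code.
final class RunningAppsSketch {
  // Stand-in for the RM's view of applications that still exist.
  private final Set<String> liveApps = new HashSet<>();
  // Stand-in for the node's runningApplications set.
  private final Set<String> runningApplications = new HashSet<>();

  void appStarted(String appId) {
    liveApps.add(appId);
  }

  void appFinished(String appId) {
    liveApps.remove(appId);
    runningApplications.remove(appId);
  }

  // Behavior as described before the change: a container report for an
  // untracked app simply adds the app, even if it has already finished.
  void handleContainerStatusUnchecked(String appId) {
    runningApplications.add(appId);
  }

  // Behavior as described after the change: only add the app if it still exists.
  void handleContainerStatusChecked(String appId) {
    if (liveApps.contains(appId)) {
      runningApplications.add(appId);
    }
  }

  boolean hasRunningApps() {
    return !runningApplications.isEmpty();
  }
}
```

          In this model, a late container report for an already finished app leaves a stale entry with the unchecked variant, so the node would never look idle; the checked variant ignores it.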

          I've attached a 002 patch with the new changes. I ran my test over 150 times with the 002 patch and it worked every time. When I ran my test without the patch (or with the 001 patch, or with just adding the code removed by YARN-4676), it would fail on the first run, except for one time where it failed on the second.

          Junping Du, please take a look.

          rkanter Robert Kanter added a comment -

          I've been super busy and haven't had a chance to look at YARN-4676 again. I've actually been testing the client-side graceful decom without YARN-4676 applied, so the logic that was removed by it is actually still there.

          I agree that UNHEALTHY isn't a state that should occur if the initial state is DECOMMISSIONING; as you pointed out, the code doesn't go to UNHEALTHY from DECOMMISSIONING. That's partly why I was unsure why my patch fixed the problem; the node never goes to UNHEALTHY, so it doesn't make sense why that would help. I'll try to dig into this some more.

          djp Junping Du added a comment -

          Hi Robert Kanter, thanks for reporting this issue. UNHEALTHY is not an expected state in StatusUpdateWhenHealthyTransition if the node's initial state is DECOMMISSIONING. I am not exactly sure how changes like those in the attached patch could help here, but there is indeed something wrong with our latest change.
          First, I think this is a regression caused by YARN-4676 (https://github.com/apache/hadoop/commit/0da69c324dee9baab0f0b9700db1cc5b623f8421#diff-29befd5a1922a2121b26766561f9e447). You can see that we removed the logic that detects app completion in StatusUpdateWhenHealthyTransition. We should add it back (in branch-2.9).
          I assume you ran your test on a build of branch-2, is that right?
          Another issue we should fix here (in either 2.8 or 2.9): if the NM keeps reporting unhealthy after it enters the DECOMMISSIONING state, it stays in DECOMMISSIONING and never gets a chance to move to DECOMMISSIONED.

                if (!remoteNodeHealthStatus.getIsNodeHealthy()) {
                  LOG.info("Node " + rmNode.nodeId +
                      " reported UNHEALTHY with details: " +
                      remoteNodeHealthStatus.getHealthReport());
                  // if a node in decommissioning receives an unhealthy report,
                  // it will keep decommissioning.
                  if (isNodeDecommissioning) {
                    return NodeState.DECOMMISSIONING;
                  } else {
          

          Instead of returning NodeState.DECOMMISSIONING directly, we should also check the running apps on that node and return DECOMMISSIONED if there are none, shouldn't we?
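
          A minimal sketch of the check suggested above, under stated assumptions: the enum, the class, and the `nextState` method are hypothetical stand-ins, not the real RMNodeImpl transition code or the contents of any attached patch. The idea is that a decommissioning node with no running apps moves to DECOMMISSIONED regardless of the reported health, and otherwise stays in DECOMMISSIONING.

```java
import java.util.Set;

// Hypothetical, self-contained model of the suggested decision; not the real transition code.
final class DecommissionSketch {
  enum NodeState { RUNNING, UNHEALTHY, DECOMMISSIONING, DECOMMISSIONED }

  static NodeState nextState(boolean isNodeDecommissioning,
                             boolean isNodeHealthy,
                             Set<String> runningApplications) {
    if (isNodeDecommissioning) {
      // Check running apps before looking at health, so that an unhealthy
      // but idle decommissioning node can still finish decommissioning.
      if (runningApplications.isEmpty()) {
        return NodeState.DECOMMISSIONED;
      }
      // Apps are still running: keep decommissioning, healthy or not.
      return NodeState.DECOMMISSIONING;
    }
    // A node that is not decommissioning follows its health report.
    return isNodeHealthy ? NodeState.RUNNING : NodeState.UNHEALTHY;
  }
}
```

          With this ordering, the scenario in the description (the last job on the node finishes well before the client timeout) would let the node leave DECOMMISSIONING on the next status update instead of waiting for the timeout.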

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 20s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
          +1 mvninstall 7m 15s trunk passed
          +1 compile 0m 31s trunk passed
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 39s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 3s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 29s the patch passed
          +1 javac 0m 29s the patch passed
          +1 checkstyle 0m 17s the patch passed
          +1 mvnsite 0m 36s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 1s the patch passed
          +1 javadoc 0m 18s the patch passed
          +1 unit 38m 28s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          53m 39s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12825630/YARN-5566.001.patch
          JIRA Issue YARN-5566
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ecb9b6923d9d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 27c3b86
          Default Java 1.8.0_101
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12903/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12903/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          rkanter Robert Kanter added a comment -

          When digging into this, I figured that the DECOMMISSIONING node must have thought it was still running apps, and so I added a bunch of extra print statements and saw that this was the case. While all RUNNING nodes had the correct counts, the DECOMMISSIONING node's count went back up somehow. I was looking at the state transitions, and saw that this one

                .addTransition(NodeState.DECOMMISSIONING,
                    EnumSet.of(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONED),
                    RMNodeEventType.STATUS_UPDATE,
                    new StatusUpdateWhenHealthyTransition())
          

          looked different from this analogous one

                .addTransition(NodeState.RUNNING,
                    EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY),
                    RMNodeEventType.STATUS_UPDATE,
                    new StatusUpdateWhenHealthyTransition())
          

          despite calling the same transition. So I tried adding the UNHEALTHY state, and that fixed the problem.

          Junping Du any ideas what's going on here?


            People

            • Assignee:
              rkanter Robert Kanter
              Reporter:
              rkanter Robert Kanter
            • Votes:
              0
              Watchers:
              8

                Development