  1. Hadoop YARN
  2. YARN-4478 [Umbrella] : Track all the Test failures in YARN
  3. YARN-5416

TestRMRestart#testRMRestartWaitForPreviousAMToFinish failed intermittently because it did not wait for SchedulerApplicationAttempt to be stopped

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: test, yarn
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The test failure stack is:
      Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
      Tests run: 54, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 385.338 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
      testRMRestartWaitForPreviousAMToFinish[0](org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 43.134 sec <<< FAILURE!
      java.lang.AssertionError: AppAttempt state is not correct (timedout) expected:<ALLOCATED> but was:<SCHEDULED>
      at org.junit.Assert.fail(Assert.java:88)
      at org.junit.Assert.failNotEquals(Assert.java:743)
      at org.junit.Assert.assertEquals(Assert.java:118)
      at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:86)
      at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:594)
      at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:1008)
      at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:530)

      This is caused by the same issue that was partially fixed in YARN-4968.

      1. YARN-5416.patch
        4 kB
        Junping Du
      2. YARN-5416-v2.patch
        4 kB
        Junping Du

          Activity

          jlowe Jason Lowe added a comment -

          This looks like an exact dup of YARN-1468 which you also filed. Are they actually different?

          hadoopqa Hadoop QA added a comment -
          -1 overall

          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 18s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
          +1    mvninstall  7m 24s   trunk passed
          +1    compile     0m 35s   trunk passed
          +1    checkstyle  0m 22s   trunk passed
          +1    mvnsite     0m 40s   trunk passed
          +1    mvneclipse  0m 18s   trunk passed
          +1    findbugs    1m 2s    trunk passed
          +1    javadoc     0m 22s   trunk passed
          +1    mvninstall  0m 32s   the patch passed
          +1    compile     0m 31s   the patch passed
          +1    javac       0m 31s   the patch passed
          -1    checkstyle  0m 19s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 100 unchanged - 2 fixed = 103 total (was 102)
          +1    mvnsite     0m 36s   the patch passed
          +1    mvneclipse  0m 15s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 9s    the patch passed
          +1    javadoc     0m 19s   the patch passed
          +1    unit        34m 1s   hadoop-yarn-server-resourcemanager in the patch passed.
          +1    asflicense  0m 15s   The patch does not generate ASF License warnings.
                            49m 37s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:9560f25
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12819424/YARN-5416.patch
          JIRA Issue YARN-5416
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 6ea94621e588 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / ecff7d0
          Default Java 1.8.0_91
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12444/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12444/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/12444/console
          Powered by Apache Yetus 0.3.0 http://yetus.apache.org

          This message was automatically generated.

          ebadger Eric Badger added a comment -

          Junping Du, is there any reason why we would only add the waitSchedulerApplicationAttemptStopped call for the first app attempt, but not for the subsequent ones?

          djp Junping Du added a comment -

          "This looks like an exact dup of YARN-1468 which you also filed. Are they actually different?"

          Oh, no. YARN-1468 is a very old JIRA that fell off my radar for some reason (I didn't notice the recent comments from Eric there). I think we can close this as dup of that. What do you think?

          "Junping Du, is there any reason why we would only add the waitSchedulerApplicationAttemptStopped call for the first app attempt, but not for the subsequent ones?"

          Hi Eric, this just follows the pattern we applied in YARN-4968, which seems only necessary to wait before launching another AM immediately; that is exactly where the exception happens. Do you think there are other places where we should add it?

          jlowe Jason Lowe added a comment -

          "I think we can close this as dup of that. What do you think?"

          I don't care much if we want to close this one for that one or vice-versa, just that we shouldn't keep both open. Since this is the one that has a patch, I'll go ahead and comment on the patch here as Eric has also done.

          "seems only necessary to wait before launching another AM immediately"

          I agree with Eric that it looks like another place was missed in the test. IIUC we launch AM1 then wait for it to enter the FAILED state then launch AM2. This patch changes that to do a more thorough wait before trying to launch AM2. However later in the same test we wait for the second AM to fail and launch a third attempt, which looks like the same case we're trying to fix – waiting for a previous AM to fully stop before immediately launching another attempt:

              rm2.waitForState(am2.getApplicationAttemptId(), RMAppAttemptState.FAILED);
              launchAM(rmApp, rm2, nm1);
              Assert.assertEquals(3, rmApp.getAppAttempts().size());
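
The pattern discussed in this thread can be sketched in isolation. As the comments above suggest, seeing the attempt reach FAILED does not by itself guarantee that the scheduler has finished stopping it, so the test performs a second, bounded wait before launching the next AM. The self-contained Java sketch below illustrates only that polling idea. The class `WaitForAttemptStopped`, the method `waitFor`, and the `AtomicBoolean` flag are hypothetical stand-ins invented for illustration; they are not the Hadoop test API, and the signature of the real `waitSchedulerApplicationAttemptStopped` helper is not shown in this issue.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names come from Hadoop.
final class WaitForAttemptStopped {

  /**
   * Polls the condition until it holds or the timeout elapses.
   * Returns true if the condition became true, false on timeout or interrupt.
   */
  public static boolean waitFor(BooleanSupplier condition,
                                long timeoutMs, long intervalMs) {
    final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out before the condition held
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    // Stand-in for the scheduler-side "attempt has stopped" state (assumption).
    AtomicBoolean attemptStopped = new AtomicBoolean(false);

    // Simulates the scheduler finishing its cleanup asynchronously,
    // some time after the attempt was already reported as FAILED.
    Thread scheduler = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      attemptStopped.set(true);
    });
    scheduler.start();

    // The extra wait: only launch the next attempt once this returns true.
    boolean stopped = waitFor(attemptStopped::get, 5_000, 20);
    scheduler.join();
    System.out.println("previous attempt stopped: " + stopped);
  }
}
```

In the test under discussion, a wait of this kind would sit between the wait for FAILED and each subsequent `launchAM` call, which is why the reviewers ask for it before the third attempt as well as the second.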
           
          ebadger Eric Badger added a comment - edited

          Thanks for the response, Jason. That's exactly what I was thinking. I believe that it would mitigate the error that Mit Desai posted in his stack trace on YARN-1468. I'm fine with leaving this open since we have a patch here, but we need to make sure that we address all failures across both Jiras.

          jlowe Jason Lowe added a comment -

          Cancelling patch until review comments are addressed.

          djp Junping Du added a comment -

          Sorry for missing the comments above, Eric Badger and Jason Lowe. I have just uploaded the v2 patch.

          hadoopqa Hadoop QA added a comment -
          +1 overall

          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 14s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 2 new or modified test files.
          +1    mvninstall  14m 13s  trunk passed
          +1    compile     0m 38s   trunk passed
          +1    checkstyle  0m 24s   trunk passed
          +1    mvnsite     0m 40s   trunk passed
          +1    mvneclipse  0m 17s   trunk passed
          +1    findbugs    1m 11s   trunk passed
          +1    javadoc     0m 23s   trunk passed
          +1    mvninstall  0m 38s   the patch passed
          +1    compile     0m 34s   the patch passed
          +1    javac       0m 34s   the patch passed
          -0    checkstyle  0m 20s   hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 100 unchanged - 2 fixed = 103 total (was 102)
          +1    mvnsite     0m 34s   the patch passed
          +1    mvneclipse  0m 15s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    1m 24s   the patch passed
          +1    javadoc     0m 21s   the patch passed
          +1    unit        41m 47s  hadoop-yarn-server-resourcemanager in the patch passed.
          +1    asflicense  0m 16s   The patch does not generate ASF License warnings.
                            65m 30s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-5416
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846700/YARN-5416-v2.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux b9513ef2e4c4 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / e692316
          Default Java 1.8.0_111
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/14631/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14631/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14631/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          ebadger Eric Badger added a comment -

          Junping Du, patch looks good to me. Should probably clean up the checkstyle errors though (at least the unused imports, which are easy).

          jlowe Jason Lowe added a comment -

          +1 lgtm. I'll fix the unused import checkstyle nits during the commit.

          jlowe Jason Lowe added a comment -

          Thanks to Junping Du for the contribution and to Eric Badger for additional review! I committed this to trunk and branch-2.

          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11107 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11107/)
          YARN-5416. TestRMRestart#testRMRestartWaitForPreviousAMToFinish failed (jlowe: rev 357eab95668dbc419239857ac5ce763d76fd40e7)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerUtils.java

            People

            • Assignee: djp Junping Du
            • Reporter: djp Junping Du
            • Votes: 0
            • Watchers: 11