  1. Hadoop YARN
  2. YARN-5023

TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry random failure

    Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: None
    • Labels: None
    • Target Version/s:
    • Hadoop Flags: Reviewed

      Description

      https://builds.apache.org/job/PreCommit-YARN-Build/11296/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_91.txt

      Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 96.482 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
      testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart)  Time elapsed: 56.467 sec  <<< FAILURE!
      java.lang.AssertionError: Attempt state is not correct (timeout). expected:<SCHEDULED> but was:<ALLOCATED>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:266)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:225)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:207)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForAttemptScheduled(MockRM.java:955)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:942)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:961)
      	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForNewAMToLaunchAndRegister(MockRM.java:295)
      	at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:647)
      

        Activity

        sandflee sandflee added a comment -
            // launch next AM in nm2
            nm2.nodeHeartbeat(true);
            MockAM am5 =
                rm1.waitForNewAMToLaunchAndRegister(app1.getApplicationId(), 5, nm2);
        

        It seems we should not send this nodeHeartbeat. Since the NODE STATUS_UPDATE event is processed asynchronously, there can be a race condition in which the app attempt becomes ALLOCATED while we are still waiting for SCHEDULED.
        Bibin A Chundatt, if you have not started working on this, I'd like to upload a patch.
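
        The race described above can be illustrated with a small, self-contained sketch. It does not use the real MockRM: ToyRM, drainAndStop, isInState and strictWaitSucceeds are hypothetical names invented for this illustration, and the state machine is reduced to two states. The sketch forces the unlucky ordering (the asynchronously queued heartbeat is handled before the check) to show why a strict wait for SCHEDULED can then fail.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class HeartbeatRaceSketch {

    public enum AttemptState { SCHEDULED, ALLOCATED }

    /** Hypothetical stand-in for the RM; this is NOT the real MockRM. */
    public static final class ToyRM {
        private final AtomicReference<AttemptState> state =
            new AtomicReference<>(AttemptState.SCHEDULED);
        // Events are handled on a separate thread, like an async dispatcher.
        private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

        /** Queues a heartbeat and returns at once; the transition happens later. */
        public void nodeHeartbeat() {
            dispatcher.execute(() -> {
                state.compareAndSet(AttemptState.SCHEDULED, AttemptState.ALLOCATED);
            });
        }

        /** Waits until every queued event has been handled, then stops the dispatcher. */
        public void drainAndStop() {
            dispatcher.shutdown();
            try {
                dispatcher.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public AttemptState getState() {
            return state.get();
        }
    }

    /** Strict check: true only if the attempt is in exactly the expected state. */
    public static boolean isInState(ToyRM rm, AttemptState expected) {
        return rm.getState() == expected;
    }

    /** Runs one scenario and reports whether a strict wait for SCHEDULED would pass. */
    public static boolean strictWaitSucceeds(boolean sendExtraHeartbeat) {
        ToyRM rm = new ToyRM();
        if (sendExtraHeartbeat) {
            rm.nodeHeartbeat();
        }
        // Force the unlucky ordering: the queued event is handled before we check.
        rm.drainAndStop();
        return isInState(rm, AttemptState.SCHEDULED);
    }

    public static void main(String[] args) {
        System.out.println("no extra heartbeat:   " + strictWaitSucceeds(false));
        System.out.println("with extra heartbeat: " + strictWaitSucceeds(true));
    }
}
```

        In a real run the ordering is not forced, so the strict wait passes or fails depending on thread timing, which would match the random nature of the reported failure.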

        bibinchundatt Bibin A Chundatt added a comment -

        sandflee
        I haven't started working on this JIRA. Please feel free to upload a patch.

        sunilg Sunil G added a comment -

        Yes sandflee
        We have recently faced similar issues. Internally, launchAM also invokes nodeHeartbeat, so the states go one step ahead because of the second heartbeat. Could you please attach a patch along these lines?

        sandflee sandflee added a comment -

        Hi Sunil G, I think the main problem is that we shouldn't send the first nodeHeartbeat; the second nodeHeartbeat is necessary because it drives the RMAppAttempt state change from SCHEDULED to ALLOCATED. Correct me if I missed something.

        sandflee sandflee added a comment -

        One thing we could maybe improve: even if the first nodeHeartbeat is sent, launchAM should still succeed.
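
        One possible reading of this suggestion is a wait helper that accepts the target state or any later one, so an early heartbeat no longer breaks the wait. The sketch below is only an illustration under assumptions: it is not taken from the committed patch, waitForAtLeast is a hypothetical helper, and the states are simplified to a linear order that the real attempt state machine may not follow.

```java
import java.util.function.Supplier;

public class TolerantWaitSketch {

    /** Simplified, linearly ordered states (an assumption made for illustration). */
    public enum AttemptState { SUBMITTED, SCHEDULED, ALLOCATED, LAUNCHED }

    /**
     * Polls until the current state has reached the target or moved past it.
     * Returns false if that has not happened after maxPolls checks.
     */
    public static boolean waitForAtLeast(Supplier<AttemptState> current,
                                         AttemptState target,
                                         int maxPolls,
                                         long sleepMillis) {
        for (int i = 0; i < maxPolls; i++) {
            // Enum.compareTo orders constants by declaration order.
            if (current.get().compareTo(target) >= 0) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The attempt already moved to ALLOCATED; a tolerant wait for SCHEDULED still passes.
        boolean ok = waitForAtLeast(() -> AttemptState.ALLOCATED, AttemptState.SCHEDULED, 3, 1);
        System.out.println("reached SCHEDULED or later: " + ok);
    }
}
```

        The trade-off is that a tolerant wait can hide a state the test really wanted to observe, so removing the redundant heartbeat from the test remains the more direct fix.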

        sunilg Sunil G added a comment -

        Yes sandflee. I was trying to say the same thing; sorry for the lack of clarity earlier. The two node heartbeats caused the issue. As you suggested, we can remove the first one from the test case, since it is also invoked internally by launchAM.

        I don't see any point in sending the heartbeat explicitly from the test case. As Rohith Sharma K S and I explained in the YARN-4478 summary, we have seen many cases where contributors used APIs from MockRM without knowing that nodeHeartbeat is sent internally. This is a good example of that.

        sandflee sandflee added a comment -

        Updated the patch: it simply removes the first nodeHeartbeat before launchAM. I noticed that TestRM#testNMTokenSentForNormalContainer had the same problem, so I fixed it in the same patch.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 12m 10s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 6m 41s trunk passed
        +1 compile 0m 28s trunk passed with JDK v1.8.0_91
        +1 compile 0m 28s trunk passed with JDK v1.7.0_95
        +1 checkstyle 0m 18s trunk passed
        +1 mvnsite 0m 36s trunk passed
        +1 mvneclipse 0m 15s trunk passed
        +1 findbugs 1m 7s trunk passed
        +1 javadoc 0m 23s trunk passed with JDK v1.8.0_91
        +1 javadoc 0m 27s trunk passed with JDK v1.7.0_95
        +1 mvninstall 0m 32s the patch passed
        +1 compile 0m 28s the patch passed with JDK v1.8.0_91
        +1 javac 0m 28s the patch passed
        +1 compile 0m 28s the patch passed with JDK v1.7.0_95
        +1 javac 0m 28s the patch passed
        +1 checkstyle 0m 17s the patch passed
        +1 mvnsite 0m 35s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 1m 20s the patch passed
        +1 javadoc 0m 20s the patch passed with JDK v1.8.0_91
        +1 javadoc 0m 23s the patch passed with JDK v1.7.0_95
        -1 unit 44m 0s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_91.
        -1 unit 46m 20s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_95.
        +1 asflicense 0m 17s Patch does not generate ASF License warnings.
        119m 6s



        Reason Tests
        JDK v1.8.0_91 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
          hadoop.yarn.server.resourcemanager.TestAMAuthorization
        JDK v1.8.0_91 Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes
        JDK v1.7.0_95 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
          hadoop.yarn.server.resourcemanager.TestRMRestart
          hadoop.yarn.server.resourcemanager.TestAMAuthorization
        JDK v1.7.0_95 Timed out junit tests org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodes



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:cf2ee45
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12802070/YARN-5023.01.patch
        JIRA Issue YARN-5023
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux f91e754135c8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / ed54f5f
        Default Java 1.7.0_95
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_91 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_95
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/11326/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_91.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/11326/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/11326/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_91.txt https://builds.apache.org/job/PreCommit-YARN-Build/11326/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_95.txt
        JDK v1.7.0_95 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/11326/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/11326/console
        Powered by Apache Yetus 0.2.0 http://yetus.apache.org

        This message was automatically generated.

        sandflee sandflee added a comment -

        The test failures are not related to the patch. The TestRMRestart failure can be reproduced randomly; I opened YARN-5037 to track it.

        bibinchundatt Bibin A Chundatt added a comment -

        sandflee, in any of your test runs were you able to reproduce the failure for TestAMRestart.testRMAppAttemptFailuresValidityInterval?

        sandflee sandflee added a comment -

        Hi Bibin A Chundatt, I ran it several times and couldn't reproduce the TestAMRestart.testRMAppAttemptFailuresValidityInterval failure.

        sandflee sandflee added a comment -

        I wrote a script to run the test automatically and reproduced it. I filed YARN-5043; hope it helps.
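
        The script itself is not shown in this thread. As a rough idea of the technique only, the sketch below reruns a check until it fails and reports the first failing run. firstFailingRun is a hypothetical helper, and the flaky check in main is a stand-in that fails deterministically on its third call rather than a real YARN test.

```java
public class FlakyRepeatSketch {

    /**
     * Runs the check up to maxRuns times.
     * Returns the 1-based number of the first failing run, or -1 if every run passed.
     */
    public static int firstFailingRun(Runnable check, int maxRuns) {
        for (int run = 1; run <= maxRuns; run++) {
            try {
                check.run();
            } catch (AssertionError | RuntimeException e) {
                System.out.println("run " + run + " failed: " + e.getMessage());
                return run;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical stand-in for a flaky test: fails on its third invocation.
        Runnable flaky = () -> {
            if (++calls[0] == 3) {
                throw new AssertionError("expected:<SCHEDULED> but was:<ALLOCATED>");
            }
        };
        System.out.println("first failing run: " + firstFailingRun(flaky, 10));
    }
}
```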

        jianhe Jian He added a comment -

        Committed to trunk, branch-2, and branch-2.8. Thanks sandflee!
        Thanks Bibin A Chundatt for the review!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-trunk-Commit #10044 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10044/)
        YARN-5023. TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry (jianhe: rev c35a5a7a8d85b42498e6981a6b1f09f2bdd56459)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRM.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java

          People

          • Assignee: sandflee sandflee
          • Reporter: bibinchundatt Bibin A Chundatt
          • Votes: 0
          • Watchers: 7
