Hadoop YARN / YARN-6068

Log aggregation get failed when NM restart even with recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.8.0, 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The exception log is as follows:

      2017-01-05 19:16:36,352 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:abortLogAggregation(527)) - Aborting log aggregation for application_1483640789847_0001
      2017-01-05 19:16:36,352 WARN  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:run(399)) - Aggregation did not complete for application application_1483640789847_0001
      2017-01-05 19:16:36,353 WARN  application.ApplicationImpl (ApplicationImpl.java:handle(461)) - Can't handle this event at current state
      org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: APPLICATION_LOG_HANDLING_FAILED at RUNNING
              at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
              at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
              at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:459)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl.handle(ApplicationImpl.java:64)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1084)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher.handle(ContainerManagerImpl.java:1076)
              at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
              at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
              at java.lang.Thread.run(Thread.java:745)
      2017-01-05 19:16:36,355 INFO  application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application application_1483640789847_0001 transitioned from RUNNING to null
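
      One plausible reading of the trace above is that the application state machine has no transition registered for APPLICATION_LOG_HANDLING_FAILED while in RUNNING, so the event is rejected, and the handler then logs the unset new state as "null". The sketch below illustrates that pattern only. TransitionSketch, its states, and the FINISH event are hypothetical names and are not Hadoop's StateMachineFactory or ApplicationImpl; only the event name APPLICATION_LOG_HANDLING_FAILED is taken from the log.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical table-driven state machine; NOT Hadoop's StateMachineFactory.
final class TransitionSketch {
  enum State { RUNNING, DONE }
  // FINISH is an invented event; the second name is copied from the log above.
  enum Event { FINISH, APPLICATION_LOG_HANDLING_FAILED }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.RUNNING;

  TransitionSketch() {
    // Only RUNNING --FINISH--> DONE is registered (an assumption for illustration).
    Map<Event, State> fromRunning = new EnumMap<>(Event.class);
    fromRunning.put(Event.FINISH, State.DONE);
    table.put(State.RUNNING, fromRunning);
  }

  State current() {
    return current;
  }

  // Throws when no transition is registered for (current state, event).
  State doTransition(Event event) {
    Map<Event, State> row = table.get(current);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    current = next;
    return next;
  }

  // The caller catches the rejection, so newState is still null when it is logged.
  String handle(Event event) {
    State oldState = current;
    State newState = null;
    try {
      newState = doTransition(event);
    } catch (IllegalStateException e) {
      System.out.println("Can't handle this event at current state: " + e.getMessage());
    }
    return "transitioned from " + oldState + " to " + newState;
  }

  public static void main(String[] args) {
    TransitionSketch sm = new TransitionSketch();
    System.out.println(sm.handle(Event.APPLICATION_LOG_HANDLING_FAILED));
  }
}
```

      In this sketch the machine itself stays in RUNNING after the rejected event; only the logged "new state" is null.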
      
      1. YARN-6068.patch
        2 kB
        Junping Du
      2. YARN-6068-v2.patch
        1 kB
        Junping Du

        Activity

        djp Junping Du added a comment -

        In YARN-4325, we started sending an aggregation failure event to fix the problem of applications leaking in the NM state store. However, we missed one case: log aggregation can abort rather than finish when the NM restarts. In that case, we shouldn't send the aggregation failure event.
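
        The intended behavior can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch: AggregatorSketch, its events list, and the run/abort methods are hypothetical stand-ins for AppLogAggregatorImpl and the NM event dispatcher. Only the event name APPLICATION_LOG_HANDLING_FAILED comes from the log in the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for AppLogAggregatorImpl; NOT the real Hadoop class.
final class AggregatorSketch {
  static final String FAILED_EVENT = "APPLICATION_LOG_HANDLING_FAILED";

  private final AtomicBoolean finished = new AtomicBoolean(false);
  private final AtomicBoolean aborted = new AtomicBoolean(false);
  // Stand-in for the event dispatcher: records the events that would be sent.
  final List<String> events = new ArrayList<>();

  // Called when the NM stops with recovery enabled: stop without finishing.
  void abort() {
    aborted.set(true);
  }

  // Simulates the aggregation thread; 'completes' says whether the work finished normally.
  void run(boolean completes) {
    try {
      if (completes && !aborted.get()) {
        finished.set(true);
      }
    } finally {
      // Report a failure only when aggregation neither finished nor was aborted.
      if (!finished.get() && !aborted.get()) {
        events.add(FAILED_EVENT);
      }
    }
  }

  public static void main(String[] args) {
    AggregatorSketch restarted = new AggregatorSketch();
    restarted.abort();
    restarted.run(false);
    System.out.println("aborted on restart -> events: " + restarted.events);

    AggregatorSketch failed = new AggregatorSketch();
    failed.run(false);
    System.out.println("genuine failure    -> events: " + failed.events);
  }
}
```

        The point of the extra check is that an abort on restart leaves the application to be recovered later, so it should not be reported to the application state machine as a log handling failure.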

        djp Junping Du added a comment -

        Uploaded a quick patch to fix it. It should be straightforward enough to go in without a unit test.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 14s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 12m 44s trunk passed
        +1 compile 0m 27s trunk passed
        +1 checkstyle 0m 16s trunk passed
        +1 mvnsite 0m 26s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 40s trunk passed
        +1 javadoc 0m 18s trunk passed
        +1 mvninstall 0m 22s the patch passed
        +1 compile 0m 23s the patch passed
        +1 javac 0m 23s the patch passed
        -0 checkstyle 0m 14s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 15 unchanged - 0 fixed = 18 total (was 15)
        +1 mvnsite 0m 23s the patch passed
        +1 mvneclipse 0m 11s the patch passed
        -1 whitespace 0m 0s The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
        +1 findbugs 0m 45s the patch passed
        +1 javadoc 0m 14s the patch passed
        +1 unit 12m 49s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        32m 9s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6068
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846158/YARN-6068.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 35b9e0478b31 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / a59df15
        Default Java 1.8.0_111
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/14597/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        whitespace https://builds.apache.org/job/PreCommit-YARN-Build/14597/artifact/patchprocess/whitespace-eol.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14597/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14597/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        varun_saxena Varun Saxena added a comment -

        Thanks Junping Du for raising the issue. We in fact saw the exact same issue in our clusters last night.
        The changes as such look fine to me.
        In the patch, we have added an additional log ("Log aggregation abort for application .... due to NM restart"). I think this is not required. We already print a log when we call AppLogAggregatorImpl#abortLogAggregation. That should be good enough, I guess.

        djp Junping Du added a comment -

        Thanks Varun Saxena for the review and comments! You are right that we already have an info log for this case. The v2 patch incorporates your comments.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 16s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 14m 18s trunk passed
        +1 compile 0m 32s trunk passed
        +1 checkstyle 0m 19s trunk passed
        +1 mvnsite 0m 31s trunk passed
        +1 mvneclipse 0m 15s trunk passed
        +1 findbugs 0m 47s trunk passed
        +1 javadoc 0m 20s trunk passed
        +1 mvninstall 0m 27s the patch passed
        +1 compile 0m 29s the patch passed
        +1 javac 0m 29s the patch passed
        +1 checkstyle 0m 16s the patch passed
        +1 mvnsite 0m 27s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 54s the patch passed
        +1 javadoc 0m 20s the patch passed
        +1 unit 13m 31s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        35m 37s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6068
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846206/YARN-6068-v2.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2c8e8c72e361 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / ac16400
        Default Java 1.8.0_111
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14602/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14602/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        djp Junping Du added a comment -

        Marking this as a blocker for 2.8, as the issue will break the work-preserving NM restart feature. Can someone take a quick look and commit, as our RC is almost out the door?

        varun_saxena Varun Saxena added a comment -

        +1 LGTM.
        Will commit it shortly.

        varun_saxena Varun Saxena added a comment -

        Committed to trunk, branch-2 and branch-2.8
        Thanks Junping Du for raising the issue and fixing it.

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11088 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11088/)
        YARN-6068. Log aggregation get failed when NM restart even with recovery (varunsaxena: rev f59e36b4ce71d3019ab91b136b6d7646316954e7)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java

          People

          • Assignee:
            djp Junping Du
            Reporter:
            djp Junping Du
          • Votes:
            0
            Watchers:
            5
