Hadoop YARN / YARN-4581

AHS writer thread leak makes RM crash while RM is recovering

    Details

      Description

      We enabled ApplicationHistoryWriter and found thousands of errors like this:

      2016-01-08 03:13:03,441 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Error when openning history file of application application_1451878591907_0197
      java.io.IOException: Output file not at zero offset.
      at org.apache.hadoop.io.file.tfile.BCFile$Writer.<init>(BCFile.java:288)
      at org.apache.hadoop.io.file.tfile.TFile$Writer.<init>(TFile.java:288)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:728)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
      at java.lang.Thread.run(Thread.java:745)

      This eventually causes the RM to crash:

      2016-01-08 03:13:08,335 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
      java.lang.OutOfMemoryError: unable to create new native thread
      at java.lang.Thread.start0(Native Method)
      at java.lang.Thread.start(Thread.java:714)
      at org.apache.hadoop.hdfs.DFSOutputStream.start(DFSOutputStream.java:2033)
      at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1652)
      at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1573)
      at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1603)
      at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1591)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:328)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:324)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:324)
      at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:723)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
      at java.lang.Thread.run(Thread.java:745)

      After several failovers, by the time the RM finishes recovering, thousands of HDFS client threads have leaked in the RM:

      "Thread-22723" #22893 daemon prio=5 os_prio=0 tid=0x00007f75f0346000 nid=0x132e in Object.wait() [0x00007f74ea7ca000]
      java.lang.Thread.State: TIMED_WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:502)

      - locked <0x0000000745f88b98> (a java.util.LinkedList)
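One way to confirm a leak like the one in the thread dump above is to count the live threads that have a frame in a given class. The following is a minimal sketch using only the JDK; `ThreadCensus`, `Parked`, and `demo` are hypothetical names invented for this illustration and are not part of Hadoop. In an RM-like JVM the fragment to search for would be the `DFSOutputStream$DataStreamer` class name seen in the dump.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper for illustration; not part of Hadoop.
class ThreadCensus {
    // Counts live threads with at least one stack frame in a class whose
    // name contains the given fragment.
    static int countThreadsIn(String classNameFragment) {
        int count = 0;
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : e.getValue()) {
                if (frame.getClassName().contains(classNameFragment)) {
                    count++;
                    break; // count each thread once
                }
            }
        }
        return count;
    }

    // Stand-in for a background streamer thread that waits indefinitely.
    static class Parked implements Runnable {
        private final CountDownLatch started;
        private final CountDownLatch release;

        Parked(CountDownLatch started, CountDownLatch release) {
            this.started = started;
            this.release = release;
        }

        @Override
        public void run() {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Starts n parked threads, counts them, then releases and joins them.
    static int demo(int n) {
        CountDownLatch started = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(new Parked(started, release));
            t.setDaemon(true);
            t.start();
            threads.add(t);
        }
        try {
            started.await(); // every thread is now inside Parked.run()
            int seen = countThreadsIn(Parked.class.getName());
            release.countDown();
            for (Thread t : threads) {
                t.join();
            }
            return seen;
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ie);
        }
    }

    public static void main(String[] args) {
        System.out.println("parked threads seen: " + demo(5));
    }
}
```

Sampling such a count periodically would show whether the number of streamer threads keeps growing across failovers.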

        Activity

        Naganarasimha Naganarasimha G R added a comment -

        Given that AHS (including FileSystemWriter) is already deprecated and planned to be removed completely (YARN-4542), do we need to work further on this issue?
        cc/ Junping Du

        Naganarasimha Naganarasimha G R added a comment -

        Correction: by FileSystemWriter I meant FileSystemApplicationHistoryStore.

        sandflee sandflee added a comment -

        A simple fix for the thread leak problem.
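The patch itself is not reproduced in this thread. Reading the two stack traces in the description, one plausible mechanism is that `HistoryFileWriter.<init>` first opens an output stream via `append` (line 723), and the `TFile.Writer` constructor then throws "Output file not at zero offset." (line 728) without the already-opened stream being closed. The sketch below illustrates that general pattern and the usual remedy of closing the stream on the failure path. Every name in it (`LeakSketch`, `FakeStream`, `FakeFormatWriter`, `openLeaky`, `openSafely`) is a hypothetical stand-in, a counter takes the place of the real streamer threads, and it should not be read as the committed change.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration; none of these names come from Hadoop.
class LeakSketch {
    // Number of streams opened but not yet closed. In the real system each
    // such stream would own a background streamer thread.
    static final AtomicInteger OPEN_STREAMS = new AtomicInteger();

    static class FakeStream implements Closeable {
        private boolean closed;

        FakeStream() {
            OPEN_STREAMS.incrementAndGet();
        }

        @Override
        public synchronized void close() {
            if (!closed) {
                closed = true;
                OPEN_STREAMS.decrementAndGet();
            }
        }
    }

    // Stand-in for a format writer that rejects a file opened for append.
    static class FakeFormatWriter {
        FakeFormatWriter(FakeStream out, long offset) throws IOException {
            if (offset != 0) {
                throw new IOException("Output file not at zero offset.");
            }
        }
    }

    // Leaky variant: if the writer constructor throws, the stream stays open.
    static FakeFormatWriter openLeaky(long offset) throws IOException {
        FakeStream out = new FakeStream();
        return new FakeFormatWriter(out, offset);
    }

    // Safe variant: close the stream before propagating the failure.
    static FakeFormatWriter openSafely(long offset) throws IOException {
        FakeStream out = new FakeStream();
        try {
            return new FakeFormatWriter(out, offset);
        } catch (IOException e) {
            out.close();
            throw e;
        }
    }

    // Repeats a failing open and reports how many streams remain open.
    static int leakedAfter(boolean safe, int attempts) {
        OPEN_STREAMS.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                if (safe) {
                    openSafely(1L);
                } else {
                    openLeaky(1L);
                }
            } catch (IOException expected) {
                // The caller logs the error and carries on, as in the report.
            }
        }
        return OPEN_STREAMS.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(false, 1000));
        System.out.println("safe:  " + leakedAfter(true, 1000));
    }
}
```

With the leaky variant every failed open leaves one stream behind, which matches the shape of the report: thousands of failed opens during recovery, then thousands of leftover threads.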

        djp Junping Du added a comment -

        Hi sandflee, thanks for reporting the issue and delivering the patch.
        As Naga mentioned above, AHS is already deprecated in the community, and ATS (Application Timeline Service) has been its replacement since 2.6.0. Do you plan to migrate from AHS to ATS?

        sandflee sandflee added a comment -

        Thanks Naganarasimha G R and Junping Du. Our cluster is based on 2.4.1, and will use ATS until we update the cluster.

        vinodkv Vinod Kumar Vavilapalli added a comment -

        sandflee / Junping Du / Naganarasimha G R Given that the patch is straightforward, shall we just get it in for folks on older versions? I don't see any downside to including the patch.

        Naganarasimha Naganarasimha G R added a comment -

        Hi Vinod Kumar Vavilapalli, Junping Du
        Yes, it probably makes sense to make this available in 2.6.x and 2.7.x, and we can remove the FileStore-based AHS in later versions as part of YARN-4542. Thoughts?

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 0s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 7m 35s trunk passed
        +1 compile 0m 16s trunk passed with JDK v1.8.0_66
        +1 compile 0m 19s trunk passed with JDK v1.7.0_91
        +1 checkstyle 0m 10s trunk passed
        +1 mvnsite 0m 24s trunk passed
        +1 mvneclipse 0m 15s trunk passed
        +1 findbugs 0m 37s trunk passed
        +1 javadoc 0m 15s trunk passed with JDK v1.8.0_66
        +1 javadoc 0m 19s trunk passed with JDK v1.7.0_91
        +1 mvninstall 0m 18s the patch passed
        +1 compile 0m 14s the patch passed with JDK v1.8.0_66
        +1 javac 0m 14s the patch passed
        +1 compile 0m 16s the patch passed with JDK v1.7.0_91
        +1 javac 0m 16s the patch passed
        -1 checkstyle 0m 10s Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice (total was 23, now 23).
        +1 mvnsite 0m 21s the patch passed
        +1 mvneclipse 0m 12s the patch passed
        +1 whitespace 0m 0s Patch has no whitespace issues.
        +1 findbugs 0m 41s the patch passed
        +1 javadoc 0m 13s the patch passed with JDK v1.8.0_66
        +1 javadoc 0m 17s the patch passed with JDK v1.7.0_91
        +1 unit 3m 48s hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.8.0_66.
        +1 unit 4m 7s hadoop-yarn-server-applicationhistoryservice in the patch passed with JDK v1.7.0_91.
        +1 asflicense 0m 20s Patch does not generate ASF License warnings.
        22m 11s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:0ca8df7
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12781743/YARN-4581.01.patch
        JIRA Issue YARN-4581
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 69b2b038e401 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / c0537bc
        Default Java 1.7.0_91
        Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/10262/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt
        JDK v1.7.0_91 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/10262/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
        Max memory used 76MB
        Powered by Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/10262/console

        This message was automatically generated.

        djp Junping Du added a comment -

        I agree with Vinod and Naga that we should include it for users on older versions.
        +1 on the patch. It is very straightforward, so having no test should be fine. I will commit it shortly if there are no further comments from others.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9119 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9119/)
        YARN-4581. AHS writer thread leak makes RM crash while RM is recovering. (junping_du: rev fc6d3a3b234efff2b0b646c31a4e6ff0a5118ef9)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java
        djp Junping Du added a comment -

        I have committed the patch to trunk, branch-2, branch-2.6, branch-2.7 and branch-2.8. Thanks sandflee for contributing the patch! Also, thanks to Vinod and Naga for the review and comments.

        sandflee sandflee added a comment -

        thanks Junping, Naga, Vinod!

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.


          People

          • Assignee: sandflee
          • Reporter: sandflee
          • Votes: 0
          • Watchers: 10
