YARN-4581: AHS writer thread leak makes RM crash while RM is recovering

Details

    Description

      We enabled the ApplicationHistoryWriter and found thousands of errors like this one:

      2016-01-08 03:13:03,441 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Error when openning history file of application application_1451878591907_0197
      java.io.IOException: Output file not at zero offset.
      at org.apache.hadoop.io.file.tfile.BCFile$Writer.<init>(BCFile.java:288)
      at org.apache.hadoop.io.file.tfile.TFile$Writer.<init>(TFile.java:288)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:728)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
      at java.lang.Thread.run(Thread.java:745)

      This eventually crashed the RM:

      2016-01-08 03:13:08,335 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
      java.lang.OutOfMemoryError: unable to create new native thread
      at java.lang.Thread.start0(Native Method)
      at java.lang.Thread.start(Thread.java:714)
      at org.apache.hadoop.hdfs.DFSOutputStream.start(DFSOutputStream.java:2033)
      at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1652)
      at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1573)
      at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1603)
      at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1591)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:328)
      at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:324)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:324)
      at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.<init>(FileSystemApplicationHistoryStore.java:723)
      at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:140)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
      at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
      at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:191)
      at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:124)
      at java.lang.Thread.run(Thread.java:745)

      After several failovers, once the RM had finished recovering, thousands of HDFS client threads had leaked in the RM:

      "Thread-22723" #22893 daemon prio=5 os_prio=0 tid=0x00007f75f0346000 nid=0x132e in Object.wait() [0x00007f74ea7ca000]
      java.lang.Thread.State: TIMED_WAITING (on object monitor)
      at java.lang.Object.wait(Native Method)
      at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:502)
      - locked <0x0000000745f88b98> (a java.util.LinkedList)
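
One plausible reading of the two traces above: HistoryFileWriter.<init> first opens an append stream (FileSystemApplicationHistoryStore.java:723), then the TFile writer constructor rejects it with "Output file not at zero offset" (line 728), and the stream that was already opened is never closed, so its DataStreamer thread stays alive. The sketch below illustrates that pattern and the usual remedy of closing the stream when a later construction step fails. It does not use any Hadoop classes: FakeAppendStream, FakeFormatWriter, LeakyHistoryWriter and SafeHistoryWriter are hypothetical stand-ins, and this is not necessarily what the attached patch does.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class WriterLeakSketch {
    // Number of streams currently open; stands in for live streamer threads.
    static final AtomicInteger OPEN_STREAMS = new AtomicInteger();

    /** Hypothetical stand-in for an append stream that owns a background thread. */
    static class FakeAppendStream implements Closeable {
        private final long position;
        private boolean closed;

        FakeAppendStream(long position) {
            this.position = position;
            OPEN_STREAMS.incrementAndGet();
        }

        long getPos() {
            return position;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                OPEN_STREAMS.decrementAndGet();
            }
        }
    }

    /** Hypothetical stand-in for a format writer that only accepts empty output. */
    static class FakeFormatWriter {
        FakeFormatWriter(FakeAppendStream out) throws IOException {
            if (out.getPos() != 0) {
                throw new IOException("Output file not at zero offset.");
            }
        }
    }

    /** Leaky variant: if the second step throws, the stream is never closed. */
    static class LeakyHistoryWriter {
        private FakeAppendStream stream;
        private FakeFormatWriter writer;

        LeakyHistoryWriter(long existingLength) throws IOException {
            stream = new FakeAppendStream(existingLength);
            writer = new FakeFormatWriter(stream);
        }
    }

    /** Safe variant: close the stream before rethrowing the failure. */
    static class SafeHistoryWriter {
        private FakeAppendStream stream;
        private FakeFormatWriter writer;

        SafeHistoryWriter(long existingLength) throws IOException {
            FakeAppendStream s = new FakeAppendStream(existingLength);
            try {
                writer = new FakeFormatWriter(s);
            } catch (IOException e) {
                s.close();
                throw e;
            }
            stream = s;
        }
    }

    /** Tries to open a writer on a non-empty file and returns the leaked stream count. */
    static int leakedAfter(boolean safe, int attempts) {
        OPEN_STREAMS.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                if (safe) {
                    new SafeHistoryWriter(128);
                } else {
                    new LeakyHistoryWriter(128);
                }
            } catch (IOException expected) {
                // Mirrors the "Error when openning history file" log entries.
            }
        }
        return OPEN_STREAMS.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(false, 1000)); // prints "leaky: 1000"
        System.out.println("safe: " + leakedAfter(true, 1000));   // prints "safe: 0"
    }
}
```

In the leaky variant every failed open leaves one stream behind, which matches the shape of the symptom reported here: each "Output file not at zero offset" error during recovery would add one more idle streamer thread until the JVM can no longer create native threads.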

      Attachments

        1. YARN-4581.01.patch
          2 kB
          sandflee

          People

            sandflee