Uploaded image for project: 'Hadoop Map/Reduce'
  1. Hadoop Map/Reduce
  2. MAPREDUCE-3058

Sometimes task keeps on running while its Syslog says that it is shutdown

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 0.23.0
    • 0.23.0
    • contrib/gridmix, mrv2
    • None
    • Reviewed

    Description

      While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking on the Job-page, found one of its reduces to be stuck. Looking at syslog of the stuck reducer, found this:
      Task-logs' head:

      2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
      2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
      

      Task-logs' tail:

      2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as <DATANODE1>
      2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in pipeline  <DATANODE2>,  <DATANODE1>: bad datanode  <DATANODE1>
      2011-09-19 18:06:49,818 DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: lastAckedSeqno = 26870
      2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #454
      2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <<NAMENODE> from gridperf got value #454
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: getAdditionalDatanode 8
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to datanode <DATANODE2>
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 131071
      2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
      java.io.EOFException: Premature EOF: no length prefix available
              at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
      2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException: Premature EOF: no length prefix available
              at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
      
      2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
      2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #455
      2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
      2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
      2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #456
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf got value #456
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #457
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: reportDiagnosticInfo 1
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
      2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
      2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source UgiMetrics
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.source.JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
      

      Which means that tasks is supposed to have stopped within 20 secs, whereas the process itself is stuck for more than 15 hours. From AM log, also found that this task was sending its update regularly. ps -ef | grep java was also showing that process is still alive.

      Attachments

        1. MR-3058.wip.patch
          23 kB
          Hitesh Shah
        2. MAPREDUCE-3058-20111021.txt
          4 kB
          Vinod Kumar Vavilapalli
        3. MAPREDUCE-3058-20110923.txt
          0.8 kB
          Vinod Kumar Vavilapalli

        Activity

          People

            vinodkv Vinod Kumar Vavilapalli
            karams Karam Singh
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: