Hadoop Map/Reduce
MAPREDUCE-3058

Task sometimes keeps running while its syslog says that it has shut down

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: contrib/gridmix, mrv2
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      While running GridMixV3, one of the jobs got stuck for 15 hours. The job page showed that one of its reduce tasks was stuck. The syslog of the stuck reducer showed the following:
      Head of the task log:

      2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
      2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
      

      Tail of the task log:

      2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as <DATANODE1>
      2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in pipeline  <DATANODE2>,  <DATANODE1>: bad datanode  <DATANODE1>
      2011-09-19 18:06:49,818 DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: lastAckedSeqno = 26870
      2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #454
      2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf got value #454
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: getAdditionalDatanode 8
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to datanode <DATANODE2>
      2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 131071
      2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
      java.io.EOFException: Premature EOF: no length prefix available
              at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
      2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException: Premature EOF: no length prefix available
              at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
              at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
      
      2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
      2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #455
      2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
      2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
      2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #456
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf got value #456
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
      2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #457
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: reportDiagnosticInfo 1
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
      2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
      2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source UgiMetrics
      2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.source.JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
      2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
      2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
      

      This means that the task should have stopped within 20 seconds, whereas the process itself stayed stuck for more than 15 hours. The AM log also showed that this task was still sending its status updates regularly, and ps -ef | grep java showed that the process was still alive.
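The report does not say what kept the process alive after the metrics system shut down. One general JVM behavior that fits the symptom is that a JVM does not exit while any non-daemon thread is still running, unless something calls System.exit. The sketch below is a generic diagnostic under that assumption, not the diagnosed cause of this bug; the class name LiveThreadReport, its methods, and the thread name "stuck-streamer" are hypothetical and do not appear in Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper, not part of Hadoop: lists the threads that would
// keep a JVM alive even after its main thread has finished or failed.
final class LiveThreadReport {

    /** Names of live non-daemon threads, excluding the calling thread. */
    static List<String> nonDaemonThreadNames() {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && !t.isDaemon() && t != Thread.currentThread()) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /** Starts a non-daemon thread that blocks until the latch is released. */
    static Thread startBlockedWorker(String name, CountDownLatch release) {
        Thread worker = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        worker.setDaemon(false); // daemon status is otherwise inherited from the creator
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = startBlockedWorker("stuck-streamer", release);
        // While the worker is blocked, it shows up as a thread that prevents JVM exit.
        System.out.println("Non-daemon threads still alive: " + nonDaemonThreadNames());
        release.countDown();
        worker.join();
    }
}
```

On a live process, a thread dump (for example with the JDK's jstack tool) gives the same kind of information without changing the code.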

      Attachments

      1. MAPREDUCE-3058-20110923.txt (0.8 kB, Vinod Kumar Vavilapalli)
      2. MAPREDUCE-3058-20111021.txt (4 kB, Vinod Kumar Vavilapalli)
      3. MR-3058.wip.patch (23 kB, Hitesh Shah)

        Activity

        Transition                    Time In Source Status   Execution Times   Last Executer             Last Execution Date
        Open -> Patch Available       30d 1h 38m              1                 Vinod Kumar Vavilapalli   21/Oct/11 15:32
        Patch Available -> Resolved   15h 43m                 1                 Vinod Kumar Vavilapalli   22/Oct/11 07:15
        Resolved -> Closed            23d 18h 33m             1                 Arun C Murthy             15/Nov/11 00:48
        Arun C Murthy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #838 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/838/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187654
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
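The commit message above describes the fix as making YarnChild report failure when the task throws an error. The following is a minimal sketch of that general pattern, not the committed patch: the class ChildRunnerSketch, the FailureReporter interface, and the exit-code convention are all hypothetical names introduced for illustration. The idea is to catch Throwable (so that Errors are covered as well as Exceptions), tell the parent about the failure, and hand back a non-zero exit code.

```java
// Hypothetical sketch of a child-process runner; not the YarnChild code.
final class ChildRunnerSketch {

    /** Hypothetical stand-in for the channel a child uses to notify its parent. */
    interface FailureReporter {
        void reportFailure(String taskId, String message);
    }

    /**
     * Runs the task and returns an exit code: 0 on success, 1 if the task
     * threw anything at all. The failure is reported before returning.
     */
    static int runTask(String taskId, Runnable task, FailureReporter reporter) {
        try {
            task.run();
            return 0;
        } catch (Throwable t) { // Throwable, so that Errors are reported too
            try {
                reporter.reportFailure(taskId, String.valueOf(t));
            } catch (Throwable reportingFailure) {
                // Reporting itself failed; still return the failure code.
            }
            return 1;
        }
    }

    public static void main(String[] args) {
        int code = runTask(
            "task_0",
            () -> { throw new AssertionError("simulated task error"); },
            (id, msg) -> System.err.println(id + " failed: " + msg));
        // A real child JVM could pass this code to System.exit(...) so that
        // lingering non-daemon threads cannot keep the process alive
        // (an assumption of this sketch, not a statement about the patch).
        System.out.println("exit code would be " + code);
    }
}
```

Whether the actual change also forces the JVM to exit is not stated in this thread; the sketch only shows why catching Exception alone would let an Error bypass the failure report.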
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Build #59 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Build/59/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)
        svn merge -c r1187654 --ignore-ancestry ../../trunk/

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187655
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #868 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/868/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187654
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Build #47 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/47/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)
        svn merge -c r1187654 --ignore-ancestry ../../trunk/

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187655
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-0.23-Commit #39 (See https://builds.apache.org/job/Hadoop-Mapreduce-0.23-Commit/39/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)
        svn merge -c r1187654 --ignore-ancestry ../../trunk/

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187655
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk-Commit #1150 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Commit/1150/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187654
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Common-0.23-Commit #39 (See https://builds.apache.org/job/Hadoop-Common-0.23-Commit/39/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)
        svn merge -c r1187654 --ignore-ancestry ../../trunk/

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187655
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #1213 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Commit/1213/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187654
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Hdfs-0.23-Commit #40 (See https://builds.apache.org/job/Hadoop-Hdfs-0.23-Commit/40/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)
        svn merge -c r1187654 --ignore-ancestry ../../trunk/

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187655
        Files :

        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Hudson added a comment -

        Integrated in Hadoop-Common-trunk-Commit #1135 (See https://builds.apache.org/job/Hadoop-Common-trunk-Commit/1135/)
        MAPREDUCE-3058. Fixed MR YarnChild to report failure when task throws an error and thus prevent a hanging task and job. (vinodkv)

        vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1187654
        Files :

        • /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapred/YarnChild.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/FailingMapper.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/TestMRJobs.java
        • /hadoop/common/trunk/hadoop-mapreduce-project/src/contrib/gridmix/src/java/org/apache/hadoop/mapred/gridmix/LoadJob.java
        Vinod Kumar Vavilapalli made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Resolution Fixed [ 1 ]
        Vinod Kumar Vavilapalli added a comment -

        I just committed this to trunk and branch-0.23.

        Vinod Kumar Vavilapalli added a comment -

        Thanks for the review, Hitesh. Yes, I thought of having a test like that, but it is way too much effort. The discernible delay in the failure clearly validates the fix for me, so I will stick to that.

        Hitesh Shah added a comment -

        The patch looks good with respect to the change in YarnChild. I verified that the tests show the change in elapsed time with the fix, although it would be preferable to have a test that errors out without this fix (which may not be that trivial to implement).
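As a rough illustration of the kind of check discussed here, a test can bound how long a failing task is allowed to take instead of comparing elapsed times by eye. The sketch below is an in-process stand-in only, assuming a hypothetical helper named HangCheck; it is not TestMRJobs and does not start a MapReduce job, which is where the real difficulty lies.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: distinguishes "failed promptly" from "still hanging".
final class HangCheck {

    /**
     * Returns true if the task finishes, normally or by throwing, within the
     * timeout; returns false if it is still running when the timeout expires.
     */
    static boolean finishesWithin(Runnable task, long timeout, TimeUnit unit)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-check");
            t.setDaemon(true); // a hung task must not keep the checking JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeout, unit);
            return true;
        } catch (ExecutionException e) {
            return true; // the task threw: it failed, but it did terminate
        } catch (TimeoutException e) {
            future.cancel(true);
            return false; // still running: this is the hang a test should flag
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A test built on this helper would assert that a deliberately failing task returns true well inside the timeout, so a regression that hangs turns into a test failure rather than a slow pass.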

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12500186/MAPREDUCE-3058-20111021.txt
        against trunk revision .

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 6 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        -1 findbugs. The patch appears to introduce 160 new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in .

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1102//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1102//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1102//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
        Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1102//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
        Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1102//console

        This message is automatically generated.

        Vinod Kumar Vavilapalli made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Vinod Kumar Vavilapalli made changes -
        Assignee Hitesh Shah [ hitesh ] Vinod Kumar Vavilapalli [ vinodkv ]
        Vinod Kumar Vavilapalli made changes -
        Attachment MAPREDUCE-3058-20111021.txt [ 12500186 ]
        Vinod Kumar Vavilapalli added a comment -

        My earlier suspicion, that a non-daemon thread created by user code was hanging the JVM, also turned out to be wrong. In that case the child eventually heads for shutdown and kills the umbilical client, which in turn makes the Task's reporter thread sys-exit with exit code 65 after about 10 seconds.

        Something else happened in this case; maybe the DFSClient hung and caused the JVM to get stuck.

        Attaching a patch that should fix such cases. For hanging non-daemon threads, it also avoids the 10-second delay, which I confirmed with the test.

        Vinod Kumar Vavilapalli added a comment -

        Okay, this patch is still not going to help. The main bug is that the task code is not reporting to the AM that it got an exception. If the task had reported it, the AM would have known to go ahead with cleaning up this task.

        We still need SIGTERM followed by SIGKILL to properly clean up tasks, but that is not going to solve the current problem. I'll create a separate ticket for that.
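
        To illustrate the point about reporting, here is a minimal sketch of a task wrapper that tells its parent about any failure before returning. It is not the YarnChild code or the committed patch: `FailureSink`, `reportFailure` and `runTask` are hypothetical names invented for this example, standing in for whatever channel the task uses to talk to the AM. It catches `Throwable` on the assumption that errors, not only exceptions, should be reported.

```java
import java.util.ArrayList;
import java.util.List;

public class ReportFailureSketch {

    /** Hypothetical stand-in for the task-to-AM channel; not a real Hadoop interface. */
    public interface FailureSink {
        void reportFailure(String taskId, String diagnostics);
    }

    /**
     * Runs the task body and, if it throws, reports the failure to the sink.
     * Returns true on success and false on failure.
     */
    public static boolean runTask(String taskId, Runnable body, FailureSink sink) {
        try {
            body.run();
            return true;
        } catch (Throwable t) {
            // Catch Throwable so that Errors are reported as well as Exceptions.
            sink.reportFailure(taskId, t.toString());
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> reports = new ArrayList<>();
        FailureSink sink = (id, diag) -> reports.add(id + ": " + diag);

        // Simulate a task that dies part-way through its work.
        boolean ok = runTask("attempt_0", () -> {
            throw new IllegalStateException("Premature EOF");
        }, sink);

        System.out.println("succeeded=" + ok + " reports=" + reports);
    }
}
```

        With a report like this in place, the parent no longer has to infer failure from the process going away, which is the gap described above.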

        Hitesh Shah made changes -
        Attachment MR-3058.wip.patch [ 12500096 ]
        Hitesh Shah added a comment -

        The patch basically changes how containers are killed: it now sends SIGTERM, followed by a delayed SIGKILL.

        Tweaked the handling of getting the pid slightly so that it can still be obtained after the command completes. This does not really imply that we can actually get the pid but just implies we can get to the executor/process object which in turn may allow us to get the pid.

        Requires some testing to see if this addresses the underlying issues.

        @Karam, could you give it a try?
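
        For readers unfamiliar with the pattern, the "SIGTERM, then delayed SIGKILL" sequence can be sketched with the standard `java.lang.Process` API (Java 8 or later). This is only an illustration, not the wip patch: `GracefulKillSketch`, `terminate` and the grace period are made up for the example, and it assumes a Unix-like system where a `sleep` command exists and where `destroy()` and `destroyForcibly()` map to SIGTERM and SIGKILL.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class GracefulKillSketch {

    /**
     * Asks the process to terminate, waits up to graceMillis, then kills it forcibly if needed.
     * Returns true if the process exited within the grace period, false if the forced kill was used.
     */
    public static boolean terminate(Process p, long graceMillis) throws InterruptedException {
        p.destroy(); // polite request; typically SIGTERM on Unix-like systems
        if (p.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
            return true;
        }
        p.destroyForcibly(); // typically SIGKILL on Unix-like systems
        p.waitFor();         // reap the process so it does not linger
        return false;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // "sleep 60" is just a convenient long-running child for the demo.
        Process p = new ProcessBuilder("sleep", "60").start();
        boolean graceful = terminate(p, 2000);
        System.out.println("graceful=" + graceful + " alive=" + p.isAlive());
    }
}
```

        The grace period gives a well-behaved child time to run its shutdown hooks, while the forced kill bounds how long a stuck one can hang around.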

        Amol Kekre added a comment -

        This is blocking gridmix, any eta?

        Arun C Murthy made changes -
        Assignee Vinod Kumar Vavilapalli [ vinodkv ] Hitesh Shah [ hitesh ]
        Arun C Murthy added a comment -

        Hitesh - can you please take a look since you are already working on related areas?

        Currently the TT sends SIGTERM followed by SIGKILL...

        Arun C Murthy made changes -
        Priority Major [ 3 ] Critical [ 2 ]
        Amar Kamat added a comment -

        Vinod,
        Making the Gridmix threads daemon threads might not address the root cause of this issue. Imagine user code spawning off non-daemon threads. I guess the correct fix should force the task's JVM to exit once the task is done and ready to close, no matter what threads are still running. Thoughts?
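
        The "force the JVM to exit" idea can be sketched roughly as follows. This is an assumption-laden illustration rather than anything from the Hadoop code base: `ForcedExitSketch` and `runThenExit` are hypothetical, the exit codes 0 and 1 are arbitrary, and the exit action is passed in as a parameter only so that it can be swapped out when testing.

```java
import java.util.function.IntConsumer;

public class ForcedExitSketch {

    /**
     * Runs the task and then always invokes the exit action, whether the task
     * returned normally (code 0) or threw (code 1). Codes are illustrative only.
     */
    public static void runThenExit(Runnable task, IntConsumer exit) {
        int code = 0;
        try {
            task.run();
        } catch (Throwable t) {
            code = 1;
        } finally {
            // Reached on both paths, so lingering threads cannot keep the JVM alive.
            exit.accept(code);
        }
    }

    public static void main(String[] args) {
        runThenExit(() -> {
            // User code that leaves a non-daemon thread behind.
            Thread lingering = new Thread(() -> {
                try {
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException ignored) {
                    // nothing to clean up
                }
            }, "user-thread");
            lingering.start();
        }, System::exit); // without this explicit exit the JVM would wait on user-thread
    }
}
```

        Because `System.exit` does not wait for non-daemon threads, the process ends even though `user-thread` is still sleeping.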

        Vinod Kumar Vavilapalli made changes -
        Assignee Vinod Kumar Vavilapalli [ vinodkv ]
        Fix Version/s 0.23.0 [ 12315570 ]
        Component/s contrib/gridmix [ 12313086 ]
        Vinod Kumar Vavilapalli made changes -
        Attachment MAPREDUCE-3058-20110923.txt [ 12496226 ]
        Vinod Kumar Vavilapalli added a comment -

        Making LoadJob.StatusReporter a daemon thread, as this was the thread that was hanging the JVM. Also, set ResourceUsageMatcherRunner to be a daemon thread, just in case.

        Amar/Ravi, please see if this is okay. Thanks!
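
        As background on why the daemon flag matters: a JVM exits on its own only once every non-daemon thread has finished, so a background reporter that never returns will keep the process alive unless it is marked as a daemon. The sketch below shows the general technique with plain `java.lang.Thread`; it is not the LoadJob code, and `DaemonReporterSketch` and `startReporter` are names invented for the example.

```java
public class DaemonReporterSketch {

    /** Starts a background thread, marking it as a daemon (or not) before it runs. */
    public static Thread startReporter(Runnable body, boolean daemon) {
        Thread t = new Thread(body, "status-reporter");
        t.setDaemon(daemon); // must be called before start()
        t.start();
        return t;
    }

    public static void main(String[] args) {
        // A reporter loop that never finishes by itself.
        Runnable loop = () -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        };

        // With daemon=true the JVM exits when main returns;
        // with daemon=false this program would never terminate.
        startReporter(loop, true);
        System.out.println("main finished");
    }
}
```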

        Vinod Kumar Vavilapalli added a comment -

        We ran into this again, and it turns out this is a GridMix bug. The task code duly failed because of a datanode issue, but one of the threads in GridMix's job, LoadJob.StatusReporter, is hanging.

        Vinod Kumar Vavilapalli made changes -
        Field: Description
        Original Value:
        While running GridMix V3 one got stuck for 15 hrs.
        After clicking on Job found one of its reduce got stuck.
        Looking at syslog of reducer it found that- :
        Task started at -:
        2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
        2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started

        While end of syslog says -:
        2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as <DATANODE1>
        2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in pipeline <DATANODE2>, <DATANODE1>: bad datanode <DATANODE1>
        2011-09-19 18:06:49,818 DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: lastAckedSeqno = 26870
        2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #454
        2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <<NAMENODE> from gridperf got value #454
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: getAdditionalDatanode 8
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to datanode <DATANODE2>
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 131071
        2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
        java.io.EOFException: Premature EOF: no length prefix available
                at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
        2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException: Premature EOF: no length prefix available
                at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)

        2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
        2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #455
        2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
        2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
        2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #456
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf got value #456
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #457
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: reportDiagnosticInfo 1
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
        2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
        2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source UgiMetrics
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.source.JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.

        Which means tasks is stopped withing 20 secs
        Whereas acutally tasks kept on running for more 15 hrs.
        From AM log. also found tasks was sending its update regularly
        ps -ef | grep java was also showing task is running
        New Value:
        While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking on the Job-page, found one of its reduces to be stuck. Looking at syslog of the stuck reducer, found this:
        Task-logs' head:

        {code}
        2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
        2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
        {code}

        Task-logs' tail:
        {code}
        2011-09-19 18:06:49,818 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as <DATANODE1>
        2011-09-19 18:06:49,818 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-1405370709-<NAMENODE>-1316452621953:blk_-7004355226367468317_79871 in pipeline <DATANODE2>, <DATANODE1>: bad datanode <DATANODE1>
        2011-09-19 18:06:49,818 DEBUG org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtocol: lastAckedSeqno = 26870
        2011-09-19 18:06:49,820 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #454
        2011-09-19 18:06:49,826 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <<NAMENODE> from gridperf got value #454
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.ipc.RPC: Call: getAdditionalDatanode 8
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Connecting to datanode <DATANODE2>
        2011-09-19 18:06:49,827 DEBUG org.apache.hadoop.hdfs.DFSClient: Send buf size 131071
        2011-09-19 18:06:49,833 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
        java.io.EOFException: Premature EOF: no length prefix available
                at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)
        2011-09-19 18:06:49,837 WARN org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.EOFException: Premature EOF: no length prefix available
                at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:158)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:860)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:929)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:740)
                at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:415)

        2011-09-19 18:06:49,837 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #455
        2011-09-19 18:06:49,839 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #455
        2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
        2011-09-19 18:06:49,840 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
        2011-09-19 18:06:49,840 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf sending #456
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <NAMENODE> from gridperf got value #456
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.RPC: Call: delete 18
        2011-09-19 18:06:49,858 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 sending #457
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.Client: IPC Client (26613121) connection to <APPMASTER> from job_1316452677984_0862 got value #457
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.ipc.RPC: Call: reportDiagnosticInfo 1
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: refCount=1
        2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ReduceTask metrics system...
        2011-09-19 18:06:49,859 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source UgiMetrics
        2011-09-19 18:06:49,859 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.lib.MetricsSourceBuilder$1
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=UgiMetrics
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping metrics source JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.impl.MetricsSystemImpl: class org.apache.hadoop.metrics2.source.JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=JvmMetrics
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Stats
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system stopped.
        2011-09-19 18:06:49,860 DEBUG org.apache.hadoop.metrics2.util.MBeans: Unregistering Hadoop:service=ReduceTask,name=MetricsSystem,sub=Control
        2011-09-19 18:06:49,860 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system shutdown complete.
        {code}

        Which means that the task was supposed to have stopped within 20 secs, whereas the process itself was stuck for more than 15 hours. From the AM log, also found that this task was sending its updates regularly. ps -ef | grep java also showed that the process was still alive.
        Vinod Kumar Vavilapalli added a comment -

        Yes Mahadev, this was without speculative execution.

        I requested Karam to file this to look at any reduce-task side shutdown bug.

        Mahadev konar added a comment (edited) -

        @karam,
        Is speculative execution turned off with all the tests?

        Karam Singh created issue -

          People

          • Assignee:
            Vinod Kumar Vavilapalli
            Reporter:
            Karam Singh
          • Votes:
            0
            Watchers:
            4
