Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-1763

Too many lost task trackers - Job failures

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • None
    • None
    • None
    • None
    • Version: 0.15.0-dev, r565628
      Compiled: Tue Aug 14 20:55:37 UTC 2007 by hadoopqa
      1400 Node cluster

    Description

      Steps to reproduce:
      1 .Run a map reduce application running more than 3000 mappers, each running longer than
      2. Observe the lost task trackers.

      Observations:
      1. Most of the lost taskTracker messages correspond to maps that have already completed
      2. Based on the logs below the taskTracker is unable to connect to the job tracker and so the jobTracker deletes the job after 20 minutes

      One example:
      task_200708210155_0003_m_000000_0 node1 KILLED 0.00% 21-Aug-2007 09:39:09 Lost task tracker <-- Please note the time

      Counters:
      Map-Reduce Framework
      Map input records 28,861
      Map output records 1,349,114
      Map input bytes 200,018,562
      Map output bytes 714,878,712

      Node 1 task tracker logs:
      2007-08-21 09:08:51,109 INFO org.apache.hadoop.mapred.TaskTracker: Task task_200708210155_0003_m_000000_0 is done. <-- Please note the time
      .
      .
      .
      2007-08-21 09:08:52,212 INFO org.mortbay.http.SocketListener: LOW ON THREADS ((40-40+0)<1) on SocketListener0@0.0.0.0:50060
      2007-08-21 09:08:52,217 WARN org.mortbay.http.SocketListener: OUT OF THREADS: SocketListener0@0.0.0.0:50060
      .
      .
      .
      2007-08-21 09:18:53,877 ERROR org.apache.hadoop.mapred.TaskTracker: Caught exception: java.net.SocketTimeoutException: timed out waiting for rpc response
      at org.apache.hadoop.ipc.Client.call(Client.java:472)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
      at org.apache.hadoop.mapred.$Proxy0.heartbeat(Unknown Source)
      at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:941)
      at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:840)
      at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1227)
      at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1911)
      .
      .
      .
      2007-08-21 09:47:45,207 INFO org.apache.hadoop.mapred.TaskTracker: Resending 'status' to 'wm501219' with reponseId '5247
      2007-08-21 09:47:46,023 INFO org.apache.hadoop.mapred.TaskTracker: Recieved RenitTrackerAction from JobTracker
      2007-08-21 09:47:46,041 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_000000_0 done; removing files.
      2007-08-21 09:47:46,240 INFO org.apache.hadoop.mapred.TaskRunner: task_200708210155_0003_m_002237_0 done; removing files.

      Tasktracker is pretty active otherwise:
      tracker_wm511293.inktomisearch.com:50050 wm511293.inktomisearch.com 1 6 3

      JobTracker logs:
      2007-08-21 09:01:11,951 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_0' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511293.inktomisearch.com:50050'
      .
      2007-08-21 09:06:27,745 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200708210155_0003_m_000000_1' to tip tip_200708210155_0003_m_000000, for tracker 'tracker_wm511783.inktomisearch.com:50050'
      .
      2007-08-21 09:08:51,212 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200708210155_0003_m_000000_0' has completed tip_200708210155_0003_m_000000 successfully.
      2007-08-21 09:08:51,213 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has completed succesfully
      .
      2007-08-21 09:11:27,227 INFO org.apache.hadoop.mapred.TaskInProgress: Already complete TIP tip_200708210155_0003_m_000000 has completed task task_200708210155_0003_m_000000_1
      .
      2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_0: Lost task tracker
      2007-08-21 09:39:09,014 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200708210155_0003_m_000000_0' has been lost.
      .
      2007-08-21 09:39:09,348 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200708210155_0003_m_000000_0' from 'tracker_wm511293.inktomisearch.com:50050'
      .
      2007-08-21 09:47:20,855 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200708210155_0003_m_000000_1: Lost task tracker
      2007-08-21 09:47:20,855 WARN org.apache.hadoop.mapred.TaskInProgress: Recieved duplicate status update of 'KILLED' for 'task_200708210155_0003_m_000000_1' of TIP 'tip_200708210155_0003_m_000000'

      Notes:
      1. I do not see the taskTracker dying during that period
      2. Is retry logic not accurate/agressive enough? (did something change recently, this behavior is more evident in 0.15)
      3. Inconsistencies with jobTracker logs? Lost task tracker detection bad?
      4. TaskTracker:
      CPU usage: 9:10-9:20 50%
      9:20-9:40 0%
      Network Usage: 6M incl dfs operations
      5. JobTracker
      CPU udage: Avg: 9%
      Network Usage: Negligible

      Attachments

        1. 1763.patch
          1 kB
          Devaraj Das
        2. 1763.try.patch
          0.8 kB
          Devaraj Das

        Issue Links

          Activity

            People

              ddas Devaraj Das
              srikantk Srikanth Kakani
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: