Hadoop Common / HADOOP-147

MapTask removes map output files before the reduce tasks copy them


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

    Description

      I was running a job on a cluster of 138 nodes. The job had 1050 map tasks and 128 reduce tasks, and it got stuck at the reduce stage.
      All the reduce tasks were trying to copy files from one map task, with the following status shown on the web interface:

      reduce > copy > task_m_ehz5q1@node1262.foo.com:60040

      However, the log on the machine node1262 (where the map task task_m_ehz5q1 ran) showed that the map task had finished, and its output
      files had been removed, before the reduce tasks copied them:

      060417 103554 Server connection on port 60050 from 72.30.117.220: starting
      060417 103554 task_m_ehz5q1 Client connection to 0.0.0.0:60050: starting
      060417 103554 task_m_ehz5q1 1.0% /user/runping/runping/proj/part-00039:0+71
      060417 103554 Task task_m_ehz5q1 is done.
      060417 103554 parsing file:/local/hadoop/conf2/hadoop-default.xml

      ......................

      060417 103613 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 103623 task_m_ehz5q1 done; removing files.
      060417 103633 parsing file:/local/hadoop/conf2/hadoop-default.xml
      060417 103633 parsing file:/local/hadoop/conf2/mapred-default.xml
      060417 103633 parsing file:/local/hadoop/conf2/hadoop-site.xml

      ...........................................

      060417 190241 SEVERE Can't open map output:/local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
      at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
      at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
      at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
      at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
      060417 190241 Unknown child with bad map output: task_m_ehz5q1. Ignored.
      060417 190241 Server handler 2 on 60040 caught: java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
      at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
      at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
      at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
      at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
      060417 190241 parsing file:/local/hadoop/conf2/hadoop-default.xml
      060417 190241 parsing file:/local/hadoop/conf2/mapred-default.xml

      The above exceptions repeated for many other reduce tasks (I am not sure whether this was true for all of them).

      I noticed another strange thing in the logs.

      In another machine's log, I saw:

      060417 190528 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 190528 task_r_24d8k4 copy failed: task_m_ehz5q1 from node1262.foo.com/72.30.117.220:60040
      java.io.IOException: timed out waiting for response
      at org.apache.hadoop.ipc.Client.call(Client.java:305)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
      at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
      at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:110)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
      060417 190528 task_r_24d8k4 0.11523809% reduce > copy > task_m_ehz5q1@node1262.foo.com:60040
      060417 190528 task_r_24d8k4 Copying task_m_epatk8 output from node1387.foo.com.

      This is expected. However, before this line,
      I saw another copy activity in the log:

      060417 103608 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 103608 task_r_a4yl3t Copying task_m_ehz5q1 output from node1262.foo.com.
      060417 103608 parsing file:/local/hadoop/conf2/hadoop-default.xml

      According to the web interface, the task task_r_a4yl3t does not belong
      to the job in question. That is strange.

      I checked a few other machines where reduce tasks ran,
      and saw the same thing.

      I suspect there was a conflict in job IDs. If two jobs had the same ID,
      then when one closed it may have marked the other as "closed" too, triggering its map tasks
      to clean up prematurely, as the sketch below illustrates.
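
      To make the suspected mechanism concrete, here is a minimal hypothetical
      sketch (not the actual TaskTracker code) of cleanup keyed only by job ID:
      if two distinct jobs are ever assigned the same ID, closing the first one
      also removes the map outputs that the second job's reducers still need.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Hypothetical illustration of cleanup keyed only by job ID.
      public class TaskOutputCleaner {
          // Maps a job ID to the local map-output directories of its tasks.
          private final Map<String, List<String>> outputsByJob = new HashMap<>();

          public void registerMapOutput(String jobId, String outputDir) {
              outputsByJob.computeIfAbsent(jobId, k -> new ArrayList<>()).add(outputDir);
          }

          // Called when a job is marked closed. With colliding IDs, this
          // deletes the outputs of *both* jobs ("done; removing files."),
          // leaving the surviving job's reducers with FileNotFoundException.
          public void jobClosed(String jobId) {
              List<String> dirs = outputsByJob.remove(jobId);
              if (dirs != null) {
                  for (String dir : dirs) {
                      deleteRecursively(dir);
                  }
              }
          }

          private void deleteRecursively(String dir) {
              // Placeholder: actual deletion of the local directory.
          }
      }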

      A simple way to avoid potential job ID conflicts is to use sequential numbers, as in the sketch below.
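
      Here is a minimal sketch of that suggestion (hypothetical code, not the
      actual JobTracker implementation): a shared counter guarantees that two
      jobs submitted to the same tracker never receive the same ID, and a
      start-up timestamp keeps IDs distinct across tracker restarts.

      import java.util.concurrent.atomic.AtomicInteger;

      // Hypothetical sketch: hand out job IDs from a monotonically increasing
      // counter instead of a randomly generated string, so IDs cannot collide.
      public class SequentialJobIdGenerator {
          // Counter shared by all job submissions to this tracker instance.
          private final AtomicInteger nextId = new AtomicInteger(0);
          // Tracker start time, kept so IDs remain unique across restarts.
          private final long startTime = System.currentTimeMillis();

          // Returns an ID such as "job_1145266554000_0001".
          public String newJobId() {
              return String.format("job_%d_%04d", startTime, nextId.incrementAndGet());
          }
      }

      With randomly generated IDs a collision is merely unlikely; with a
      counter it is impossible within one tracker's lifetime.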

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Runping Qi (runping)
            Votes: 0
            Watchers: 0

