Hadoop Common / HADOOP-147

MapTask removes map output files before the reduce tasks copy them


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.2.0
    • Component/s: None
    • Labels: None

    Description

      I was running a job on a cluster of 138 nodes. The job had 1050 map tasks and 128 reduce tasks, and it got stuck at the reduce stage.
      All the reduce tasks were trying to copy files from one map task, with the following status shown on the web interface:

      reduce > copy > task_m_ehz5q1@node1262.foo.com:60040

      However, the log on the machine node1262 (where the map task task_m_ehz5q1 ran) showed that the map task had finished, and its output
      files had been removed, before the reduce tasks copied them:

      060417 103554 Server connection on port 60050 from 72.30.117.220: starting
      060417 103554 task_m_ehz5q1 Client connection to 0.0.0.0:60050: starting
      060417 103554 task_m_ehz5q1 1.0% /user/runping/runping/proj/part-00039:0+71
      060417 103554 Task task_m_ehz5q1 is done.
      060417 103554 parsing file:/local/hadoop/conf2/hadoop-default.xml

      ......................

      060417 103613 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 103623 task_m_ehz5q1 done; removing files.
      060417 103633 parsing file:/local/hadoop/conf2/hadoop-default.xml
      060417 103633 parsing file:/local/hadoop/conf2/mapred-default.xml
      060417 103633 parsing file:/local/hadoop/conf2/hadoop-site.xml

      ...........................................

      060417 190241 SEVERE Can't open map output:/local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
      at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
      at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
      at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
      at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
      060417 190241 Unknown child with bad map output: task_m_ehz5q1. Ignored.
      060417 190241 Server handler 2 on 60040 caught: java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      java.io.FileNotFoundException: /local/hadoop/mapred/local/task_m_ehz5q1/part-32.out
      at org.apache.hadoop.fs.LocalFileSystem.openRaw(LocalFileSystem.java:115)
      at org.apache.hadoop.fs.FSDataInputStream$Checker.<init>(FSDataInputStream.java:46)
      at org.apache.hadoop.fs.FSDataInputStream.<init>(FSDataInputStream.java:228)
      at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:154)
      at org.apache.hadoop.mapred.MapOutputFile.write(MapOutputFile.java:116)
      at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:117)
      at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:64)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:231)
      060417 190241 parsing file:/local/hadoop/conf2/hadoop-default.xml
      060417 190241 parsing file:/local/hadoop/conf2/mapred-default.xml

      The above exceptions repeated for many other reduce tasks (I am not sure whether this was true for all of them).

      I noticed another strange thing in the logs.

      In another machine's log, I saw:

      060417 190528 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 190528 task_r_24d8k4 copy failed: task_m_ehz5q1 from node1262.foo.com/72.30.117.220:60040
      java.io.IOException: timed out waiting for response
      at org.apache.hadoop.ipc.Client.call(Client.java:305)
      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141)
      at org.apache.hadoop.mapred.$Proxy2.getFile(Unknown Source)
      at org.apache.hadoop.mapred.ReduceTaskRunner.prepare(ReduceTaskRunner.java:110)
      at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:66)
      060417 190528 task_r_24d8k4 0.11523809% reduce > copy > task_m_ehz5q1@node1262.foo.com:60040
      060417 190528 task_r_24d8k4 Copying task_m_epatk8 output from node1387.foo.com.

      This is expected. However, before this line,
      I saw another copy activity in the log:

      060417 103608 parsing file:/local/hadoop/conf2/hadoop-site.xml
      060417 103608 task_r_a4yl3t Copying task_m_ehz5q1 output from node1262.foo.com.
      060417 103608 parsing file:/local/hadoop/conf2/hadoop-default.xml

      According to the web interface, the task task_r_a4yl3t does not belong
      to the job in question. That is strange.

      I checked a few other machines where reduce tasks ran,
      and saw the same thing.

      I suspect there was a conflict in job IDs. If two jobs had the same ID,
      then when one closed it may have marked the other as "closed" too, triggering its map tasks
      to clean up prematurely, as the sketch below illustrates.
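
      To make the suspected mechanism concrete, here is a minimal hypothetical
      sketch (not the actual TaskTracker code) of cleanup keyed only by job ID:
      if two distinct jobs are ever assigned the same ID, closing the first one
      also removes the map outputs that the second job's reducers still need.

      import java.util.ArrayList;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      // Hypothetical illustration of cleanup keyed only by job ID.
      public class TaskOutputCleaner {
          // Maps a job ID to the local map-output directories of its tasks.
          private final Map<String, List<String>> outputsByJob = new HashMap<>();

          public void registerMapOutput(String jobId, String outputDir) {
              outputsByJob.computeIfAbsent(jobId, k -> new ArrayList<>()).add(outputDir);
          }

          // Called when a job is marked closed. With colliding IDs, this
          // deletes the outputs of *both* jobs ("done; removing files."),
          // leaving the surviving job's reducers with FileNotFoundException.
          public void jobClosed(String jobId) {
              List<String> dirs = outputsByJob.remove(jobId);
              if (dirs != null) {
                  for (String dir : dirs) {
                      deleteRecursively(dir);
                  }
              }
          }

          private void deleteRecursively(String dir) {
              // Placeholder: actual deletion of the local directory.
          }
      }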

      A simple way to avoid potential job ID conflicts is to use sequential numbers, as in the sketch below.
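
      Here is a minimal sketch of that suggestion (hypothetical code, not the
      actual JobTracker implementation): a shared counter guarantees that two
      jobs submitted to the same tracker never receive the same ID, and a
      start-up timestamp keeps IDs distinct across tracker restarts.

      import java.util.concurrent.atomic.AtomicInteger;

      // Hypothetical sketch: hand out job IDs from a monotonically increasing
      // counter instead of a randomly generated string, so IDs cannot collide.
      public class SequentialJobIdGenerator {
          // Counter shared by all job submissions to this tracker instance.
          private final AtomicInteger nextId = new AtomicInteger(0);
          // Tracker start time, kept so IDs remain unique across restarts.
          private final long startTime = System.currentTimeMillis();

          // Returns an ID such as "job_1145266554000_0001".
          public String newJobId() {
              return String.format("job_%d_%04d", startTime, nextId.incrementAndGet());
          }
      }

      With randomly generated IDs a collision is merely unlikely; with a
      counter it is impossible within one tracker's lifetime.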

      Attachments

        Activity


          People

            Assignee: Unassigned
            Reporter: Runping Qi (runping)
            Votes: 0
            Watchers: 0

