[HADOOP-1087] Reducer hangs pulling from incorrect file.out.index path (when one of the mapred.local.dir entries is not accessible at map time but becomes available later at reduce time) - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.10.1
Fix Version/s: None
Component/s: None
Labels: None

Description

2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Server returned HTTP response code: 500 for URL: http://____:____/mapOutput?map=task_7810_m_000897_0&reduce=397
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1149)
at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:121)
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:236)
at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:199)
2007-03-07 23:14:23,431 WARN org.apache.hadoop.mapred.TaskRunner: task_7810_r_000397_0 adding host ____.com to penalty box, next contact in 279 seconds

This happened when one of the drives was full and not accessible at map time.

In one mapper, the following call

public void mergeParts() throws IOException {
...
Path finalIndexFile = mapOutputFile.getOutputIndexFile(getTaskId());

failed on the first hashed entry in mapred.local.dir and fell back to the second entry, so the index file was written there.

Afterwards, the first directory entry became available again, and when the reducer tried to pull the map output through
public static class MapOutputServlet extends HttpServlet {
...
Path indexFileName = conf.getLocalPath(mapId+"/file.out.index");

it used the first entry.

As a result, the directory it looked in was empty, and the reducer kept trying to pull from the incorrect path and hung.

(I wasn't sure whether this is a duplicate of HADOOP-895, since it is not reproducible unless I get a disk failure.)
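
The mismatch described above can be illustrated with a small in-memory model. This is only a sketch and not Hadoop code: `LocalDirModel` and all of its methods are hypothetical names, and the rule "hash the relative path, then skip directories that are currently unavailable" is an assumption based solely on the description in this report. It shows that a writer and a reader that each re-derive the directory at their own point in time can disagree, and that a reader that probes every configured directory for the existing file does not.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy in-memory model of multi-directory local storage. Hypothetical; not Hadoop code. */
final class LocalDirModel {
    private final List<String> dirs;                          // stands in for the mapred.local.dir entries
    private final Set<String> unavailable = new HashSet<>();  // directories currently full or offline
    private final Set<String> files = new HashSet<>();        // full paths of files that exist

    LocalDirModel(List<String> dirs) {
        if (dirs.isEmpty()) throw new IllegalArgumentException("need at least one directory");
        this.dirs = new ArrayList<>(dirs);
    }

    void setAvailable(String dir, boolean available) {
        if (available) unavailable.remove(dir); else unavailable.add(dir);
    }

    /** Directory the hash of the relative path points at (assumed selection rule). */
    String preferredDir(String relPath) {
        return dirs.get(Math.floorMod(relPath.hashCode(), dirs.size()));
    }

    /** First directory, in hash order, that is available right now. */
    private String pick(String relPath) {
        int start = Math.floorMod(relPath.hashCode(), dirs.size());
        for (int i = 0; i < dirs.size(); i++) {
            String dir = dirs.get((start + i) % dirs.size());
            if (!unavailable.contains(dir)) return dir;
        }
        throw new IllegalStateException("no local directory available");
    }

    /** Map side: create the file in whichever directory is usable at write time. */
    String write(String relPath) {
        String full = pick(relPath) + "/" + relPath;
        files.add(full);
        return full;
    }

    /** Serving side as this report describes it: re-derive the path without checking the file exists. */
    String naiveLookup(String relPath) {
        return pick(relPath) + "/" + relPath;
    }

    /** Alternative: probe every configured directory and return the path that actually holds the file. */
    Optional<String> probingLookup(String relPath) {
        for (String dir : dirs) {
            String full = dir + "/" + relPath;
            if (files.contains(full)) return Optional.of(full);
        }
        return Optional.empty();
    }

    boolean exists(String fullPath) { return files.contains(fullPath); }

    public static void main(String[] args) {
        LocalDirModel model = new LocalDirModel(List.of("/disk1", "/disk2"));
        String rel = "task_7810_m_000897_0/file.out.index";
        String bad = model.preferredDir(rel);
        model.setAvailable(bad, false);   // disk is full at map time
        String written = model.write(rel); // writer falls back to the other disk
        model.setAvailable(bad, true);    // disk recovers before reduce time
        System.out.println("written to:     " + written);
        System.out.println("naive lookup:   " + model.naiveLookup(rel)
                + " exists=" + model.exists(model.naiveLookup(rel)));
        System.out.println("probing lookup: " + model.probingLookup(rel).orElse("not found"));
    }
}
```

In this model the naive lookup returns a path under the recovered disk, where nothing was ever written, while the probing lookup finds the file on the fallback disk. Whether probing, or recording the chosen directory at write time, is the right remedy for the real code is a separate design question that this sketch does not settle.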

People

Assignee: Unassigned

Reporter: Koji Noguchi

Votes: 0

Watchers: 0

Dates

Created: 07/Mar/07 23:58

Updated: 08/Jul/09 16:52

Resolved: 07/May/07 12:27