Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Ignoring IOExceptions during the fetching of map outputs in MapOutputLocation.java:getFile (e.g. when the content-length doesn't match the actual data received) leads to hung reduces, since the MapOutputCopier puts the host in the penalty box and retries forever.
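As a rough sketch of the kind of check described above, the copier could surface an IOException on a content-length mismatch instead of swallowing it, so the caller can tell a bad fetch from a good one. The class and method names here (FetchCheck, readFully) are illustrative, not the actual MapOutputLocation code:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: read exactly the advertised number of bytes and
// propagate an IOException if the stream falls short, rather than ignoring it.
public class FetchCheck {
    /** Reads exactly expectedLength bytes; throws if the stream ends early. */
    public static byte[] readFully(InputStream in, int expectedLength)
            throws IOException {
        byte[] buf = new byte[expectedLength];
        int off = 0;
        while (off < expectedLength) {
            int n = in.read(buf, off, expectedLength - off);
            if (n < 0) {
                // Content-length mismatch: fail loudly instead of silently.
                throw new IOException("Incomplete map output: expected "
                        + expectedLength + " bytes, got " + off);
            }
            off += n;
        }
        return buf;
    }
}
```

With such an exception propagated, the reduce can react to the failed fetch explicitly instead of looping in the penalty box.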
Possible steps:
a) Distinguish between a failure to fetch an output vs. a lost map. (related to HADOOP-1158)
b) Ensure the reduce doesn't keep fetching from 'lost maps'. (related to HADOOP-1183)
c) On detecting a 'failure to fetch', we should probably use exponential back-offs (versus the constant-order back-offs used currently) for hosts in the 'penalty box'.
d) If fetches still fail, say, 4 times (after the exponential back-offs), we should declare the reduce 'failed'.
This situation could also arise from a full disk on the reducer, which makes it impossible to save the map output to local disk (say, for large map outputs).
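Steps (c) and (d) above could be sketched as a small per-host counter: the back-off doubles on each failed fetch, and after a fixed number of failures the reduce is declared failed. The names (PenaltyBox, BASE_DELAY_MS, MAX_FETCH_FAILURES) and the threshold of 4 are assumptions for illustration, not existing Hadoop identifiers:

```java
// Hypothetical sketch of exponential back-off for a host in the 'penalty box'.
public class PenaltyBox {
    private static final long BASE_DELAY_MS = 1000;   // initial back-off
    private static final int MAX_FETCH_FAILURES = 4;  // step (d) threshold

    private int failures = 0;

    /** Records a failed fetch; returns true when the reduce should be failed. */
    public boolean recordFailure() {
        failures++;
        return failures >= MAX_FETCH_FAILURES;
    }

    /** Delay before the next fetch attempt: 2s, 4s, 8s, ... (capped). */
    public long nextDelayMs() {
        // Doubles per failure instead of a constant-order delay.
        return BASE_DELAY_MS << Math.min(failures, 10);
    }
}
```

The cap on the shift keeps the delay bounded so a long-lived penalty-box entry never overflows or waits unreasonably long.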
Thoughts?
Issue Links
- is related to:
  - HADOOP-1183: MapTask completion not recorded properly at the Reducer's end (Closed)
  - HADOOP-1158: JobTracker should collect statistics of failed map output fetches, and take decisions to reexecute map tasks and/or restart the (possibly faulty) Jetty server on the TaskTracker (Closed)