Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Description
In 1.0, a number of conditions cause a reducer to commit suicide and exit. These include the reducer being stalled and the error percentage of total fetches being too high. In the new code, a reducer commits suicide only when the total number of failures for a single task attempt is >= max(30, totalMaps/10). In the best case, with the quadratic back-off, it would take 20.5 hours for a single map attempt to reach 30 failures. Unless there is only one reducer running, the map task would have been restarted before then.
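To make the threshold concrete, here is a minimal sketch of the check as described above. The class and method names are hypothetical and are not taken from the Hadoop source, and integer division for totalMaps/10 is an assumption.

```java
// Hypothetical illustration of the exit condition described in this issue.
// Names are invented for the example; they are not Hadoop identifiers.
public class SuicideThreshold {

    // Failures a single map attempt must accumulate before the reducer exits:
    // max(30, totalMaps/10). Integer division is assumed here.
    static int failureThreshold(int totalMaps) {
        return Math.max(30, totalMaps / 10);
    }

    // True when the reducer would exit under the new code's single condition.
    static boolean shouldExit(int failuresForAttempt, int totalMaps) {
        return failuresForAttempt >= failureThreshold(totalMaps);
    }

    public static void main(String[] args) {
        System.out.println(failureThreshold(100));   // 30
        System.out.println(failureThreshold(1000));  // 100
        System.out.println(shouldExit(29, 100));     // false
    }
}
```

Under this sketch the threshold never drops below 30, so even a job with very few maps needs 30 failures against a single attempt before the reducer exits.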
We should restore the reducer suicide checks that are in 1.0.