Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 2.6.0
- None
- Reviewed
Description
If a reducer encounters an error while fetching from a node and then hits a read timeout while trying to re-establish the connection, the reducer can fail. The read timeout exception can leak to the top of the Fetcher thread, which causes the reduce task to tear down. This type of error can repeat across reducer attempts, causing jobs to fail due to a single bad node.
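The failure mode can be illustrated with a minimal sketch. It is not the Fetcher code or the committed patch. `FetchRetrySketch`, `Connector`, and `reconnectOrReport` are hypothetical names introduced only for illustration. The sketch assumes that the reconnect attempt is the call that throws `java.net.SocketTimeoutException`, and it shows one way to contain that exception: catch it as an `IOException` at the reconnect site and record the host as failed, so the exception never reaches the top of the thread.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Hadoop's Fetcher.
class FetchRetrySketch {

    /** Hypothetical stand-in for re-establishing a shuffle connection to a host. */
    interface Connector {
        void connect(String host) throws IOException;
    }

    /**
     * Tries to reconnect to the host. On any IOException, including a read
     * timeout, the host is added to failedHosts and false is returned
     * instead of letting the exception propagate to the calling thread.
     */
    static boolean reconnectOrReport(String host, Connector connector,
                                     List<String> failedHosts) {
        try {
            connector.connect(host);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException is a subclass of IOException
            // (via InterruptedIOException), so a read timeout lands here.
            failedHosts.add(host);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> failed = new ArrayList<>();
        // Simulated bad node: every reconnect attempt times out.
        Connector timesOut = host -> {
            throw new SocketTimeoutException("Read timed out");
        };
        // Simulated healthy node: reconnect succeeds.
        Connector healthy = host -> { };

        System.out.println(reconnectOrReport("bad-node", timesOut, failed));  // false
        System.out.println(reconnectOrReport("good-node", healthy, failed));  // true
        System.out.println(failed);                                           // [bad-node]
    }
}
```

In this sketch the caller decides what to do with a failed host (for example, retry later or fetch from another node). Without the `catch`, the timeout would propagate out of the thread's run loop, which is the behavior the description reports.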
Attachments
Issue Links
- breaks: MAPREDUCE-6957 shuffle hangs after a node manager connection timeout (Resolved)
- is broken by: MAPREDUCE-5891 Improved shuffle error handling across NM restarts (Closed)