Details
- Improvement
- Status: Closed
- Critical
- Resolution: Fixed
- 0.23.7
- None
- Reviewed
Description
We've seen many hangs during MR-AM recovery because the task state machines don't always end up in the same states after recovery. This is especially true when speculative execution is enabled. It should be straightforward to restore task and task attempt states directly from the TaskInfo and TaskAttemptInfo records in the job history file, rather than relying on the task state machines ending up in the proper states with the proper number of attempts.
This should be a more robust solution, and it would also give us the option of recovering the start time and log locations for tasks that were in progress when the AM crashed.
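To illustrate the idea, here is a minimal sketch of restoring a task's state straight from parsed history records instead of replaying events through a state machine. Everything in it is an assumption made for illustration: `TaskRecord`, `AttemptRecord`, `RecoveredTask`, and `HistoryRecovery` are hypothetical stand-ins, not the real TaskInfo/TaskAttemptInfo classes, and the rule that a task without a successful attempt is marked `NEEDS_RERUN` is only one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a task attempt record read from the job history file.
final class AttemptRecord {
    final String attemptId;
    final String status;   // "SUCCEEDED", "FAILED", "KILLED", or null if no finish event was logged
    final long startTime;

    AttemptRecord(String attemptId, String status, long startTime) {
        this.attemptId = attemptId;
        this.status = status;
        this.startTime = startTime;
    }
}

// Hypothetical stand-in for a task record and its attempts.
final class TaskRecord {
    final String taskId;
    final List<AttemptRecord> attempts = new ArrayList<>();

    TaskRecord(String taskId) {
        this.taskId = taskId;
    }
}

// The state that recovery hands back to the AM for one task.
final class RecoveredTask {
    final String taskId;
    final String state;
    final int attemptCount;
    final long earliestStart;   // -1 if the task had no attempts

    RecoveredTask(String taskId, String state, int attemptCount, long earliestStart) {
        this.taskId = taskId;
        this.state = state;
        this.attemptCount = attemptCount;
        this.earliestStart = earliestStart;
    }
}

final class HistoryRecovery {
    // Derive the task state directly from the recorded attempts.
    // Assumed policy: any successful attempt means the task succeeded;
    // otherwise the task must be rerun.
    static RecoveredTask restore(TaskRecord record) {
        boolean succeeded = false;
        long earliest = Long.MAX_VALUE;
        for (AttemptRecord attempt : record.attempts) {
            if ("SUCCEEDED".equals(attempt.status)) {
                succeeded = true;
            }
            earliest = Math.min(earliest, attempt.startTime);
        }
        String state = succeeded ? "SUCCEEDED" : "NEEDS_RERUN";
        long start = record.attempts.isEmpty() ? -1L : earliest;
        return new RecoveredTask(record.taskId, state, record.attempts.size(), start);
    }
}
```

Because the attempt count and start time come from the records themselves, a task with a killed speculative attempt and a successful one is restored with both attempts accounted for, regardless of the order in which the original events arrived.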
Attachments
Issue Links
- breaks
  - MAPREDUCE-5468 AM recovery does not work for map only jobs (Closed)
- is duplicated by
  - MAPREDUCE-5869 Wrong date and and time on job tracker page (Resolved)
- relates to
  - MAPREDUCE-4992 AM hangs in RecoveryService when recovering tasks with speculative attempts (Closed)
  - MAPREDUCE-5003 AM recovery should recreate records for attempts that were incomplete (Patch Available)