Details
- Bug
- Status: Closed
- Blocker
- Resolution: Fixed
- 0.23.3, 2.0.1-alpha
- None
Description
If the AM reports the final job status to the client but then crashes before unregistering with the RM, the RM can launch another AM attempt. AM re-attempts currently assume that previous attempts did not reach a final job state, so the job reruns (from scratch, if the output format doesn't support recovery).
Re-running a job after the client has already been told its final status is bad for several reasons. If the job failed, it is confusing at best: the client was told the job failed, yet the subsequent attempt could succeed. If the job succeeded, there could be data loss: a subsequent job launched by the client may try to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
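One way to picture the missing check is the sketch below. It is only an illustration under stated assumptions, not the actual fix or any real MR AM API: the class `AttemptRecoverySketch`, the marker file names, and the use of a local directory in place of the job staging directory are all hypothetical. The idea is that an attempt persists a marker before it reports final status to the client, and a later attempt looks for that marker before deciding to rerun the job.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: none of these names come from the MR AM code base.
public class AttemptRecoverySketch {
    enum PriorOutcome { NONE, SUCCEEDED, FAILED }

    // Marker file names are assumptions made for this example.
    static final String SUCCESS_MARKER = "JOB_SUCCEEDED";
    static final String FAILURE_MARKER = "JOB_FAILED";

    // An attempt would call this before telling the client the final status,
    // so the outcome survives a crash that happens before unregistering.
    static void recordFinalState(Path stagingDir, boolean succeeded) throws IOException {
        Path marker = stagingDir.resolve(succeeded ? SUCCESS_MARKER : FAILURE_MARKER);
        Files.createFile(marker);
    }

    // A re-attempt would call this at startup instead of assuming no final state.
    static PriorOutcome checkPriorOutcome(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS_MARKER))) {
            return PriorOutcome.SUCCEEDED;
        }
        if (Files.exists(stagingDir.resolve(FAILURE_MARKER))) {
            return PriorOutcome.FAILED;
        }
        return PriorOutcome.NONE;
    }

    // Rerun the job only when no earlier attempt reached a final state.
    static boolean shouldRunJob(Path stagingDir) {
        return checkPriorOutcome(stagingDir) == PriorOutcome.NONE;
    }

    public static void main(String[] args) throws IOException {
        Path staging = Files.createTempDirectory("staging");
        System.out.println(shouldRunJob(staging));      // prints true
        recordFinalState(staging, true);                // first attempt finishes, then "crashes"
        System.out.println(shouldRunJob(staging));      // prints false
        System.out.println(checkPriorOutcome(staging)); // prints SUCCEEDED
    }
}
```

With a check like this, a re-attempt that finds a marker could skip the job and simply finish unregistering with the RM, rather than deleting output that a downstream job may already be reading.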
Attachments
Issue Links
- is duplicated by
  - YARN-243 Job Client doesn't give progress for Application Master Retries (Resolved)
- is related to
  - MAPREDUCE-4831 Task commit can occur more than once due to AM retries (Resolved)
  - MAPREDUCE-4813 AM timing out during job commit (Closed)
  - MAPREDUCE-4832 MR AM can get in a split brain situation (Closed)
  - MAPREDUCE-4913 TestMRAppMaster#testMRAppMasterMissingStaging occasionally exits (Closed)
- relates to
  - MAPREDUCE-5476 Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM (Closed)