[MAPREDUCE-4819] AM can rerun job after reporting final job status to the client - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 0.23.3, 2.0.1-alpha
Fix Version/s: 2.0.3-alpha, 0.23.6
Component/s: mr-am
Labels: None

Description

If the AM reports the final job status to the client but then crashes before unregistering with the RM, the RM can launch another AM attempt. AM re-attempts currently assume that previous attempts never reached a final job state, so the new attempt reruns the job (from scratch, if the output format doesn't support recovery).

Rerunning the job after the client has already been told its final status is bad for several reasons. If the job failed, it is confusing at best: the client was told the job failed, yet the subsequent attempt could succeed. If the job succeeded, there could be data loss: a subsequent job launched by the client may try to consume the job's output as input just as the re-attempt starts removing output files in preparation for the output commit.
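
One way to picture the missing check is the sketch below. It is illustrative only and is not taken from the attached patches: the class `FinalStateGuard`, the marker file names, and the idea of keeping the markers in a per-job staging directory are all assumptions made for this example. The point it tries to show is that an attempt persists the final state before telling the client, and a later attempt consults that record before deciding to run anything.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Hadoop: all names here are assumptions.
final class FinalStateGuard {
    enum Decision { RUN_JOB, REPORT_SUCCEEDED, REPORT_FAILED }

    // Assumed marker file names; the real fix may use something different.
    static final String SUCCESS_MARKER = "JOB_FINAL_SUCCESS";
    static final String FAILURE_MARKER = "JOB_FINAL_FAILURE";

    private FinalStateGuard() {}

    // Called by an attempt BEFORE it reports the final status to the client,
    // so a crash after reporting still leaves a durable record behind.
    static void recordFinalState(Path stagingDir, boolean succeeded) throws IOException {
        Files.createDirectories(stagingDir);
        Files.createFile(stagingDir.resolve(succeeded ? SUCCESS_MARKER : FAILURE_MARKER));
    }

    // Called by every attempt at startup. If an earlier attempt already
    // reached a final state, repeat that state instead of rerunning the job.
    static Decision decideOnStartup(Path stagingDir) {
        if (Files.exists(stagingDir.resolve(SUCCESS_MARKER))) {
            return Decision.REPORT_SUCCEEDED;
        }
        if (Files.exists(stagingDir.resolve(FAILURE_MARKER))) {
            return Decision.REPORT_FAILED;
        }
        return Decision.RUN_JOB;
    }
}
```

Under these assumptions, a re-attempt that finds a marker would skip output setup entirely, which avoids deleting output that a downstream job may already be reading.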

Attachments

MAPREDUCE-4819.1.patch (11 kB), 28/Nov/12 18:14, Bikas Saha
MAPREDUCE-4819.2.patch (35 kB), 29/Nov/12 16:50, Bikas Saha
MAPREDUCE-4819.3.patch (44 kB), 29/Nov/12 18:05, Bikas Saha
MR-4819-bobby-trunk.txt (47 kB), 02/Jan/13 14:58, Robert Joseph Evans
MR-4819-bobby-trunk.txt (86 kB), 02/Jan/13 21:23, Robert Joseph Evans
MR-4819-bobby-trunk.txt (92 kB), 03/Jan/13 20:49, Robert Joseph Evans
MR-4819-bobby-trunk.txt (95 kB), 03/Jan/13 22:15, Robert Joseph Evans
MR-4819-bobby-trunk.txt (95 kB), 03/Jan/13 22:25, Robert Joseph Evans
MR-4819-bobby-trunk.txt (98 kB), 04/Jan/13 15:20, Robert Joseph Evans
MR-4819-4832.txt (100 kB), 04/Jan/13 19:31, Robert Joseph Evans

Issue Links

is duplicated by

YARN-243 Job Client doesn't give progress for Application Master Retries (Resolved)

is related to

MAPREDUCE-4831 Task commit can occur more than once due to AM retries (Resolved)
MAPREDUCE-4813 AM timing out during job commit (Closed)
MAPREDUCE-4832 MR AM can get in a split brain situation (Closed)
MAPREDUCE-4913 TestMRAppMaster#testMRAppMasterMissingStaging occasionally exits (Closed)

relates to

MAPREDUCE-5476 Job can fail when RM restarts after staging dir is cleaned but before MR successfully unregister with RM (Closed)

People

Assignee: Bikas Saha

Reporter: Jason Darrell Lowe

Votes: 0

Watchers: 15

Dates

Created: 26/Nov/12 19:03

Updated: 03/Sep/14 23:25

Resolved: 04/Jan/13 20:44