[MAPREDUCE-4611] MR AM dies badly when Node is decommissioned - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.23.3, 2.0.0-alpha, 3.0.0-alpha1
Fix Version/s: 0.23.3, 2.0.2-alpha
Component/s: None
Labels: None
Target Version/s: 0.23.3

Description

The MR AM always assumes that it is being killed by the RM when it receives a kill signal before it has finished processing. In reality, the RM sends a kill signal only when the client cannot communicate directly with the AM, which probably means that the AM is already in a bad state. The much more common case is that the node has been marked as unhealthy or decommissioned.

I propose that, in the short term, the AM only clean up if one of the following holds:

- The process has been asked by the client to exit (kill).
- The job has finished cleanly and the process is already exiting.
- This is the last retry of the AM.

The downside is that, in some cases, a kill from the RM will leak the .staging directory and the job will not show up in the history server, at least until the full set of AM cleanup issues can be addressed, probably as part of MAPREDUCE-4428.
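
The three conditions proposed above could be sketched roughly as follows. This is only an illustration of the decision rule, not the attached patch: the class name, method name, and parameters are all hypothetical, and attempt numbers are assumed to start at 1.

```java
// Hypothetical sketch of the proposed cleanup rule.
// None of these names are taken from the Hadoop code base or from MR-4611.txt.
final class CleanupPolicy {
    private CleanupPolicy() {}

    /**
     * Decides whether the AM should clean up (for example, remove the
     * .staging directory) when it is shutting down.
     *
     * @param killedByClient     the client explicitly asked the job to exit
     * @param jobFinishedCleanly the job completed and the AM is already exiting
     * @param attempt            current AM attempt number (assumed 1-based)
     * @param maxAttempts        maximum number of AM attempts allowed
     */
    static boolean shouldCleanUp(boolean killedByClient,
                                 boolean jobFinishedCleanly,
                                 int attempt,
                                 int maxAttempts) {
        // No further retry will follow this attempt.
        boolean lastRetry = attempt >= maxAttempts;
        // Any one of the three conditions is enough to clean up.
        return killedByClient || jobFinishedCleanly || lastRetry;
    }
}
```

Under this sketch, a kill signal that arrives on a non-final attempt, without a client request and before the job finishes (for example, because the node was decommissioned), leaves the staging data in place so that a later attempt can still use it.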

Attachments

MR-4611.txt
30/Aug/12 19:13
14 kB
Robert Joseph Evans

People

Assignee: Robert Joseph Evans

Reporter: Robert Joseph Evans

Votes: 0

Watchers: 5

Dates

Created: 30/Aug/12 14:39

Updated: 12/May/16 18:22

Resolved: 31/Aug/12 20:49