Details
Bug
Status: Open
Major
Resolution: Unresolved
0.23.6
None
None
Description
Saw a case where a job recorded history for an app attempt that ended up in the ERROR state after the node the AM was running on was decommissioned. When the node was decommissioned, the RM marked all the containers on that node as killed and subsequently invalidated the application attempt. When the AM attempt heartbeated in to the RM before the NM did (and therefore before the NM could kill the AM), it discovered it was no longer a valid app attempt and exited in the ERROR state. However, it also incorrectly concluded that it was the last attempt and generated the history for the job.
Decommissioning a node should not cause an app attempt to end up in the ERROR state with history recorded, because the subsequent app attempt should be the one to generate the definitive history for the job.
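
The expected behavior could be sketched roughly as below. This is only an illustration of the decision described above, not the actual Hadoop code or a proposed patch: the class AttemptHistoryPolicy, the ExitReason enum, and the method shouldWriteFinalHistory are all hypothetical names, and the sketch assumes the AM can tell an RM-side invalidation apart from a genuine failure.

```java
// Hypothetical sketch: none of these names exist in Hadoop.
final class AttemptHistoryPolicy {

    /** Why the AM attempt is shutting down (assumed to be distinguishable by the AM). */
    enum ExitReason {
        JOB_FINISHED,               // the job ran to completion in this attempt
        ATTEMPT_FAILED,             // the attempt itself failed
        ATTEMPT_INVALIDATED_BY_RM   // the RM no longer recognizes this attempt, e.g. node decommissioned
    }

    /**
     * Decides whether the exiting attempt should generate the definitive job history.
     * attemptNumber is 1-based; maxAttempts is the configured retry limit.
     */
    static boolean shouldWriteFinalHistory(ExitReason reason, int attemptNumber, int maxAttempts) {
        switch (reason) {
            case JOB_FINISHED:
                return true;
            case ATTEMPT_FAILED:
                // Only a failure with no retries left makes this the last attempt.
                return attemptNumber >= maxAttempts;
            case ATTEMPT_INVALIDATED_BY_RM:
                // Assumption: the RM invalidated the attempt and will start a successor,
                // so this attempt must not treat itself as the last one.
                return false;
            default:
                throw new IllegalArgumentException("Unknown exit reason: " + reason);
        }
    }
}
```

Under these assumptions, an attempt invalidated by a decommission would leave the definitive history to the subsequent attempt, regardless of its own attempt number.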