Thanks to Vinod Kumar Vavilapalli for his thoughts. I had an offline discussion with Vinod; here is a summary.
Basically, there are 3 cases that need cleanup:
a. The job completed (failed or succeeded, regardless of whether it is the last retry)
b. A failure happened and was captured by MRAppMasterShutdownHook
c. A failure happened and was not captured by MRAppMasterShutdownHook
The options proposed by Vinod are:
1. YARN informs the AM that it is the last retry as part of AM start-up or the register API
2. YARN informs the AM that this is the last retry as part of AM unregister
3. YARN has a way to run a separate cleanup container after it knows for sure that the application finished exhausting all its attempts
(1) can solve a. and part of b.
Why only part of b? Because MRAppMasterShutdownHook may be triggered, but a further failure can then prevent the cleanup from completing.
(2) can only solve a.
The reason is that if isLastRetry (or mayBeTheLastAttempt) is not properly set at register time, we don't know whether we should do the cleanup or not.
(3) can solve a. b. c.
Refer to YARN-2261 for more details.
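The coverage analysis above can be written down as a small lookup, just to make the comparison explicit. This is only an illustrative model: the class, enum, and method names are made up for this comment and are not Hadoop or YARN APIs.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; these names are not Hadoop or YARN APIs.
class CleanupCoverage {
    // The three cases that need cleanup (a, b, c above).
    enum Scenario { JOB_COMPLETED, FAILURE_CAUGHT_BY_HOOK, FAILURE_MISSED_BY_HOOK }

    // The three options (1), (2), (3) above.
    enum Option { FLAG_AT_REGISTER, FLAG_AT_UNREGISTER, CLEANUP_CONTAINER }

    // Scenarios an option handles completely.
    static Set<Scenario> fullyCovers(Option option) {
        switch (option) {
            case FLAG_AT_REGISTER:
            case FLAG_AT_UNREGISTER:
                return EnumSet.of(Scenario.JOB_COMPLETED);
            default: // CLEANUP_CONTAINER
                return EnumSet.allOf(Scenario.class);
        }
    }

    // Scenarios an option handles only some of the time.
    static Set<Scenario> partiallyCovers(Option option) {
        if (option == Option.FLAG_AT_REGISTER) {
            // The hook may fire, yet a further failure can interrupt the cleanup.
            return EnumSet.of(Scenario.FAILURE_CAUGHT_BY_HOOK);
        }
        return EnumSet.noneOf(Scenario.class);
    }

    public static void main(String[] args) {
        for (Option option : Option.values()) {
            System.out.println(option + ": full=" + fullyCovers(option)
                    + ", partial=" + partiallyCovers(option));
        }
    }
}
```

Only the separate cleanup container covers all three cases in this model, which is why (3) is still worth pushing even if (1) or (2) lands first.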
I tried to work on (1) first. However, I found that moving the isLastRetry setup from MRAppMaster.init to RMCommunicator causes a lot of code changes and many unit test failures.
So my suggestion is to quickly finish (2), which makes the job-completed case correct (the most common case), and to push (3) forward.
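As a rough illustration of why (2) is enough for the job-completed case but not for the shutdown-hook cases, here is a toy model. It assumes the last-retry flag only becomes known once an unregister response arrives, and that the flag is true for a completed job. The class and method names are hypothetical and do not correspond to the real MRAppMaster or RMCommunicator code.

```java
import java.util.Optional;

// Toy model of option (2); none of these names are real Hadoop or YARN APIs.
class UnregisterTimeFlag {
    // Empty until the hypothetical unregister response has been received.
    private Optional<Boolean> isLastRetry = Optional.empty();

    // Assumption: the unregister response carries the last-retry flag.
    void onUnregisterResponse(boolean lastRetry) {
        isLastRetry = Optional.of(lastRetry);
    }

    // Clean up only when the flag is known and true; an unknown flag means no cleanup.
    boolean shouldCleanup() {
        return isLastRetry.orElse(false);
    }

    public static void main(String[] args) {
        // Case a: the job completed, so unregister ran and delivered the flag.
        UnregisterTimeFlag completed = new UnregisterTimeFlag();
        completed.onUnregisterResponse(true);
        System.out.println("job completed -> cleanup? " + completed.shouldCleanup());

        // Case b: the shutdown hook fires before any unregister response arrived.
        UnregisterTimeFlag hookOnly = new UnregisterTimeFlag();
        System.out.println("shutdown hook only -> cleanup? " + hookOnly.shouldCleanup());
    }
}
```

In this model the shutdown-hook path never learns the flag, so it cannot safely decide to clean up; that gap is what (1) or (3) would have to close.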
I'll upload a patch implementing (2) for review soon.