When shutting down an AM, the following work has to be done:
1. Finish OutputCommitter
2. Move the history file to AHS (maybe move this to after unregister in this Jira)
3. Unregister from RM
4. Delete staging dir
5. Send the job end notification
6. The implicit step of returning the final state to the client
Ideally, the six steps should be consistent. However, each step may fail, and it does not seem possible to make them a transaction that succeeds or fails as a whole. Nevertheless, IMHO, we should do as much as we can to keep the steps consistent with each other.
Among the six steps, the most critical one is unregistration (correct me if I'm wrong), because it is the only step that syncs with RM. The most harmful outcome is for AM and RM to have different knowledge of how the application concluded. For this reason, unregistration should be considered the principal step, and how the other steps behave should depend on its result. Therefore, IMHO, unregistration should be the first step to complete. On unregistration success, the following steps execute the ordinary logic; on unregistration failure, the following steps handle the exception (e.g. not moving the job history file, not sending the job end notification, etc.).
As Jason Lowe mentioned, moving the job history file may fail. That's right, but the failure is independent of whether the move happens before or after unregistration. Currently, the job history file is moved before unregistration. If the move fails, unregistration will not be invoked, and the application may be concluded as FAILED. This does not seem reasonable. Similarly, no step other than unregistration should be a reason to fail the application. The failures of the other steps should be isolated, so that AM can proceed to the end.
To sum up, IMHO, unregistration should be completed first, and it should be the step that decides the final state of the application. Given the result of unregistration, the other steps decide what they should do, and the client sees the final state. The other steps may or may not fail, but any failure should be isolated. If none of the steps fail (which I guess is the most common case), the final state is consistent across every channel. If one step fails, it only impacts its own part.
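To make the proposed ordering concrete, here is a rough sketch of what I have in mind. All names (ShutdownSketch, Step, Unregistrar, Result) are made up for illustration and are not existing MRAppMaster code; the sketch only shows unregistration running first and each later step being isolated from the others.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types exist in Hadoop.
final class ShutdownSketch {

    /** Stand-in for the call that unregisters the AM from the RM. */
    interface Unregistrar {
        void unregister() throws Exception;
    }

    /** Work done by one follow-up step, given the unregistration result. */
    interface Action {
        void run(boolean unregistered) throws Exception;
    }

    /** A named follow-up step (move history file, delete staging dir, ...). */
    static final class Step {
        final String name;
        final Action action;

        Step(String name, Action action) {
            this.name = name;
            this.action = action;
        }
    }

    /** Outcome of the whole shutdown sequence. */
    static final class Result {
        final boolean unregistered;
        final List<String> failedSteps;

        Result(boolean unregistered, List<String> failedSteps) {
            this.unregistered = unregistered;
            this.failedSteps = failedSteps;
        }
    }

    static Result shutdown(Unregistrar rm, List<Step> followUps) {
        // 1. Unregister first; its result drives everything that follows.
        boolean unregistered;
        try {
            rm.unregister();
            unregistered = true;
        } catch (Exception e) {
            unregistered = false;
        }

        // 2. Run every follow-up step, telling it how unregistration went.
        //    A failing step is recorded but does not stop the others.
        List<String> failed = new ArrayList<>();
        for (Step step : followUps) {
            try {
                step.action.run(unregistered);
            } catch (Exception e) {
                failed.add(step.name);
            }
        }
        return new Result(unregistered, failed);
    }
}
```

Each step decides for itself what to do with the `unregistered` flag, e.g. a history-file step could simply return without moving anything when the flag is false.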
Moreover, I'm not sure whether we'd like to add one more state for AM, namely unregistering. The job would move to unregistering before unregister is called, and then move to its final state after all the steps have been gone through.
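As a rough illustration of the extra state, the transitions could look like the sketch below. The class and the Phase names are hypothetical and much simpler than the real job state machine; the only point is that a job can reach a final state only by passing through UNREGISTERING.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch, not the actual job state machine in the AM.
final class JobPhaseMachine {

    enum Phase { RUNNING, UNREGISTERING, SUCCEEDED, FAILED }

    private Phase phase = Phase.RUNNING;

    /** Phases that may directly follow the given phase. */
    static Set<Phase> allowedFrom(Phase p) {
        switch (p) {
            case RUNNING:
                // The job must enter UNREGISTERING before any final state.
                return EnumSet.of(Phase.UNREGISTERING);
            case UNREGISTERING:
                // The final state is decided once all shutdown steps are done.
                return EnumSet.of(Phase.SUCCEEDED, Phase.FAILED);
            default:
                // SUCCEEDED and FAILED are terminal.
                return EnumSet.noneOf(Phase.class);
        }
    }

    void moveTo(Phase next) {
        if (!allowedFrom(phase).contains(next)) {
            throw new IllegalStateException(phase + " -> " + next + " is not allowed");
        }
        phase = next;
    }

    Phase phase() {
        return phase;
    }
}
```

With this, anything that observes the job while it is in UNREGISTERING knows the final state has not been settled yet.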