Rohith Sharma K S Thanks for the comments and suggestion.
As a side note : since
YARN-3840 removes the attempts from RMStateStore, it is very prone to get this issue ( YARN-4584) nevertheless of without RM HA is configured and fail fast is false.
As I commented in
YARN-4584, "If attempt 1~28 are removed and attempt 29~31 has been saved to appstore successfully, there will be no NPE for RM recovery." I think we need analyze the RM log more. Removing attempts will cause NPE only when RM continues to run when failing to operate(e.g. store/remove) on RMStateStore. Is there any other case might cause NPE? Maybe we need fix it.
About the solution, it is bit tricky to identify during recovery that whether-application-is-failed-to-store VS failed-attempts-were-removed-after-interval.
I think we do not need to identify these two cases, because it makes no different for recovery.
So I think you can club both your solution and Jian He's thought together, so that we can eliminate failed-attempts-were-removed-after-interval attempts. And assume that attempts recovered are of failed to store only.
In RMAppImpl#createNewAttempt(), the first new attempt id is nextAttemptId which is initialized to the minimum attempt ID in RMStateStore in RMAppImpl#recover(). So we have skipped recovering those failed-attempts-were-removed-after-interval attempts.
Regarding iterating appState.attempts, it can be sorted before iterating it. If attempts are sorted, then there should not be problem with nextAttemptId.
Yes, we could sort it.I will update the patch if needed.
attempt.recoveredFinalStatus is being set to always to FAILED. These attempts might be KILLED/FINISHED also.
These attempts might be KILLED actually, but we could not make sure about it. If it is not reasonable to set it to FAILED, how about adding another state(e.g. UNKOWN)? My concern that is it will make things complex.
getNumFailedAppAttempts() is violated if attempt is failed to store since this attempt is removed from attempts. And also note that if attempts is failed to store, then many information such as getNumFailedAppAttempts also wont be exact number since attempt failure is taken from attempt.
Yes, the number is not exact number. I have not figured out a good method to solve it now . Since RM HA is not so often and removed attempts are kept in memory, it might be acceptable.