After discussing with Yesha, we found that the root cause is as follows:
1. The YARN client loops in submitApplication() until getApplicationReport() returns the ACCEPTED state. If getApplicationReport() throws ApplicationNotFoundException, the client goes ahead and resubmits the application.
2. The call to getApplicationReport() checks the RM first. If the RM returns ApplicationNotFoundException, the RM has no information about this application. There are basically two possibilities: a. the app has finished and the RM no longer tracks it; b. the app info had not been persisted to the RMStateStore before the RM failed over or restarted. This issue is case b.
3. Although the app info has not been persisted to the RMStateStore yet, the app event has already been sent to ATS for handling, so ATS records this app with its initial state, ACCEPTED. getApplicationReport() therefore returns ACCEPTED from ATS, and the YARN client exits the loop in submitApplication() even though the RM has already forgotten the app.
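The sequence in points 1-3 can be modelled with a small self-contained sketch. This is not the actual YARN code: the class SubmitRaceSketch, the three maps, and the application id are hypothetical stand-ins for the RM's in-memory apps, the RMStateStore, and ATS, and the lookup order (RM first, then ATS) is assumed from the description above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class SubmitRaceSketch {
    // Hypothetical stand-ins for the RM's in-memory apps, the RMStateStore, and ATS.
    static final Map<String, String> rmApps = new HashMap<>();
    static final Map<String, String> stateStore = new HashMap<>();
    static final Map<String, String> ats = new HashMap<>();

    // RM receives a submission and notifies ATS before the app is persisted.
    static void submit(String appId) {
        rmApps.put(appId, "NEW");
        ats.put(appId, "ACCEPTED"); // ATS event is sent early
        // The write to stateStore would only happen later and is never reached here.
    }

    // RM restart/failover: in-memory state is rebuilt from the state store only.
    static void restartRm() {
        rmApps.clear();
        rmApps.putAll(stateStore);
    }

    // Report lookup: RM first, then ATS when the RM does not know the app.
    // An empty result models ApplicationNotFoundException.
    static Optional<String> getApplicationReport(String appId) {
        String state = rmApps.get(appId);
        if (state != null) {
            return Optional.of(state);
        }
        return Optional.ofNullable(ats.get(appId));
    }

    public static void main(String[] args) {
        String appId = "application_0001"; // hypothetical id
        submit(appId);
        restartRm(); // RM fails over before the app is saved
        System.out.println("report: " + getApplicationReport(appId).orElse("not found"));
        System.out.println("RM knows app: " + rmApps.containsKey(appId));
    }
}
```

In this model the client sees ACCEPTED and stops retrying, while the restarted RM has no record of the app at all.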
As a quick solution, we should delay the RM's notification to ATS until the app has at least reached the NEW_SAVING state, so that the RM state store has already persisted this application.
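A minimal sketch of the proposed ordering, under the same simplifications: DeferredAtsNotifySketch, onAppSaved(), and the maps are hypothetical names, not RM internals, and the sketch assumes the ATS notification fires only once the state store write has completed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DeferredAtsNotifySketch {
    // Hypothetical stand-ins for the RM's in-memory apps, the RMStateStore, and ATS.
    static final Map<String, String> rmApps = new HashMap<>();
    static final Map<String, String> stateStore = new HashMap<>();
    static final Map<String, String> ats = new HashMap<>();

    // RM receives the submission; nothing is published to ATS yet.
    static void submit(String appId) {
        rmApps.put(appId, "NEW_SAVING");
    }

    // State store write completes; only now is ATS told about the app.
    static void onAppSaved(String appId) {
        stateStore.put(appId, "ACCEPTED");
        rmApps.put(appId, "ACCEPTED");
        ats.put(appId, "ACCEPTED");
    }

    // RM restart/failover: in-memory state is rebuilt from the state store only.
    static void restartRm() {
        rmApps.clear();
        rmApps.putAll(stateStore);
    }

    // RM first, then ATS; an empty result models ApplicationNotFoundException.
    static Optional<String> getApplicationReport(String appId) {
        String state = rmApps.get(appId);
        if (state != null) {
            return Optional.of(state);
        }
        return Optional.ofNullable(ats.get(appId));
    }

    public static void main(String[] args) {
        String appId = "application_0001"; // hypothetical id
        submit(appId);
        restartRm(); // restart before the save: neither RM nor ATS knows the app
        System.out.println("before save: " + getApplicationReport(appId).orElse("not found"));

        submit(appId); // the client resubmits after the not-found result
        onAppSaved(appId);
        restartRm(); // restart after the save: the app is recovered
        System.out.println("after save: " + getApplicationReport(appId).orElse("not found"));
    }
}
```

With this ordering, a restart before the save leaves both the RM and ATS without the app, so the client gets a not-found result and resubmits instead of exiting the loop.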