Some one else have already come up with the similar problem and fix it.
We can look the jira(https://issues.apache.org/jira/browse/YARN-3809) for detail.
But I think the fix have not solve the problem completely, blow was the problem I encountered:
There is about 1000 nodes in my hadoop cluster, and I submit about 1800 apps.
I failover my active rm and rm will failover all those 1800 apps.
When a application failover, It will wait for AM container register itself.
But there is a bug in my AM (I do it intentionally), and it will not register itself.
So the RM will wait for about 10mins for the AM expiration, and it will send a CLEANUP event to
ApplicationMasterLauncher thread pool. Because there is about 1800 apps, so it will hang the ApplicationMasterLauncher
thread pool for a large time. I have already use the patch(https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so
a CLEANUP event will hang a thread 10 * 20 = 200s. But I have 1800 apps, so for each of my thread, it will
hang 1800 / 50 * 200s = 7200s=20min.
Because the AM have register itself during 10mins， so it will retry and create a new application attempt.
The application attempt will accept a container from RM, and send a LAUNCH to ApplicationMasterLauncher thread pool.
Because the 1800 CLEANUP will hang the 50 thread pools about 20mins. So the application attempt will not
start the AM container during 10min.
And it will expire, and send a CLEANUP event to ApplicationMasterLauncher thread pools too.
As you can see, none of my application can really run it.
Each of them have 5 application attempts as follows, and each of them keep retrying.
So all of my apps have hang several hours, and none of them can really run.
I think this is a bug!!! We can treat CLEANUP and LAUNCH as different events.
And use some other thread to deal with LAUNCH event or use other way.
Sorry, I english is so poor. I don't know have I describe it clearly.