In my Hadoop cluster, ResourceManager recovery is enabled with FileSystemRMStateStore.
I found that this makes the YARN cluster run slowly: cluster utilization stays at only about 50% even when many applications are pending.
The scenario is as follows.
In thread A, RMAppImpl$RMAppNewlySavingTransition calls the storeNewApplication method defined in RMStateStore. This method is synchronized.
In thread B, FileSystemRMStateStore calls the storeApplicationStateInternal method, which is also synchronized.
storeApplicationStateInternal saves an ApplicationStateData to HDFS, which normally takes 90 to 300 milliseconds in my cluster.
Suppose thread B enters FileSystemRMStateStore.storeApplicationStateInternal first. Thread A is then blocked until B finishes, because both synchronized methods lock the same object: the ResourceManager has only one RMStateStore instance, and in my cluster it is of type FileSystemRMStateStore.
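The contention described above can be reproduced outside of YARN. The following is only a minimal sketch, not the actual RMStateStore code: the Store class and the method names slowWrite and enqueue are hypothetical stand-ins for storeApplicationStateInternal and storeNewApplication, and the slow HDFS write is simulated with a latch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class SharedMonitorDemo {
    // Hypothetical stand-in for RMStateStore: both methods lock on `this`.
    static class Store {
        final CountDownLatch insideSlowWrite = new CountDownLatch(1);
        final CountDownLatch finishSlowWrite = new CountDownLatch(1);

        // Plays the role of storeApplicationStateInternal (slow HDFS write).
        synchronized void slowWrite() throws InterruptedException {
            insideSlowWrite.countDown();
            finishSlowWrite.await(); // keeps holding the monitor, like a slow write
        }

        // Plays the role of storeNewApplication (should return quickly).
        synchronized void enqueue() { }
    }

    // Returns true if thread A was observed BLOCKED while thread B held the monitor.
    static boolean threadAGetsBlocked() throws InterruptedException {
        Store store = new Store();
        Thread b = new Thread(() -> {
            try {
                store.slowWrite();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "thread-B");
        b.start();
        store.insideSlowWrite.await(); // B is now inside the synchronized method

        Thread a = new Thread(store::enqueue, "thread-A");
        a.start();
        boolean blocked = false;
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (System.nanoTime() < deadline) {
            if (a.getState() == Thread.State.BLOCKED) {
                blocked = true;
                break;
            }
            Thread.sleep(1);
        }
        store.finishSlowWrite.countDown(); // let B finish, which unblocks A
        b.join();
        a.join();
        return blocked;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("thread A blocked: " + threadAGetsBlocked());
    }
}
```

Thread A cannot enter enqueue until thread B leaves slowWrite, even though the two methods do unrelated work.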
When debugging the RMAppNewlySavingTransition.transition method, the thread stack shows that it is called from the AsyncDispatcher.dispatch method. The code of this method is as below.
The above code shows that the AsyncDispatcher.dispatch method processes events of many different types.
In fact, this AsyncDispatcher instance is ResourceManager.rmDispatcher, which is created in the ResourceManager.serviceInit method.
Many event types and handlers are registered with ResourceManager.rmDispatcher.
So in the scenario above, when thread B blocks thread A, the processing of all the events queued behind it is blocked as well.
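The effect of one blocked handler on every later event can be sketched as follows. This is an illustration under assumptions, not the AsyncDispatcher implementation: a single-threaded ExecutorService stands in for the dispatcher thread, and the two submitted tasks are hypothetical handlers for two unrelated event types.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SingleThreadDispatchDemo {
    // Returns { handledWhileBlocked, handledAfterRelease } for the second event.
    static boolean[] run() throws InterruptedException {
        // Stand-in for the dispatcher: one thread drains one event queue.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        CountDownLatch slowStarted = new CountDownLatch(1);
        CountDownLatch releaseSlow = new CountDownLatch(1);
        AtomicBoolean otherHandled = new AtomicBoolean(false);

        // First event: its handler is stuck, e.g. waiting for a lock.
        dispatcher.execute(() -> {
            slowStarted.countDown();
            try {
                releaseSlow.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Second event: unrelated type, but queued behind the first one.
        dispatcher.execute(() -> otherHandled.set(true));

        slowStarted.await();
        boolean handledWhileBlocked = otherHandled.get(); // still waiting in the queue

        releaseSlow.countDown();
        dispatcher.shutdown();
        dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        boolean handledAfterRelease = otherHandled.get();
        return new boolean[] { handledWhileBlocked, handledAfterRelease };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("handled while blocked: " + r[0]);
        System.out.println("handled after release: " + r[1]);
    }
}
```

The second event is handled only after the first handler returns, which is why a short wait on the state store lock delays scheduling-related events too.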
In my test cluster there is only one queue. When a client submits 1000 applications concurrently, cluster utilization is about 50% and many applications stay pending. If I disable ResourceManager recovery, utilization reaches 100%.
To solve this issue, I removed the synchronized modifier from some of the methods defined in RMStateStore.
Instead, these methods now synchronize on a dedicated lock object before calling dispatcher.getEventHandler().handle.
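The idea of the change can be sketched as below. This is not the attached patch, only an illustration under assumptions: the Store class, the field name enqueueLock and the methods slowWrite and enqueue are hypothetical, and the counter stands in for the call to dispatcher.getEventHandler().handle.

```java
import java.util.concurrent.CountDownLatch;

public class DedicatedLockDemo {
    static class Store {
        // Hypothetical dedicated lock, separate from the monitor of `this`.
        private final Object enqueueLock = new Object();
        final CountDownLatch insideSlowWrite = new CountDownLatch(1);
        final CountDownLatch finishSlowWrite = new CountDownLatch(1);
        int enqueued = 0;

        // Still synchronized on `this`, like storeApplicationStateInternal.
        synchronized void slowWrite() throws InterruptedException {
            insideSlowWrite.countDown();
            finishSlowWrite.await(); // simulates the slow HDFS write
        }

        // No longer synchronized on `this`, so it does not wait for slowWrite.
        void enqueue() {
            synchronized (enqueueLock) {
                enqueued++; // stand-in for handing the event to the dispatcher
            }
        }
    }

    // Returns true if enqueue() finished while slowWrite() was still running.
    static boolean enqueueCompletesDuringSlowWrite() throws InterruptedException {
        Store store = new Store();
        Thread b = new Thread(() -> {
            try {
                store.slowWrite();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "thread-B");
        b.start();
        store.insideSlowWrite.await(); // B holds the monitor of `store`

        Thread a = new Thread(store::enqueue, "thread-A");
        a.start();
        a.join(5000); // A should return almost immediately
        boolean finished = !a.isAlive() && b.isAlive() && store.enqueued == 1;

        store.finishSlowWrite.countDown();
        b.join();
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("enqueue finished during slow write: "
                + enqueueCompletesDuringSlowWrite());
    }
}
```

Calls to enqueue are still serialized with each other through enqueueLock, but they no longer wait for a write that holds the monitor of the store instance.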
With this change, cluster utilization reaches 100% and the whole cluster runs well.
Please see the attached patch.