While going through the code to check YARN-2978, I found an issue.
During construction of GetQueueInfoResponse in ClientRMService#getQueueInfo, we first collect the application attempts from the scheduler and then look up the corresponding apps in a ConcurrentHashMap held by RMContext. Although each individual operation (get, put, remove, etc.) on a ConcurrentHashMap is thread-safe, a series of ConcurrentHashMap#get calls (say, in a for loop) is not atomic as a whole.
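To illustrate the point about compound operations, here is a minimal sketch. It is not YARN code: the class name, keys, and values are made up for illustration, and the concurrent removal is performed inline on the same thread so that the interleaving is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical demo class; none of these names come from YARN.
class SnapshotThenGetDemo {

    // Returns the keys from the snapshot whose later lookup came back null.
    static List<String> missingAfterRemoval() {
        ConcurrentHashMap<String, String> apps = new ConcurrentHashMap<>();
        apps.put("app_1", "RUNNING");
        apps.put("app_2", "FINISHED");

        // Step 1: take a snapshot of the keys (stands in for the attempts
        // collected from the scheduler).
        List<String> snapshot = new ArrayList<>(apps.keySet());

        // Step 2: simulate another thread removing an entry between the
        // snapshot and the lookups.
        apps.remove("app_2");

        // Step 3: each get() is thread-safe on its own, but it can return
        // null for a key that was present when the snapshot was taken.
        List<String> missing = new ArrayList<>();
        for (String id : snapshot) {
            if (apps.get(id) == null) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingAfterRemoval());
    }
}
```

Every call on the map here is safe in isolation; the null comes from the gap between taking the snapshot and performing the lookups.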
For instance, in the code below, we call rmContext.getRMApps()#get in a loop. A ConcurrentHashMap#get returns null if the key does not exist, but there is no null check inside this for loop before dereferencing the returned value, i.e. rmApp. Although all the application attempts for the queue have been fetched just above the for loop, this block of code is not synchronized, so another thread may remove the RMApp from the ConcurrentHashMap at the same time. This can happen when an app finishes and the number of completed apps exceeds the value configured for yarn.resourcemanager.max-completed-applications.
I think there should be a null check inside this for loop; otherwise an NPE can occur.
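A rough sketch of what such a null check could look like is given here. This is not the actual ClientRMService code: the App class, its createReport method, and buildReports are hypothetical stand-ins, and String ids are used in place of the real application id type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch; the types and method names are assumptions,
// not the real YARN API.
class QueueAppReportsSketch {

    // Stand-in for RMApp, reduced to what the illustration needs.
    static final class App {
        final String id;

        App(String id) {
            this.id = id;
        }

        String createReport() {
            return "report:" + id;
        }
    }

    // Builds one report per app id, skipping ids whose app has been
    // removed from the map since the ids were collected.
    static List<String> buildReports(List<String> appIds,
                                     ConcurrentMap<String, App> apps) {
        List<String> reports = new ArrayList<>(appIds.size());
        for (String appId : appIds) {
            App app = apps.get(appId);
            if (app == null) {
                // Removed concurrently (e.g. evicted as a completed app):
                // skip it rather than dereference null.
                continue;
            }
            reports.add(app.createReport());
        }
        return reports;
    }
}
```

Skipping the missing app is one possible policy; the response then simply omits applications that were evicted while it was being built.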