Description
When running many large queries in succession, the RM eventually hangs and stops allocating resources. At the point where the RM hangs, YARN throws a NullPointerException in RegularContainerAllocator.getLocalityWaitFactor.
Sufficient space is available in yarn.nodemanager.local-dirs, so this is not a node health issue (the RM did not report any node as unhealthy). No specific query or operation consistently triggers the hang.
This problem goes away on restarting ResourceManager. No NM restart is required.