It looks like this can occur when a call that walks down the queue tree (in this case, getQueueInfo()) runs at the same time as an assignContainers call that does not start from the root queue, specifically the call made for a reservedContainer when scheduleAsynchronously is false. Essentially, it isn't safe to hold a lock on a queue while locking on a parent queue (as I now see noted in other methods in LeafQueue :/).
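To make the lock inversion concrete, here is a minimal sketch. It is not the scheduler code: the class, the two ReentrantLock fields, and the method names are all hypothetical stand-ins for the monitors of a parent queue and a leaf queue. One thread takes the locks parent-then-leaf (like a tree walk), the other leaf-then-parent (like the reserved-container path). tryLock() is used so the sketch reports the conflict instead of hanging; with plain synchronized methods each thread would block forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class LockInversionSketch {
    // Hypothetical stand-ins for the monitors of a parent queue and a leaf queue.
    static final ReentrantLock parentLock = new ReentrantLock();
    static final ReentrantLock leafLock = new ReentrantLock();

    /** Takes first, waits until both threads hold their first lock, then tries second. */
    static boolean lockInOrder(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();           // the other thread now holds its first lock
            boolean gotSecond = second.tryLock();
            if (gotSecond) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();               // keep holding first until both have tried
            return gotSecond;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            first.unlock();
        }
    }

    /** Returns true when neither thread could take its second lock. */
    static boolean demonstratesInversion() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] got = new boolean[2];
        // Tree walk: parent first, then child.
        Thread walker = new Thread(() ->
            got[0] = lockInOrder(parentLock, leafLock, bothHoldFirst, bothTried));
        // Reserved-container path: leaf first, then parent.
        Thread allocator = new Thread(() ->
            got[1] = lockInOrder(leafLock, parentLock, bothHoldFirst, bothTried));
        walker.start();
        allocator.start();
        try {
            walker.join();
            allocator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !got[0] && !got[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads blocked: " + demonstratesInversion());
    }
}
```

The latches only exist to force the bad interleaving every time; in the real scheduler it depends on timing, which is why the hang is intermittent.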
YARN-3243 is potentially a long-term fix, but it would be nice to fix this right away, as it is clearly already problematic. Also, YARN-3243 depends on a number of other sizable changes that have gone in recently, so it will be difficult to apply to older codebases, which would also benefit from a fix.
I've attached a patch somewhat along the lines suggested by Sunil G. It simply moves the acquisition of the absoluteMaxAvailCapacity outside the lock on the leaf queue. It will lock parent queues individually as it ascends, but it never holds a parent and child lock simultaneously, which is the unacceptable state. It follows the pattern of other methods in LeafQueue that access parent queues, such as recoverContainer: they are all careful to make sure the parent queue access occurs outside any lock on themselves.
Unfortunately it's not possible to just do this in root.assignContainers because of the reservedContainer case which will not invoke assignContainers on the root queue at any point. Instead, absoluteMaxAvailCapacity is determined outside any lock on the leaf queue in assignContainers before entering the synchronized method which continues the logic as it is today.
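As a rough sketch of the shape of the change (not the patch itself): the Parent and Leaf classes below are hypothetical stand-ins rather than the real ParentQueue and LeafQueue, assignContainersLocked is an invented name for the synchronized remainder, and taking the minimum of per-queue caps is only a placeholder for however absoluteMaxAvailCapacity is actually computed. The point is the ordering: the parent walk happens before the leaf's monitor is taken, and each parent releases its own lock before consulting the next one up.

```java
class LockOrderSketch {
    /** Hypothetical stand-in for a parent queue; not the real ParentQueue. */
    static class Parent {
        private final Parent parent;      // null for the root
        private final float maxCapacity;  // illustrative per-queue cap

        Parent(Parent parent, float maxCapacity) {
            this.parent = parent;
            this.maxCapacity = maxCapacity;
        }

        float absoluteMaxAvailCapacity() {
            float own;
            synchronized (this) {
                own = maxCapacity;
            }
            // This queue's lock is already released when the parent is consulted,
            // so only one queue in the chain is locked at any moment.
            return parent == null ? own : Math.min(own, parent.absoluteMaxAvailCapacity());
        }
    }

    /** Hypothetical stand-in for a leaf queue; not the real LeafQueue. */
    static class Leaf {
        private final Parent parent;
        // Records whether the leaf monitor was held when the parents were accessed.
        boolean heldLeafLockDuringParentAccess;

        Leaf(Parent parent) {
            this.parent = parent;
        }

        // Deliberately not synchronized: the parent access comes first.
        float assignContainers() {
            heldLeafLockDuringParentAccess = Thread.holdsLock(this);
            float limit = parent.absoluteMaxAvailCapacity();
            return assignContainersLocked(limit);
        }

        // The existing logic would continue here, under the leaf lock only.
        private synchronized float assignContainersLocked(float limit) {
            return limit; // placeholder for the real assignment logic
        }
    }

    public static void main(String[] args) {
        Parent root = new Parent(null, 1.0f);
        Parent mid = new Parent(root, 0.5f);
        Leaf leaf = new Leaf(mid);
        System.out.println(leaf.assignContainers());
        System.out.println(leaf.heldLeafLockDuringParentAccess);
    }
}
```

One trade-off of this shape is that the value read from the parents may be slightly stale by the time the synchronized part runs, since no parent lock is held across the two steps.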
This looks to me like the smallest code change that fixes the issue today, pending the other changes coming down the line.