Details
Sub-task
Status: Open
Minor
Resolution: Unresolved
Description
We run a busy cluster with more than a thousand queues and several hundred tenants. Queue owners and operators send us a lot of complaints and questions along the lines of "Why can't my queue/app get resources for such a long time?"
It's really hard to answer such questions.
So we added a diagnostic REST endpoint, "/ws/v1/cluster/schedule/dryrun/{parentQueueName}", which returns the children of the given parent queue sorted according to its SchedulingPolicy.getComparator(). The scheduling parameters of each child are also displayed, such as minShare, usage, demand, weight, and priority.
Usually we just call "/ws/v1/cluster/schedule/dryrun/root", and the result answers such questions on its own.
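
To illustrate why a sorted view helps, here is a minimal Python sketch of what such a dry run computes. It is not the patch and not the scheduler's code: QueueSnapshot, fair_share_compare and dry_run are hypothetical names, and the comparator is a simplified stand-in for a fair-share style policy (queues below their min share first, then by usage-to-weight ratio). Priority and other parameters are left out.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass
class QueueSnapshot:
    """Hypothetical view of one child queue's scheduling parameters."""
    name: str
    min_share: float
    usage: float
    demand: float
    weight: float  # assumed to be > 0


def fair_share_compare(a: QueueSnapshot, b: QueueSnapshot) -> int:
    """Simplified fair-share style ordering (an assumption, not the real policy)."""
    # A queue is "needy" while its usage is below min(minShare, demand).
    a_min = min(a.min_share, a.demand)
    b_min = min(b.min_share, b.demand)
    a_needy = a.usage < a_min
    b_needy = b.usage < b_min
    if a_needy != b_needy:
        return -1 if a_needy else 1  # needy queues are served first
    if a_needy:
        # Both needy: the one furthest below its min share goes first.
        key_a, key_b = a.usage / a_min, b.usage / b_min
    else:
        # Neither needy: the one with the least usage per unit of weight goes first.
        key_a, key_b = a.usage / a.weight, b.usage / b.weight
    if key_a != key_b:
        return -1 if key_a < key_b else 1
    # Break ties by name so the order is deterministic.
    return (a.name > b.name) - (a.name < b.name)


def dry_run(children: list[QueueSnapshot]) -> list[QueueSnapshot]:
    """Return the children in the order the scheduler would offer them resources."""
    return sorted(children, key=cmp_to_key(fair_share_compare))


if __name__ == "__main__":
    children = [
        QueueSnapshot("root.etl", min_share=100, usage=120, demand=200, weight=1.0),
        QueueSnapshot("root.adhoc", min_share=50, usage=10, demand=80, weight=1.0),
        QueueSnapshot("root.ml", min_share=0, usage=60, demand=300, weight=2.0),
    ]
    for queue in dry_run(children):
        print(queue)
```

A queue that sits near the end of this list, behind siblings that are below their min share or have a lower usage-to-weight ratio, will wait, and the displayed parameters show which of those conditions applies.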
I find it really useful for multi-tenant clusters, and I hope it can be merged into the mainline.