Naganarasimha G R, thanks for the comments.
Wangda Tan, please comment if you have any further suggestions.
So if a user needs to delete a queue (say a2), he needs to remove the queue from its parent's "yarn.scheduler.capacity.<parent queue>.queues" config and also set its state (yarn.scheduler.capacity.<root...a2>.state) to DELETED, right?
There is no need to remove the queue from its parent's "yarn.scheduler.capacity.<parent queue>.queues" config; just set its state (yarn.scheduler.capacity.<root...a2>.state) to DELETED.
How do we delete intermediate queues? I presume we need NOT configure the state for each of its children, right? Or do we plan to support deleting only leaf queues?
We need not configure the state for each of its children. Just mark the queue itself as DELETED.
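A minimal sketch of how "mark only the queue itself" could be interpreted: a queue counts as deleted if it, or any of its ancestors, is marked DELETED. The class and method names here are hypothetical and are not part of the CapacityScheduler code; queue paths are assumed to be dot-separated (e.g. root.a.a1).

```java
import java.util.Set;

/** Hypothetical helper, not an existing YARN class. */
class DeletedQueueResolver {

    /**
     * Returns true if queuePath itself, or any ancestor on its
     * dot-separated path, appears in the set of queues marked DELETED.
     */
    static boolean isEffectivelyDeleted(String queuePath, Set<String> markedDeleted) {
        String path = queuePath;
        while (true) {
            if (markedDeleted.contains(path)) {
                return true;
            }
            int dot = path.lastIndexOf('.');
            if (dot < 0) {
                return false; // reached the top of the hierarchy without a match
            }
            path = path.substring(0, dot); // step up to the parent queue
        }
    }
}
```

With only root.a marked, root.a.a1 would be treated as deleted, while root.ab and root.b.b1 would not, since the check walks whole path segments rather than string prefixes.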
Do we need to consider moving queues (along with their apps) from one queue hierarchy to another? IMO it complicates things, but I am not sure about the real-world use cases.
We can consider this scenario later.
In the case of HA, I think it gets more complicated. If both RMs are initialized with the old queue settings and the queue configuration is then updated, the CS is aware of the deleted queue. But if an RM starts with the updated xml (with the queue deleted), the deleted queue's information is not available, and if failover happens to this RM, apps running on the deleted queue cannot be recovered because the queue doesn't exist. So do we need to start maintaining deleted queues in the state store, or do we need to handle creating queue objects for queues whose state has been marked as DELETED (in which case we need to consider the 2nd point)?
Yes, this is the fundamental issue with the "configuration-based" approach. The API-based approach in https://issues.apache.org/jira/browse/YARN-5734 would solve this issue. But for the "configuration-based" approach, in the RM HA case, we have to make sure the configuration file on every RM node is updated.
Do we need to consider showing deleted queues in the web UI? Maybe in another jira, but the code needs to be updated.
Yes, we could file a separate jira and do it later.
The basic workflow could be: before we actually delete the queue, we should make sure the queue is in the STOPPED state, which means the queue cannot accept any new applications, and that all apps (including pending requests) have finished (for now, we could simply wait, or add a command/flag to force kill later). Then we could delete the queue and split its capacity.
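The workflow above can be sketched roughly as follows. This is an illustrative model only: the Queue class, its fields, and the even split of the freed capacity among the remaining siblings are all assumptions made for the example, not the CapacityScheduler implementation (the comment does not say how the capacity would be split).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the proposed delete workflow; not the CapacityScheduler API. */
class QueueDeleteSketch {
    enum State { RUNNING, STOPPED }

    static class Queue {
        final String name;
        State state = State.RUNNING;
        int activeApps;  // running plus pending applications
        float capacity;  // percent of the parent's capacity

        Queue(String name, int activeApps, float capacity) {
            this.name = name;
            this.activeApps = activeApps;
            this.capacity = capacity;
        }
    }

    /** A queue may be deleted only when it is STOPPED and fully drained. */
    static boolean canDelete(Queue q) {
        return q.state == State.STOPPED && q.activeApps == 0;
    }

    /**
     * Removes the target from its sibling list and, as an assumption,
     * splits its capacity evenly among the remaining siblings.
     * Returns false and changes nothing if the queue is not deletable.
     */
    static boolean delete(List<Queue> siblings, Queue target) {
        if (!canDelete(target) || !siblings.remove(target)) {
            return false;
        }
        if (!siblings.isEmpty()) {
            float share = target.capacity / siblings.size();
            for (Queue s : siblings) {
                s.capacity += share;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Queue> children = new ArrayList<>();
        Queue a1 = new Queue("a1", 3, 60f);
        Queue a2 = new Queue("a2", 0, 40f);
        children.add(a1);
        children.add(a2);

        a2.state = State.STOPPED; // stop first so no new apps are accepted
        System.out.println("deleted=" + delete(children, a2)
            + ", a1 capacity=" + a1.capacity);
    }
}
```

A force-kill flag, as mentioned above, would amount to draining activeApps before calling delete instead of waiting for the apps to finish.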