Problem statement : There should be a option to control no of tasks
run from the queue at any point of time.There should be mechanism to
cap the capacity available for a queue/job.
There are 2 approaches :
1.Define a configuration variable ,
for reduces. This value is the maximum slots that can be used in a
queue at any point of time. So for example assuming above config value
is 100 , not more than 100 tasks would be in the queue at any point of
time, assuming each task takes one slot. Typically the queue capacity
should be equal to this limit. However, since the queue capacity is
expressed as a percentage, it is likely to change, for e.g. if nodes
go down or new ones are added. We were thinking of handling this
discrepency by capping the queue capacity at the limit if required.
So, if queue capacity is more than this limit, excess capacity will be
used by the other queues. If queue capacity is less than the above
limit , then the limit would be the queue capacity - as in the current
2.Define a configuration variable ,
"mapred.capacity-scheduler.queue.<queue-name>.fixedCapacity". This is
a Boolean variable , once set to true , would make sure that queue
capacity is the hard limit for the queue. So for example: cluster size
is 200 , and queue capacity is 10% so hard limit is always 10% of the
cluster capacity. The problem with this approach is that the limit
becomes dynamic , that is , if extra nodes are added to the cluster
hard limit can actually increase . Given the use case this might not
For the expressed use case, solution 1 seems more deterministic and
controlled. Does this work ?
Note : All the calculations in Capacity scheduler are slot based ,so we
have been using task and slot interchangeably. If the queue gets high
RAM jobs, it might hit the limit earlier with fewer tasks. But this
keeps the implementation simple and easy to follow.