I'm sorry I'm coming in late on this.
Are these new knobs at the job level? I think this is the wrong direction. In particular, just limiting the number of slots won't do any good. The high-RAM job processing is a much better model. So rather than declaring a maximum number of maps or reduces, we should allow "large task" jobs where each task is given multiple slots.
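To make that concrete, here is a minimal sketch of how a "large task" job would be charged, assuming the per-job and per-cluster memory properties from the Capacity Scheduler's high-RAM support (mapred.job.map.memory.mb and mapred.cluster.map.memory.mb); treat the names and defaults here as illustrative rather than authoritative:

```java
// Sketch only: a task that asks for more memory than one slot represents is
// charged ceil(requested / slot-size) slots, instead of being capped by a
// per-job "max maps" knob.
import org.apache.hadoop.conf.Configuration;

public class LargeTaskSlots {
  public static int slotsPerMap(Configuration conf) {
    // Memory the job requests per map task (per-job setting).
    long perTaskMb = conf.getLong("mapred.job.map.memory.mb", 2048);
    // Memory represented by a single map slot on the cluster (cluster-wide setting).
    long perSlotMb = conf.getLong("mapred.cluster.map.memory.mb", 2048);
    // e.g. a 6144 MB task on 2048 MB slots occupies ceil(6144/2048) = 3 slots.
    return (int) Math.ceil((double) perTaskMb / perSlotMb);
  }
}
```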
Couldn't agree more. See
I see the following issues:
As of today, the framework cannot guard against users who take over a node and consume all of its resources (CPU, memory, disk, etc.), starving other users' tasks running on the same machines. This goes against the spirit of shared compute/storage clusters. I can see an argument being made for this feature once we can figure out how to charge
the user based on a task's total resource consumption; however, we are a long way away from that. We have taken a step in this direction with the High RAM Jobs feature in the Capacity Scheduler, but we still have a long way to go.
Given that we mostly agree that high-RAM jobs are the right model, note that these two features interact very badly with each other.
Consider a high-RAM job which has mapred.max.maps.per.cluster set to 100 and a few thousand tasks (a fraction of which is sufficient to exhaust the capacity of its queue).
Currently the CapacityScheduler starts 'reserving' slots (after charging the user, queue etc.) to satisfy the resource requirements of the job.
Now we have two choices if we choose to incorporate mapred.max.maps.per.cluster:
- Do not 'reserve' more tasktrackers than mapred.max.maps.per.cluster
- Reserve as much as needed but start 'unreserving' as soon as we have scheduled 'mapred.max.maps.per.cluster' number of maps.
Either way we are in trouble.
- With the first choice, the high-RAM job is seriously starved: the scheduler may pick only 100 nodes and no more. If those happen to be bad bets (e.g. busy with other long-running tasks), the high-RAM job has to wait a long while. The other bad side effect is that the job gets pinned to the first 100 nodes, which really hurts the locality of its tasks.
- With the second choice, the cluster is severely under-utilized, since we may reserve all of the queue's capacity and only then realize that we have to start 'unreserving'. Once the first wave of the high-RAM job's maps is done, we rinse and repeat for every wave of mapred.max.maps.per.cluster maps, thereby keeping the cluster idle for large stretches of time (see the sketch below).
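Here is a toy, self-contained sketch of the interaction (the numbers are hypothetical and this is not CapacityScheduler code): with a cap of 100 maps and a queue worth 1000 slots, choice 1 pins the job to whichever 100 trackers get reserved first, while choice 2 leaves roughly 900 reserved slots idle during every wave.

```java
// Hypothetical illustration of both choices; not actual scheduler code.
public class CapVsReservations {
  public static void main(String[] args) {
    final int mapCap = 100;         // mapred.max.maps.per.cluster
    final int queueCapacity = 1000; // slots this queue is entitled to

    // Choice 1: reserve no more tasktrackers than the cap allows. The high-RAM
    // job is stuck with the first 100 nodes picked, good bets or not, and loses
    // any chance of trading a reservation for a better (more local) node later.
    int reservedTrackers = Math.min(mapCap, queueCapacity);
    System.out.println("Choice 1: job pinned to " + reservedTrackers + " trackers");

    // Choice 2: reserve up to the queue's capacity, then unreserve once the
    // cap's worth of maps has been scheduled. Everything reserved beyond the
    // cap sits idle until the wave completes, for every wave of the job.
    int idlePerWave = queueCapacity - mapCap;
    System.out.println("Choice 2: ~" + idlePerWave + " reserved slots idle per wave");
  }
}
```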
I think we should revert this before it is released.
Unfortunately, yes. +1