Thanks Wangda Tan for the feedback. I agree that it would be a useful feature to be able to give some applications better spread, regardless of allocation type. Now we just have to figure out how to get there.
My concern is that if the limit is going to apply to all container types, I don't think we'd want to implement it with the same simple approach. For example, in our case we almost always want NODE_LOCAL and RACK_LOCAL containers to get scheduled as quickly as possible, so I'd want that limit to be high. For OFF_SWITCH, on the other hand, I want the limit to be 3-5 to keep a nice balance between scheduling performance and clustering.
The reason this check was introduced in the first place (iirc) was to prevent network-heavy applications from loading up on specific nodes. The OFF_SWITCH check was a simple way of achieving this at a global level. The feature I think you're asking for (please correct me if I misunderstood) is that applications should be able to request that container spread be prioritized over timely scheduling (kind of like locality delay does today). I completely agree this would be a useful knob for applications to have. It is a trade-off though. An application that wants really good spread would be sacrificing scheduling opportunities, which would probably go to the applications behind it in the queue (the same thing happens with locality delay).
So maybe there are two things to do:
1) Have the global OFF_SWITCH check to handle the simple case of avoiding too many network-heavy applications on a node.
2) A feature where applications can specify a max_containers_assigned_per_node_per_heartbeat. I think this would be checked down in LeafQueue.assignContainers().
Even with #2 in place, I don't think #1 could immediately go away because the network-heavy applications would need to start properly specifying this limit.
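To make the two ideas a bit more concrete, here is a rough sketch of how they might combine for a single node heartbeat. Everything in it is hypothetical: the class and method names are made up for illustration, and treating a per-application limit of 0 as "not set" is just an assumption. It is not the actual LeafQueue code, only the shape of the check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; none of these types exist in YARN.
class HeartbeatLimitSketch {

    enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

    /** Tracks containers assigned to one node during one heartbeat. */
    static final class NodeHeartbeat {
        private final int globalOffSwitchLimit;   // idea #1: global cap
        private int offSwitchAssigned = 0;
        private final Map<String, Integer> assignedPerApp = new HashMap<>();

        NodeHeartbeat(int globalOffSwitchLimit) {
            this.globalOffSwitchLimit = globalOffSwitchLimit;
        }

        /**
         * Returns true and records the assignment if both limits allow it.
         * appLimit is the application's own per-node-per-heartbeat cap
         * (idea #2); a value of 0 is assumed to mean "no limit requested".
         */
        boolean tryAssign(String appId, Locality locality, int appLimit) {
            // #1: global check, applied to OFF_SWITCH containers only.
            if (locality == Locality.OFF_SWITCH
                    && offSwitchAssigned >= globalOffSwitchLimit) {
                return false;
            }
            // #2: per-application check, applied to every locality type.
            int soFar = assignedPerApp.getOrDefault(appId, 0);
            if (appLimit > 0 && soFar >= appLimit) {
                return false;
            }
            assignedPerApp.put(appId, soFar + 1);
            if (locality == Locality.OFF_SWITCH) {
                offSwitchAssigned++;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        NodeHeartbeat hb = new NodeHeartbeat(3);
        // app-A asked for at most 2 containers per node per heartbeat.
        System.out.println(hb.tryAssign("app-A", Locality.OFF_SWITCH, 2));
        System.out.println(hb.tryAssign("app-A", Locality.NODE_LOCAL, 2));
        System.out.println(hb.tryAssign("app-A", Locality.NODE_LOCAL, 2));
    }
}
```

In this sketch an application that never sets its own limit is still held back by the global OFF_SWITCH cap, which is the reason #1 can't go away right after #2 lands.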
The other way to get rid of #1 would be to wait until network is a schedulable resource. Network-heavy applications could then request lots of network resource, which should prevent them from clustering on a node.
Does that make any sort of sense?