Thanks Wangda Tan for the comments.
2. node-delay = min(rack-delay, node-delay).
If a cluster has 40 nodes and the user requests 3 containers on node1:
Assume the configured-rack-delay=50,
rack-delay = min(3 (#requested-container) * 1 (#requested-resource-name) / 40, 50) = 0.
node-delay = min(rack-delay, 40) = 0
In the above example, no matter how rack-delay is specified/computed, if we can keep the node-delay at 40 we have a better chance of getting node-local containers allocated.
It is true that we won't get good locality in this example. IIUC, we didn't get good locality before the patch either, i.e. canAssign() would return true for NODE-LOCAL and OFF-SWITCH without delay. With the patch, canAssign() will return true for NODE-LOCAL, RACK-LOCAL, and OFF-SWITCH without delay. I believe the original intent of using localityWaitFactor was to avoid delaying small resource asks (which could be a small job, or the tail of a large job). Unfortunately the algorithm still delayed RACK-LOCAL assignments. That made no sense to me: accept OFF-SWITCH without delay, yet don't accept RACK-LOCAL? I agree that we could change things here to get better locality for small requests, but that could have a significant impact on small-job latency, so it would make me nervous to do it as part of this jira.
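To make that oddity concrete, here is a simplified, self-contained sketch of the pre-patch delay check as I understand it. This is not the actual LeafQueue code; the names (missedOpportunities, nodeLocalityDelay, localityWaitFactor) just follow the discussion above, using the 40-node / 3-container example.

{code:java}
// Simplified sketch (not the real scheduler code) of the pre-patch
// locality-delay check for the example above: 40 nodes, 3 containers
// requested on one node, node-locality-delay = 40.
public class LocalityDelaySketch {
  enum NodeType { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

  static boolean canAssign(NodeType type, long missedOpportunities,
      int requestedContainers, int requestedResourceNames,
      int clusterNodes, int nodeLocalityDelay) {
    // Small asks get a tiny wait factor, so OFF_SWITCH is accepted almost
    // immediately.
    double localityWaitFactor = Math.min(
        (double) requestedContainers * requestedResourceNames / clusterNodes,
        1.0);
    switch (type) {
      case NODE_LOCAL:
        return true;                                      // never delayed
      case OFF_SWITCH:
        return missedOpportunities >= requestedContainers * localityWaitFactor;
      case RACK_LOCAL:
      default:
        // RACK_LOCAL still waits the full node-locality-delay: the oddity.
        return missedOpportunities >= Math.min(nodeLocalityDelay, clusterNodes);
    }
  }

  public static void main(String[] args) {
    // 3 containers * 1 resource name / 40 nodes => wait factor 0.075, so
    // OFF_SWITCH passes after a single missed opportunity while RACK_LOCAL
    // would wait for 40 of them.
    for (NodeType t : NodeType.values()) {
      System.out.println(t + " allowed after 1 miss: "
          + canAssign(t, 1, 3, 1, 40, 40));
    }
  }
}
{code}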
3. Don't reset missed-opportunity if a rack-local container is allocated.
The benefit of this change is obvious - we can get faster rack-local container allocation. But I feel this can also affect node-local container allocation (if the application asks for only a small subset of the nodes in a rack), which may lead to a performance regression for locality-sensitive, I/O-heavy applications.
You're correct that it can affect node-local container allocation. I will make this behavior configurable. The reason I didn't in the first place was that I felt the circumstances where we lose out are rare: the app is not currently getting NODE-LOCAL assignments (because otherwise missedOpportunities resets anyway), AND it is not getting OFF-SWITCH assignments either (missedOpportunities doesn't reset for OFF-SWITCH, so everything would quickly be allocated OFF-SWITCH as soon as it hits that threshold). On the other hand, the effects of not doing it are dramatic. We have been having cases where 5% of the NMs are down for maintenance and some jobs take about an order of magnitude longer to run than normal.
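For illustration only, here is a toy trace (a sketch, not the real scheduler loop) of why resetting the counter to 0 on every rack-local allocation hurts so much when the desired nodes are down: the full node-locality-delay is paid again before every subsequent rack-local container.

{code:java}
// Toy trace of the old reset rule: after every RACK-LOCAL allocation the
// missed-opportunity counter goes back to 0, so the next rack-local
// container waits the full node-locality-delay again.
public class MissedOpportunityTrace {
  public static void main(String[] args) {
    int nodeLocalityDelay = 40;   // scheduling opportunities to wait
    int containersNeeded = 10;    // containers whose preferred nodes are down
    int missedOpportunities = 0;
    int totalOpportunitiesSpent = 0;

    for (int c = 0; c < containersNeeded; c++) {
      // Desired nodes are down, so only rack-local candidates show up;
      // we burn nodeLocalityDelay opportunities before accepting one.
      while (missedOpportunities < nodeLocalityDelay) {
        missedOpportunities++;
        totalOpportunitiesSpent++;
      }
      // Old behavior: the rack-local allocation resets the counter to 0.
      missedOpportunities = 0;
    }
    // 10 containers * 40 opportunities = 400 opportunities of delay,
    // versus ~40 if the counter were not dropped below the threshold.
    System.out.println("Opportunities spent waiting: " + totalOpportunitiesSpent);
  }
}
{code}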
So, here are the changes I propose:
1) I need to change the way rackLocalityDelay is specified because it doesn't handle the case where the configured value is larger than the cluster size. I was thinking of just scaling it. Say node-locality-delay=5000, rack-locality-delay=5100, and the cluster size is 3000. In the existing code, node-locality-delay would automatically get lowered to 3000. Instead, rack-locality-delay will be lowered to 3000, and node-locality-delay will be proportionally adjusted: 5000 * 3000 / 5100 = 2941.
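A minimal sketch of that adjustment, just to show the arithmetic (method and class names here are made up for illustration, not the proposed code):

{code:java}
// Sketch of the proposed scaling: cap rack-locality-delay at the cluster
// size and shrink node-locality-delay by the same ratio.
public class LocalityDelayScaling {
  static int[] scaleDelays(int nodeLocalityDelay, int rackLocalityDelay,
      int clusterSize) {
    if (rackLocalityDelay <= clusterSize) {
      return new int[] {nodeLocalityDelay, rackLocalityDelay};
    }
    // Lower the rack delay to the cluster size; scale the node delay
    // proportionally.
    int scaledNode =
        (int) ((long) nodeLocalityDelay * clusterSize / rackLocalityDelay);
    return new int[] {scaledNode, clusterSize};
  }

  public static void main(String[] args) {
    int[] scaled = scaleDelays(5000, 5100, 3000);
    // Prints: node-locality-delay=2941, rack-locality-delay=3000
    System.out.println("node-locality-delay=" + scaled[0]
        + ", rack-locality-delay=" + scaled[1]);
  }
}
{code}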
2) Add a configurable boolean that controls whether a rack-local assignment resets missed_opportunities to 0 (old behavior) or to node-locality-delay (new behavior). The default will be the new behavior.
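And a sketch of (2); the flag name below is hypothetical, not an actual configuration key. On a rack-local assignment the counter either goes back to 0 (old behavior) or to node-locality-delay (proposed default), so further rack-local allocations are not re-delayed while node-local ones are still preferred when available.

{code:java}
// Sketch of the proposed configurable reset on a RACK-LOCAL assignment.
// "rackLocalityFullReset" is a hypothetical name for the boolean switch.
public class RackLocalResetSketch {
  static long afterRackLocalAssignment(boolean rackLocalityFullReset,
      int nodeLocalityDelay) {
    // Old behavior: start the wait over from zero.
    // New behavior (proposed default): keep the counter at the node-delay
    // threshold so additional rack-local containers allocate immediately.
    return rackLocalityFullReset ? 0 : nodeLocalityDelay;
  }

  public static void main(String[] args) {
    System.out.println("old -> missedOpportunities = "
        + afterRackLocalAssignment(true, 40));
    System.out.println("new -> missedOpportunities = "
        + afterRackLocalAssignment(false, 40));
  }
}
{code}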
Let me know what you think of that approach.