The idea of a locality delay gives more flexibility; I like it.
Though I think it may complicate things quite a bit in the scheduler to do the right bookkeeping. Today, because the delay is at the app level, there is no delay counting at the allocation-request level.
If we move the delay to the allocation-request level, it means we'd have to keep a counter at the rack level that gets decremented on every allocation attempt; when it hits zero, we fall back to the rack if the node request was not fulfilled.
This means that allocation requests in the scheduler will have to carry a new delay-counter property.
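The bookkeeping above could be sketched roughly as follows; this is a minimal illustration, and the class and method names (RackRequest, shouldFallBackToRack) are hypothetical, not actual YARN scheduler code:

```java
// Hypothetical sketch of per-request delay bookkeeping; names are
// illustrative, not real scheduler classes.
public class RackRequest {
    private final String rack;
    private int delayCounter;  // allocation attempts left before rack fallback

    public RackRequest(String rack, int initialDelay) {
        this.rack = rack;
        this.delayCounter = initialDelay;
    }

    /**
     * Called on each scheduling attempt where the node-local request
     * could not be fulfilled. Decrements the counter; once it hits zero,
     * locality is relaxed to the rack.
     */
    public boolean shouldFallBackToRack() {
        if (delayCounter > 0) {
            delayCounter--;  // decrement on every allocation attempt
        }
        return delayCounter == 0;
    }
}
```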
The complexity comes from the fact that when an AM places a new allocation request, it must provide all of its non-empty allocation requests in full.
Requesting 5 containers for node1/rack1:
location=* - containers=5
location=rack1 - containers=5
location=node1 - containers=5
Requesting 5 additional containers for node2/rack1 (with original allocation still pending):
location=* - containers=10
location=rack1 - containers=10
location=node2 - containers=5
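The two asks above could be sketched as a flat location-to-count table on the AM side; this is a simplification of the real request form, and the class name (AskTable) is illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative AM-side ask table under the current "put in full" contract;
// the flat location->count map is a simplification, not the real API shape.
public class AskTable {
    public static Map<String, Integer> buildAsk() {
        Map<String, Integer> ask = new HashMap<>();
        // Initial request: 5 containers for node1/rack1
        ask.put("*", 5);
        ask.put("rack1", 5);
        ask.put("node1", 5);
        // 5 more containers for node2/rack1 while the first ask is pending:
        // the */rack1 rows are re-put with the aggregated totals, so the
        // scheduler can overwrite them without any lookup or aggregation.
        ask.put("*", 10);
        ask.put("rack1", 10);
        ask.put("node2", 5);
        return ask;
    }
}
```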
The current contract allows the scheduler to simply put the */rack container requests without having to do a lookup and aggregation.
If we now keep a delay counter at the */rack level and we do a put, we'll reset the delay counter for the node1 request family. If instead we keep the node1 family's delay counter and reuse it for the node2 request family, we'll be shorting node2's expected locality delay.
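The two bad options could be sketched like this; everything here (RackEntry, LOCALITY_DELAY, the helper names) is hypothetical, purely to make the dilemma concrete:

```java
import java.util.Map;

// Hypothetical sketch of the reset-vs-carry-over dilemma when a put
// replaces the rack1 row; names are illustrative, not scheduler code.
public class DelayCounterDilemma {
    static final int LOCALITY_DELAY = 5;  // assumed initial delay

    public static class RackEntry {
        int containers;
        int delayCounter;  // attempts left before falling back to rack
        RackEntry(int containers, int delayCounter) {
            this.containers = containers;
            this.delayCounter = delayCounter;
        }
    }

    // Option A: reset the counter on put -> the node1 family's
    // already-spent delay is lost and it waits longer than expected.
    public static RackEntry putWithReset(Map<String, RackEntry> asks,
                                         int newTotal) {
        RackEntry fresh = new RackEntry(newTotal, LOCALITY_DELAY);
        asks.put("rack1", fresh);
        return fresh;
    }

    // Option B: carry the old counter over -> the node2 family inherits
    // node1's partially spent delay and may fall back to rack too early.
    public static RackEntry putWithCarryOver(Map<String, RackEntry> asks,
                                             int newTotal) {
        RackEntry old = asks.get("rack1");
        RackEntry merged = new RackEntry(newTotal,
            old != null ? old.delayCounter : LOCALITY_DELAY);
        asks.put("rack1", merged);
        return merged;
    }
}
```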
A locality delay per container request has value, but I think it may require much more work.
Given that, don't you think going with the current approach suggested/implemented by Arun/Sandy makes sense in the short term?