Details
Type: Bug
Status: Patch Available
Priority: Major
Resolution: Unresolved
Description
The attached patch adds a unit test that reproduces the issue. The test creates a container request that specifies only racks (nodes=null) and has relax locality set to false. With the node-locality-delay configuration set appropriately, the request waits indefinitely for a container allocation and the test times out.
My understanding of the cause is as follows. The RegularContainerAllocator delays a rack-local allocation based on the node-locality-delay parameter, and this delay is measured in missed opportunities. However, because relax locality is set to false, the corresponding off-switch request is skipped, and that skip is not counted as a missed opportunity. The missed-opportunity count therefore never reaches the delay threshold, so the allocator waits indefinitely.
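To make the reasoning above concrete, here is a minimal, self-contained sketch of the counting behaviour as I understand it. It is a toy model, not the actual RegularContainerAllocator code: the class name, method name, and the skipCountsAsMissed flag are all hypothetical, and the "counted" variant only illustrates one possible behaviour, not the attached patch.

```java
// Toy model of delay scheduling for a rack-only request with relax locality = false.
// All names here are hypothetical and chosen for illustration only.
final class LocalityDelaySketch {

    /**
     * Simulates node heartbeats for a single rack-only request.
     *
     * @param nodeLocalityDelay  missed opportunities required before a rack-local
     *                           allocation is allowed
     * @param skipCountsAsMissed whether a skipped off-switch request is counted
     *                           as a missed opportunity (assumption for illustration)
     * @param maxHeartbeats      upper bound on simulated heartbeats
     * @return the heartbeat on which the container is allocated, or -1 if it is
     *         never allocated within maxHeartbeats
     */
    static int heartbeatsUntilAllocation(int nodeLocalityDelay,
                                         boolean skipCountsAsMissed,
                                         int maxHeartbeats) {
        int missedOpportunities = 0;
        for (int heartbeat = 1; heartbeat <= maxHeartbeats; heartbeat++) {
            // Rack-local allocation is only allowed once enough opportunities were missed.
            if (missedOpportunities >= nodeLocalityDelay) {
                return heartbeat;
            }
            // Otherwise the off-switch request is considered next. With relax
            // locality = false it is skipped; the question is whether that skip
            // advances the missed-opportunity counter.
            if (skipCountsAsMissed) {
                missedOpportunities++;
            }
        }
        return -1; // the request starved for the whole simulation
    }

    public static void main(String[] args) {
        // Skip not counted: the counter stays at 0 and the request never allocates.
        System.out.println(heartbeatsUntilAllocation(3, false, 1000));
        // Skip counted: the counter reaches the delay and the request allocates.
        System.out.println(heartbeatsUntilAllocation(3, true, 1000));
    }
}
```

In this model, any non-zero delay combined with an uncounted skip leaves the counter at zero forever, which matches the indefinite wait seen in the test. A delay of zero allocates on the first heartbeat in either variant.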