Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 2.6.0
- Fix Version/s: None
- Component/s: None
Description
The capacity scheduler delays scheduling a container on a rack-local node in the hope that a node-local opportunity will come along (YARN-80). It does this by counting the number of scheduling opportunities the application has missed; when the count reaches a certain threshold, the app accepts the rack-local node. The documented recommendation is to set this threshold to the number of nodes in the cluster.
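For illustration, a minimal sketch of that counting logic (the method and parameter names here are hypothetical, not the actual CapacityScheduler code; the threshold corresponds to the yarn.scheduler.capacity.node-locality-delay setting):

    // Hypothetical sketch only; names are illustrative, not the real code.
    boolean canAssignRackLocal(int missedOpportunities, int nodeLocalityDelay) {
      // missedOpportunities is bumped each time the app is offered a
      // scheduling opportunity it cannot use node-locally; once the count
      // reaches the threshold (recommended: number of nodes in the cluster),
      // a rack-local assignment is accepted.
      return missedOpportunities >= nodeLocalityDelay;
    }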
However, there are early-out optimizations in the scheduler that can cause this delay to become very long.
Example in allocateContainersToNode():
    // Try to schedule more if there are no reservations to fulfill
    if (node.getReservedContainer() == null) {
      if (calculator.computeAvailableContainers(node.getAvailableResource(),
          minimumAllocation) > 0) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Trying to schedule on node: " + node.getNodeName() +
              ", available: " + node.getAvailableResource());
        }
        root.assignContainers(clusterResource, node, false);
      }
Because this early-out skips root.assignContainers() entirely whenever a node has no available resource, an application only accrues a missed scheduling opportunity when a container completes and frees space. So, in a large cluster that is completely full (available resource on each node is 0), SchedulingOpportunities will only increase at the container-completion rate, not the heartbeat rate, which I think was the original assumption in YARN-80. On a large cluster, this can add up to an hour or more of skipped scheduling opportunities, meaning the FIFO-ness of a queue is ignored for a very long time.
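To make the magnitude concrete, a back-of-the-envelope example (all numbers are assumptions for illustration, not measurements):

    // Illustrative arithmetic only; every number here is assumed.
    int clusterNodes = 4000;          // threshold set to #nodes, per the docs
    double completionsPerSec = 1.0;   // full cluster: an opportunity arises
                                      // only when a container completes
    double delaySec = clusterNodes / completionsPerSec; // 4000 s, ~67 minutes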
Maybe there should be a time-based limit on this delay in addition to the count of missed scheduling opportunities.
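One hypothetical shape for that fix: accept a rack-local assignment once either the count threshold or a wall-clock limit is exceeded (names and parameters below are illustrative, not a patch):

    // Hypothetical sketch of a combined count-or-time relaxation.
    boolean canAssignRackLocal(int missedOpportunities, int countThreshold,
        long firstMissTimeMs, long maxDelayMs) {
      boolean countExceeded = missedOpportunities >= countThreshold;
      // Also relax once a configured wall-clock delay has elapsed since the
      // app first started waiting, so a full cluster cannot stall it for hours.
      boolean timeExceeded =
          System.currentTimeMillis() - firstMissTimeMs >= maxDelayMs;
      return countExceeded || timeExceeded;
    }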
Issue Links
- is related to: SLIDER-799 AM to decide when to relax placement policy from specific host to rack/cluster (Resolved)