YARN-11573: Add config option to make container allocation prefer nodes without reserved containers


    Description

      Applications can get stuck when the container allocation logic does not consider all candidate nodes, but only nodes that have reserved containers.
      This behavior can even block new AMs from being allocated on nodes, so the applications never reach the RUNNING state.
      A jira that mentions the same problem is YARN-9598:

      Nodes which have been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise scheduler may generate allocation or reservation proposal on these node which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.

      Since that jira implements the two other points, I decided to create this one to implement the third point separately.

      Notes:

      1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:

      Trying to allocate from reserved container in async scheduling mode
      

      in case RegularContainerAllocator creates an allocation or reservation proposal for a node that already has a reserved container.

      2. A better way is to prevent generating an AM container (or even a normal container) allocation proposal for a node if it already has a reservation on it and there are still more nodes to check in the preferred node set (see the sketch after the implementation notes below). Completely disabling task containers from being allocated to worker nodes could limit the downscaling ability that we currently have.

      3. CALL HIERARCHY

      1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
      2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId, boolean)
      3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet<org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode>, boolean)
      3.1. This is the place where it is decided whether to call allocateContainerOnSingleNode or allocateContainersOnMultiNodes
      4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
      5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
      6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
      7. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
      8. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
      9. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
      10. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
      11. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
      12. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
      13. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
      14. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
      15. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer

      As an example, a line like this is logged:

      2023-08-23 17:44:08,129 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator: assignContainers: node=<host> application=application_1692304118418_3151 priority=0 pendingAsk=<per-allocation-resource=<memory:5632, vCores:1>,repeat=1> type=OFF_SWITCH
      

      4. DETAILS OF RegularContainerAllocator#allocate

      Method definition

      4.1. Defining the ordered list of nodes to allocate containers on: LINK

          Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
              candidates);
      

      4.2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator
      4.3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator (LINK)
      In this method, the MultiNodeLookupPolicy is resolved.
      4.4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSorter#getMultiNodeLookupPolicy
      4.5. This is where the MultiNodeLookupPolicy implementation of getPreferredNodeIterator is invoked
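
      As an illustration of where a "prefer nodes without reserved containers" ordering could plug in: the iterator above comes from a MultiNodeLookupPolicy, so one option is to sort candidates so that unreserved nodes are offered first. The sketch below is a simplified, self-contained example, not the actual policy interface or the final patch:

          import java.util.Comparator;
          import java.util.List;
          import java.util.stream.Collectors;

          import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode;

          // Illustrative only: orders candidate nodes so that nodes without a
          // reserved container are offered to the allocator first.
          public final class PreferUnreservedNodes {

            /** Unreserved nodes first; ties keep the underlying policy order. */
            public static List<FiCaSchedulerNode> order(List<FiCaSchedulerNode> nodes) {
              return nodes.stream()
                  .sorted(Comparator.comparing(
                      (FiCaSchedulerNode n) -> n.getReservedContainer() != null))
                  .collect(Collectors.toList());
            }
          }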

      5. GOING UP THE CALL HIERARCHY UNTIL CapacityScheduler#allocateOrReserveNewContainers

      1. A CSAssignment is created in method CapacityScheduler#allocateOrReserveNewContainers
      2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#submitResourceCommitRequest
      3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#tryCommit
      4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#accept
      5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#commonCheckContainerAllocation
      --> This returns false and logs this line:

      2023-08-23 17:44:08,130 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Trying to allocate from reserved container in async scheduling mode
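
      For context, this rejection is a commit-time re-check: the allocation proposal is generated against one view of the node and re-validated when it is committed. The snippet below is a rough paraphrase of that check, not the actual FiCaSchedulerApp source:

          import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
          import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;

          // Rough paraphrase (NOT the actual Hadoop code): a proposal that was not
          // generated from the node's own reservation is rejected once the node
          // holds a reserved container.
          final class CommitCheckSketch {
            private static final Logger LOG =
                LoggerFactory.getLogger(CommitCheckSketch.class);

            static boolean commonCheckContainerAllocation(FiCaSchedulerNode node,
                RMContainer allocateFromReservedContainer) {
              RMContainer nodeReserved = node.getReservedContainer();
              if (nodeReserved != null
                  && !nodeReserved.equals(allocateFromReservedContainer)) {
                LOG.debug("Trying to allocate from reserved container in async"
                    + " scheduling mode");
                return false; // proposal dropped; the same node may be proposed again
              }
              return true;
            }
          }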
      

      PROPOSED FIX

      In method: RegularContainerAllocator#allocate

      There's a loop that iterates over candidate nodes: https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L853-L895

      We need to skip the nodes that already have a reservation on them; example code:

      if (reservedContainer == null) {
        // Do not schedule if there are any reservations to fulfill on the node
        if (node.getReservedContainer() != null) {
          LOG.debug("Skipping scheduling on node {} since it has already been"
                  + " reserved by {}", node.getNodeID(),
              node.getReservedContainer().getContainerId());
          ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
              activitiesManager, node, application, schedulerKey,
              ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
          continue;
        }
      }

      NOTE: This code block is copied from [^YARN-9598.001.patch#file-5]

      More notes for the implementation

      1. This new behavior needs to be hidden behind a feature flag (a CS config option).
      In my understanding, [^YARN-9598.001.patch#file-5] skips all the nodes with reservations, regardless of whether the container is an AM container or a task container.
      2. Only skip a node with an existing reservation if the iterator still has more nodes to process (see the sketch below).
      3. Add a test case to cover this scenario (see the test sketch below).
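
      Putting notes 1 and 2 together, the skip should be guarded both by the new config flag and by the iterator still having candidates left. A minimal sketch of such a guard follows; the class name and the config key are hypothetical, and the final key would be defined in CapacitySchedulerConfiguration:

          import java.util.Iterator;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode;

          // Sketch of the guarded skip for the candidate loop in
          // RegularContainerAllocator#allocate (illustrative, not the final patch).
          final class SkipReservedNodesGuard {

            // Hypothetical feature flag (note 1); disabled by default.
            static final String SKIP_NODES_WITH_RESERVATIONS =
                "yarn.scheduler.capacity.skip-nodes-with-reserved-containers";

            static boolean shouldSkip(Configuration conf, FiCaSchedulerNode node,
                Iterator<FiCaSchedulerNode> candidates) {
              return conf.getBoolean(SKIP_NODES_WITH_RESERVATIONS, false)
                  && node.getReservedContainer() != null
                  // Note 2: only skip while more nodes remain; on the last
                  // candidate a reservation is still better than doing nothing.
                  && candidates.hasNext();
            }
          }

      The candidate loop would then simply 'continue' whenever shouldSkip(...) returns true, mirroring the YARN-9598 snippet above.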

       
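      For note 3, a unit test sketch for the guard above (Mockito-based, exercising the hypothetical helper from the previous snippet):

          import static org.junit.Assert.assertFalse;
          import static org.junit.Assert.assertTrue;
          import static org.mockito.Mockito.mock;
          import static org.mockito.Mockito.when;

          import java.util.Collections;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
          import org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode;
          import org.junit.Test;

          public class TestSkipReservedNodesGuard {

            @Test
            public void testReservedNodeSkippedOnlyWhenMoreCandidatesExist() {
              Configuration conf = new Configuration();
              conf.setBoolean(
                  SkipReservedNodesGuard.SKIP_NODES_WITH_RESERVATIONS, true);

              FiCaSchedulerNode reservedNode = mock(FiCaSchedulerNode.class);
              when(reservedNode.getReservedContainer())
                  .thenReturn(mock(RMContainer.class));

              // More candidates remain -> the reserved node is skipped.
              assertTrue(SkipReservedNodesGuard.shouldSkip(conf, reservedNode,
                  Collections.singletonList(mock(FiCaSchedulerNode.class)).iterator()));

              // Last candidate -> do not skip; reserving here beats doing nothing.
              assertFalse(SkipReservedNodesGuard.shouldSkip(conf, reservedNode,
                  Collections.<FiCaSchedulerNode>emptyList().iterator()));
            }
          }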

              People

                Assignee: Szilard Nemeth
                Reporter: Szilard Nemeth
