Hadoop YARN > YARN-5139 [Umbrella] Move YARN scheduler towards global scheduler > YARN-10293

Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)


    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.4.0
    • Component/s: None
    • Labels: None

      Description

      Reserved containers are not allocated from the available space of other nodes in the CandidateNodeSet when MultiNodePlacement is enabled. YARN-10259 fixed two related issues: https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987

      We have found one more bug in CapacityScheduler.java which causes the same issue, with a slight difference in the repro.

      Repro:

      Nodes : Capacity : Used
      Node1 - 8GB, 8vcores - 8GB, 8vcores
      Node2 - 8GB, 8vcores - 8GB, 8vcores
      Node3 - 8GB, 8vcores - 8GB, 8vcores

      Queues -> A and B both 50% capacity, 100% max capacity

      MultiNode enabled + Preemption enabled
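      A queue and multi-node placement setup matching this repro could look roughly like the following capacity-scheduler.xml fragment. This is a sketch, not the exact cluster config used in the repro; property names follow the Hadoop 3.3 CapacityScheduler documentation.

```xml
<!-- Two queues, 50% capacity each, both allowed to grow to 100% -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>A,B</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.maximum-capacity</name>
  <value>100</value>
</property>
<!-- Enable multi-node placement -->
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
```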

      1. JobA is submitted to queue A and uses the full cluster: 24GB and 24 vcores.

      2. JobB is submitted to queue B with an AM size of 1GB.

      2020-05-21 12:12:27,313 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  IP=172.27.160.139       OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1590046667304_0005    CALLERCONTEXT=CLI       QUEUENAME=dummy
      

      3. Preemption kicks in and the used capacity drops below 1.0f.

      2020-05-21 12:12:48,222 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: Non-AM container preempted, current appAttemptId=appattempt_1590046667304_0004_000001, containerId=container_e09_1590046667304_0004_01_000024, resource=<memory:1024, vCores:1>
      

      4. JobB gets a reserved container via CapacityScheduler#allocateOrReserveNewContainers.

      2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e09_1590046667304_0005_01_000001 Container Transitioned from NEW to RESERVED
      2020-05-21 12:12:48,226 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Reserved container=container_e09_1590046667304_0005_01_000001, on node=host: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 available=<memory:0, vCores:0> used=<memory:8192, vCores:8> with resource=<memory:1024, vCores:1>
      

      Why did RegularContainerAllocator reserve the container when the used capacity was <= 1.0f?

      Because even though the container was preempted, the NodeManager still has to stop the container and then report the freed resources to the ResourceManager on its next heartbeat. Until that happens, the node shows no available or unallocated resources, so the allocator can only reserve.
      

      5. Now no new allocation happens and the container stays reserved.

      After the reservation, the used capacity becomes 1.0f, the code path below runs in a loop, and no new allocate or reserve happens. The reserved container cannot be allocated because the reserved node has no space. Node2 has room for 1GB, 1vcore, but CapacityScheduler#allocateOrReserveNewContainers never gets called, causing the hang.

      [INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container on node

      2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1590046667304_0005 on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041
      2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: assignContainers: partition= #applications=1
      2020-05-21 12:13:33,242 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: Reserved container=container_e09_1590046667304_0005_01_000001, on node=host: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 available=<memory:0, vCores:0> used=<memory:8192, vCores:8> with resource=<memory:1024, vCores:1>
      2020-05-21 12:13:33,243 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Allocation proposal accepted
      

      CapacityScheduler#allocateOrReserveNewContainers won't be called because the following check in allocateContainersOnMultiNodes evaluates to true and skips new allocations:

       if (getRootQueue().getQueueCapacities().getUsedCapacity(
           candidates.getPartition()) >= 1.0f
           && preemptionManager.getKillableResource(
           ...
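      The effect of that guard can be illustrated with a small, self-contained model using the numbers from this repro. This is an illustrative sketch, not the actual CapacityScheduler code; usedCapacity() here is a hypothetical helper that mimics the root queue's used capacity, in which the reserved resource counts as used.

```java
// Illustrative model of the guard in allocateContainersOnMultiNodes.
// Not the real scheduler code: usedCapacity() is a hypothetical helper.
public class MultiNodeHangSketch {

    // Fraction of cluster memory considered used; the reservation counts as used.
    static float usedCapacity(int[] usedMb, int[] totalMb, int reservedMb) {
        int used = reservedMb, total = 0;
        for (int i = 0; i < usedMb.length; i++) {
            used += usedMb[i];
            total += totalMb[i];
        }
        return (float) used / total;
    }

    public static void main(String[] args) {
        int[] totalMb = {8192, 8192, 8192}; // Node1..Node3, 8GB each
        int[] usedMb  = {8192, 7168, 8192}; // 1GB freed on Node2 by preemption
        int reservedMb = 1024;              // JobB's AM reserved on Node3

        float cap = usedCapacity(usedMb, totalMb, reservedMb);

        // With the 1GB reservation counted, used capacity is back at 1.0f, so
        // the ">= 1.0f && nothing killable" guard skips
        // allocateOrReserveNewContainers entirely -- even though Node2 has
        // 1GB free that could satisfy the reserved request.
        boolean skipsNewAllocation = cap >= 1.0f;
        System.out.println("usedCapacity=" + cap
            + " skipsNewAllocation=" + skipsNewAllocation);
    }
}
```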
      

        Attachments

        1. YARN-10293-001.patch
          18 kB
          Prabhu Joseph
        2. YARN-10293-002.patch
          17 kB
          Prabhu Joseph
        3. YARN-10293-003-WIP.patch
          19 kB
          Prabhu Joseph
        4. YARN-10293-004.patch
          19 kB
          Prabhu Joseph
        5. YARN-10293-005.patch
          18 kB
          Prabhu Joseph

              People

              • Assignee: prabhujoseph Prabhu Joseph
              • Reporter: prabhujoseph Prabhu Joseph
