  Hadoop YARN / YARN-9430

Recovering containers does not check available resources on node


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved

    Description

      I have a testcase that checks that, if some GPU devices go offline and recovery happens, only the containers that still fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.

      Detailed explanation:

      Testcase:
      1. There are 2 nodes running NodeManagers
      2. nvidia-smi is replaced with a fake bash script that reports 2 GPU devices per node, initially. This means 4 GPU devices in the cluster altogether.
      3. RM / NM recovery is enabled
      4. The test starts a sleep job, requesting 4 containers with 1 GPU device each (the AM does not request GPUs)
      5. Before restart, the fake bash script is adjusted so that it reports only 1 GPU device per node (2 in the cluster) after restart.
      6. Restart is initiated.

       

      Expected behavior:
      After restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices in the cluster.

       

      Actual behavior:
      The AM + all 4 containers are allocated, i.e. every container that was originally started in step 4.

      App id was: 1553977186701_0001

      Logs:

       

      2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
      2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
       2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
       2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER
      2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
      2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
      2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 1> used and <memory:37252, vCores:6> available after allocation
      2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
       2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
       2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
       2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation
      2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: <memory:1024, vCores:1, yarn.io/gpu: 1>, Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
       2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
       2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
       2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
       2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, <memory:2048, vCores:2, yarn.io/gpu: 2> used and <memory:37252, vCores:6, yarn.io/gpu: -1> available after allocation
       2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003
      

       

      There are multiple logs like this:

      Assigned container container_e84_1553977186701_0001_01_000005 of capacity <memory:1024, vCores:1, yarn.io/gpu: 1> on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, <memory:3072, vCores:3, yarn.io/gpu: 2> used and <memory:36228, vCores:5, yarn.io/gpu: -1> available after allocation

      Note the -1 value for the yarn.io/gpu resource!

      The issue lies in this method: https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179

      The problem is that deductUnallocatedResource does not check whether the container's resource actually fits: it subtracts the container's resource from the unallocated resource unconditionally, so the unallocated resource can drop below zero. A minimal toy sketch illustrating this is shown right after the call hierarchy below.
      Here is the ResourceManager call hierarchy for the method (from top to bottom):

      1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
      2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
      3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
      4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
      5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
      6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
      deductUnallocatedResource is called here!
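
      A minimal, self-contained toy sketch of the faulty pattern (illustrative names only, not the actual Hadoop classes), assuming a single resource dimension such as yarn.io/gpu:

      // Toy model of the unconditional deduction; ToyResource and the values below are
      // illustrative assumptions, not the real SchedulerNode/Resource classes.
      public class DeductionSketch {
        static final class ToyResource {
          long gpus;
          ToyResource(long gpus) { this.gpus = gpus; }
        }

        // The node reports only 1 GPU after restart.
        static ToyResource unallocated = new ToyResource(1);

        // Mirrors the problematic pattern: subtract without checking the remaining amount.
        static void deductUnallocated(ToyResource container) {
          unallocated.gpus -= container.gpus; // no fits-check, may drop below zero
        }

        public static void main(String[] args) {
          deductUnallocated(new ToyResource(1)); // first recovered GPU container: 0 left
          deductUnallocated(new ToyResource(1)); // second recovered GPU container: -1 left
          System.out.println("unallocated gpus = " + unallocated.gpus); // prints -1, as in the log above
        }
      }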

      Testcase that reproduces the issue:
      Add this testcase to TestFSSchedulerNode:

       

      @Test
      public void testRecovery() {
        RMNode node = createNode();
        FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
        RMContainer container1 = createContainer(Resource.newInstance(4096, 4), null);
        RMContainer container2 = createContainer(Resource.newInstance(4096, 4), null);

        // Fill up the node completely with two running containers.
        schedulerNode.allocateContainer(container1);
        schedulerNode.containerStarted(container1.getContainerId());
        schedulerNode.allocateContainer(container2);
        schedulerNode.containerStarted(container2.getContainerId());
        assertEquals("All resources of node should have been allocated",
            nodeResource, schedulerNode.getAllocatedResource());

        // A third container that no longer fits on the node.
        RMContainer container3 = createContainer(Resource.newInstance(1000, 1), null);
        when(container3.getState()).thenReturn(RMContainerState.NEW);
        assertEquals("All resources of node should have been allocated",
            nodeResource, schedulerNode.getAllocatedResource());

        // Recovery should not over-subscribe the node.
        schedulerNode.recoverContainer(container3);
        assertEquals("No resource should have been unallocated",
            Resources.none(), schedulerNode.getUnallocatedResource());
        assertEquals("All resources of node should have been allocated",
            nodeResource, schedulerNode.getAllocatedResource());
      }

       

       

      Result of testcase:

      java.lang.AssertionError: No resource should have been unallocated 
      Expected :<memory:0, vCores:0>
      Actual :<memory:-1000, vCores:-1>

      It is immediately clear that not only GPUs (and other custom resource types) but all resources are affected by this issue!

       

      Possible fix:
      1. A condition needs to be introduced that checks whether there are enough resources on the node; the container's recovery should proceed only if this is true (see the sketch below).
      2. An error log should be added. At first glance this seems sufficient, so no exception is required, but this needs a more thorough investigation and a manual test on a cluster!
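
      A rough sketch of what such a guard could look like in SchedulerNode#recoverContainer (assuming Resources#fitsIn is the right comparison and that skipping the container is acceptable; this is untested and only illustrates the idea, the final behavior still needs the investigation mentioned above):

      // Sketch only, not a tested patch; assumes the current structure of
      // SchedulerNode#recoverContainer and that skipping the container is the desired behavior.
      public synchronized void recoverContainer(RMContainer rmContainer) {
        if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
          return;
        }
        Resource required = rmContainer.getContainer().getResource();
        // Proceed with recovery only if the container still fits into the node's
        // unallocated resources; otherwise log an error and skip it.
        if (!Resources.fitsIn(required, getUnallocatedResource())) {
          LOG.error("Cannot recover container " + rmContainer.getContainerId()
              + ": it requires " + required + " but only " + getUnallocatedResource()
              + " is unallocated on node " + getNodeID());
          return;
        }
        allocateContainer(rmContainer, true);
      }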

       

          People

            Assignee: rkhandelwal (Riya Khandelwal)
            Reporter: snemeth (Szilard Nemeth)
