Affects Version/s: None
Fix Version/s: None
I have a testcase that checks that if some GPU devices go offline and recovery happens, only the containers that fit into the node's remaining resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery. The test scenario is the following:
1. There are 2 nodes running NodeManagers
2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node, i.e. 4 GPU devices in the cluster altogether.
3. RM / NM recovery is enabled (see the config sketch after this list).
4. The test starts a sleep job that requests 4 containers with 1 GPU device each (the AM does not request GPUs).
5. Before the restart, the fake bash script is adjusted so that it will report only 1 GPU device per node (2 in the cluster) after the restart.
6. Restart is initiated.
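For reference, a sketch of the recovery and GPU plugin settings such a setup needs in yarn-site.xml (the path is a placeholder for wherever the fake nvidia-smi lives; yarn.io/gpu also has to be listed in yarn.resource-types so the scheduler tracks it as a resource):
{code:xml}
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>
<property>
  <!-- absolute path of the fake nvidia-smi script -->
  <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
  <value>/path/to/fake/nvidia-smi</value>
</property>
{code}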
After the restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices left in the cluster. Instead, the AM + 4 containers were allocated, i.e. all the containers originally started in step 4.
The application id was 1553977186701_0001.
There are multiple log records like this:
Note the -1 value for the yarn.io/gpu resource!
The issue lies in this method: https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179
The problem is that the method deductUnallocatedResource does not check whether the unallocated resource stays at or above zero after the container's resource is subtracted from it; the subtraction happens unconditionally.
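For reference, the method at the linked revision looks essentially like this; Resources.subtractFrom performs a plain componentwise subtraction with no lower bound, which is how yarn.io/gpu can end up at -1:
{code:java}
// SchedulerNode.java (essentially; see the link above for the exact code)
private synchronized void deductUnallocatedResource(Resource resource) {
  if (resource == null) {
    LOG.error("Invalid deduction of null resource for "
        + rmNode.getNodeAddress());
    return;
  }
  // Componentwise subtraction: nothing prevents any component
  // (memory, vcores, yarn.io/gpu, ...) from going below zero.
  Resources.subtractFrom(unallocatedResource, resource);
  Resources.addTo(allocatedResource, resource);
}
{code}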
Here is the ResourceManager call hierarchy for the method (from top to bottom):
Testcase that reproduces the issue:
Add this testcase to TestFSSchedulerNode:
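A minimal sketch of what such a test can look like, assuming Mockito-style mocks similar to the helpers TestFSSchedulerNode already uses (the method name and the exact mock setup are illustrative):
{code:java}
import static org.junit.Assert.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.junit.Test;

@Test
public void testRecoveredContainerExceedsNodeCapacity() {
  // A node with 2048 MB / 2 vcores in total
  RMNode rmNode = mock(RMNode.class);
  when(rmNode.getTotalCapability()).thenReturn(Resource.newInstance(2048, 2));
  when(rmNode.getHostName()).thenReturn("host.test");
  FSSchedulerNode schedulerNode = new FSSchedulerNode(rmNode, false);

  // A "recovered" container that needs twice the whole node
  Resource request = Resource.newInstance(4096, 4);
  Container container = mock(Container.class);
  when(container.getResource()).thenReturn(request);
  RMContainer rmContainer = mock(RMContainer.class);
  when(rmContainer.getContainer()).thenReturn(container);
  when(rmContainer.getExecutionType()).thenReturn(ExecutionType.GUARANTEED);
  when(rmContainer.getAllocatedResource()).thenReturn(request);

  // Recovery ends up here via the call hierarchy above
  schedulerNode.allocateContainer(rmContainer);

  // Fails today: unallocated memory becomes -2048 instead of the
  // container being rejected
  assertTrue("Unallocated resource went negative",
      schedulerNode.getUnallocatedResource().getMemorySize() >= 0);
}
{code}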
Result of testcase:
It's immediately clear that not only GPU (or other custom resource types) but all resources, including memory and vcores, are affected by this issue!
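The root cause is visible in isolation, too; a two-line illustration (with made-up values) of how Resources.subtractFrom never clamps any resource type at zero:
{code:java}
Resource unallocated = Resource.newInstance(2048, 2);
Resources.subtractFrom(unallocated, Resource.newInstance(4096, 4));
// unallocated is now <memory:-2048, vCores:-2>
{code}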
Suggested fix:
1. A condition needs to be introduced that checks whether there are enough resources on the node; we should proceed with the container's recovery only if this is true (see the sketch after this list).
2. An error log should be added. At first glance this seems to be enough, so no exception is required, but this needs a more thorough investigation and a manual test on a cluster!
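A minimal sketch of point 1, assuming the check is placed right before the deduction in SchedulerNode#allocateContainer (Resources.fitsIn is an existing utility; the exact placement and failure handling are part of the investigation mentioned in point 2):
{code:java}
// SchedulerNode.java -- sketch only, simplified, not a final patch
protected synchronized void allocateContainer(RMContainer rmContainer,
    boolean launchedOnNode) {
  Container container = rmContainer.getContainer();
  // Proposed check: deduct only if the container still fits into the
  // node's unallocated resources, otherwise log an error and skip it.
  if (!Resources.fitsIn(container.getResource(), getUnallocatedResource())) {
    LOG.error("Container " + container.getId() + " (" + container.getResource()
        + ") does not fit into the unallocated resources of node "
        + getNodeName() + " (" + getUnallocatedResource()
        + "), skipping it during recovery");
    return;
  }
  deductUnallocatedResource(container.getResource());
  // ... rest of the bookkeeping unchanged ...
}
{code}
Note that simply returning here leaves the container running on the NM but untracked by the scheduler, so the real fix probably has to reject or kill the recovered container higher up in the recovery path rather than silently skip the deduction.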