Affects Version/s: None
Fix Version/s: None
I have a testcase that checks that if some GPU devices go offline and recovery happens, only the containers that fit into the node's remaining resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery. The test scenario is the following:
1. There are 2 nodes running NodeManagers
2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node, i.e. 4 GPU devices in the cluster altogether.
3. RM / NM recovery is enabled (see the config sketch after this list).
4. The test starts a sleep job that requests 4 containers with 1 GPU device each (the AM does not request GPUs).
5. Before the restart, the fake bash script is adjusted so that it will report only 1 GPU device per node (2 in the cluster) after the restart.
6. Restart is initiated.
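For reference, a sketch of the recovery and GPU plugin settings such a setup needs in yarn-site.xml (the path is a placeholder for wherever the fake nvidia-smi lives; yarn.io/gpu also has to be listed in yarn.resource-types so the scheduler tracks it as a resource):
{code:xml}
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins</name>
  <value>yarn.io/gpu</value>
</property>
<property>
  <!-- absolute path of the fake nvidia-smi script -->
  <name>yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables</name>
  <value>/path/to/fake/nvidia-smi</value>
</property>
{code}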
After the restart, only the AM and 2 normal containers should have been started, as there are only 2 GPU devices left in the cluster. Instead, the AM + 4 containers were allocated, i.e. all the containers originally started in step 4.
The application id was 1553977186701_0001.
There are multiple log records like this:
Note the -1 value for the yarn.io/gpu resource!
The issue lies in this method: https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179
The problem is that the method deductUnallocatedResource does not check whether the unallocated resource stays at or above zero after the container's resource is subtracted from it; the subtraction happens unconditionally.
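For reference, the method at the linked revision looks essentially like this; Resources.subtractFrom performs a plain componentwise subtraction with no lower bound, which is how yarn.io/gpu can end up at -1:
{code:java}
// SchedulerNode.java (essentially; see the link above for the exact code)
private synchronized void deductUnallocatedResource(Resource resource) {
  if (resource == null) {
    LOG.error("Invalid deduction of null resource for "
        + rmNode.getNodeAddress());
    return;
  }
  // Componentwise subtraction: nothing prevents any component
  // (memory, vcores, yarn.io/gpu, ...) from going below zero.
  Resources.subtractFrom(unallocatedResource, resource);
  Resources.addTo(allocatedResource, resource);
}
{code}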
Here is the ResourceManager call hierarchy for the method (from top to bottom):
Testcase that reproduces the issue:
Add this testcase to TestFSSchedulerNode:
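A minimal sketch of what such a test can look like, assuming Mockito-style mocks similar to the helpers TestFSSchedulerNode already uses (the method name and the exact mock setup are illustrative):
{code:java}
import static org.junit.Assert.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.junit.Test;

@Test
public void testRecoveredContainerExceedsNodeCapacity() {
  // A node with 2048 MB / 2 vcores in total
  RMNode rmNode = mock(RMNode.class);
  when(rmNode.getTotalCapability()).thenReturn(Resource.newInstance(2048, 2));
  when(rmNode.getHostName()).thenReturn("host.test");
  FSSchedulerNode schedulerNode = new FSSchedulerNode(rmNode, false);

  // A "recovered" container that needs twice the whole node
  Resource request = Resource.newInstance(4096, 4);
  Container container = mock(Container.class);
  when(container.getResource()).thenReturn(request);
  RMContainer rmContainer = mock(RMContainer.class);
  when(rmContainer.getContainer()).thenReturn(container);
  when(rmContainer.getExecutionType()).thenReturn(ExecutionType.GUARANTEED);
  when(rmContainer.getAllocatedResource()).thenReturn(request);

  // Recovery ends up here via the call hierarchy above
  schedulerNode.allocateContainer(rmContainer);

  // Fails today: unallocated memory becomes -2048 instead of the
  // container being rejected
  assertTrue("Unallocated resource went negative",
      schedulerNode.getUnallocatedResource().getMemorySize() >= 0);
}
{code}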
Result of testcase:
It's immediately clear that not only GPU (or other custom resource types) but all resources, including memory and vcores, are affected by this issue!
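The root cause is visible in isolation, too; a two-line illustration (with made-up values) of how Resources.subtractFrom never clamps any resource type at zero:
{code:java}
Resource unallocated = Resource.newInstance(2048, 2);
Resources.subtractFrom(unallocated, Resource.newInstance(4096, 4));
// unallocated is now <memory:-2048, vCores:-2>
{code}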
Suggested fix:
1. A condition needs to be introduced that checks whether there are enough resources on the node; we should proceed with the container's recovery only if this is true (see the sketch after this list).
2. An error log should be added. At first glance this seems to be enough, so no exception is required, but this needs a more thorough investigation and a manual test on a cluster!
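A minimal sketch of point 1, assuming the check is placed right before the deduction in SchedulerNode#allocateContainer (Resources.fitsIn is an existing utility; the exact placement and failure handling are part of the investigation mentioned in point 2):
{code:java}
// SchedulerNode.java -- sketch only, simplified, not a final patch
protected synchronized void allocateContainer(RMContainer rmContainer,
    boolean launchedOnNode) {
  Container container = rmContainer.getContainer();
  // Proposed check: deduct only if the container still fits into the
  // node's unallocated resources, otherwise log an error and skip it.
  if (!Resources.fitsIn(container.getResource(), getUnallocatedResource())) {
    LOG.error("Container " + container.getId() + " (" + container.getResource()
        + ") does not fit into the unallocated resources of node "
        + getNodeName() + " (" + getUnallocatedResource()
        + "), skipping it during recovery");
    return;
  }
  deductUnallocatedResource(container.getResource());
  // ... rest of the bookkeeping unchanged ...
}
{code}
Note that simply returning here leaves the container running on the NM but untracked by the scheduler, so the real fix probably has to reject or kill the recovered container higher up in the recovery path rather than silently skip the deduction.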