[YARN-9294] Potential race condition in setting GPU cgroups & execute command in the selected cgroup - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.10.0
Fix Version/s: None
Component/s: yarn
Labels:
None

Description

Environment is latest branch-2 head

OS: RHEL 7.4

Observation
Out of ~10 container allocations with GPU requirement, at least 1 of the allocated containers would lose GPU isolation. Even if I asked for 1 GPU, I could still have visibility to all GPUs on the same machine when running nvidia-smi.

The funny thing is even though I have visibility to all GPUs at the moment of executing container-executor (say ordinal 0,1,2,3), but cgroups jailed the process's access to only that single GPU after sometime.

The underlying process trying to access GPU would take the initial information as source of truth and try to access physical 0 GPU which is not really available to the process. This results in a [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error.

Validated the container-executor commands are correct:

PrivilegedOperationExecutor command: [/export/apps/hadoop/nodemanager/latest/bin/container-executor, --module-gpu, --container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus, 0,1,2,3]

PrivilegedOperationExecutor command: 
[/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0, application_1549663278916_0249, /grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens, /grid/a/tmp/yarn, /grid/a/tmp/userlogs, /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m, org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer, khu, application_1549663278916_0249, container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com, 8040, /grid/a/tmp/yarn]

So most likely a race condition between these two operations?

cc jhung

Another potential theory is the cgroups creation for the container actually failed but the error was swallowed silently.

Attachments

Issue Links

relates to

YARN-2194 Cgroups cease to work in RHEL7

Resolved

Activity

People

Assignee:: Keqiu Hu

Reporter:: Keqiu Hu

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 11/Feb/19 01:11

Updated:: 20/Feb/19 19:29