Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.16.0
Fix Version/s: None
Component/s: None
Description
When launching a GPU job that uses a Docker image with the unified containerizer, the Nvidia drivers are not mounted correctly. As an experiment, I launched a task with the same Docker image through both mesos-execute and Aurora and ran nvidia-smi in each. During the experiment I noticed that the /usr/local/nvidia folder was not being mounted in the Aurora task. To confirm this was the issue, I tarred up the drivers (/run/mesos/isolators/gpu/nvidia_352.39) and manually added them to the Docker image. With that change, the task launched correctly.
Here is the resulting mountinfo for the mesos-execute task. Notice that /usr/local/nvidia is mounted from /mesos/isolators/gpu/nvidia_352.39 on the /run tmpfs (device 0:15).
140 102 8:17 /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
141 140 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=666
148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is missing.
72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev rw,size=10240k,nr_inodes=16521649,mode=755
74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs rw,size=5120k
80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs ro,mode=755
85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - cgroup cgroup rw,cpuset
87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpu,cpuacct
88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,devices
89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,freezer
90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,net_cls,net_prio
91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,blkio
92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc binfmt_misc rw
98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
99 98 8:17 /mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
100 99 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/sandbox /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs/mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
67 78 0:33 / /run/user/1001 rw,nosuid,nodev,relatime master:26 - tmpfs tmpfs rw,size=13219080k,mode=700,uid=1001,gid=1001
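For anyone trying to reproduce this, the comparison above can be automated. The following is a minimal sketch, not part of Mesos or Aurora: it parses mountinfo text (the layout documented in proc(5)) and reports whether a given mount point is present. The names Mount, parse_mountinfo, and find_mount are hypothetical helpers, and SAMPLE holds two entries copied from the mesos-execute dump in this report.

```python
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    """One mountinfo entry (hypothetical helper type, fields per proc(5))."""
    mount_id: int
    parent_id: int
    device: str       # major:minor
    root: str         # path within the source filesystem
    mount_point: str  # path as seen by the process
    fs_type: str
    source: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse mountinfo text, one entry per line."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # A lone "-" separates the optional fields from fstype/source/options.
        if "-" not in fields:
            continue
        sep = fields.index("-")
        if sep < 6 or len(fields) < sep + 3:
            continue  # malformed or truncated entry
        mounts.append(Mount(
            mount_id=int(fields[0]),
            parent_id=int(fields[1]),
            device=fields[2],
            root=fields[3],
            mount_point=fields[4],
            fs_type=fields[sep + 1],
            source=fields[sep + 2],
        ))
    return mounts


def find_mount(mounts: List[Mount], mount_point: str) -> Optional[Mount]:
    """Return the entry mounted at mount_point, or None if there is none."""
    for mount in mounts:
        if mount.mount_point == mount_point:
            return mount
    return None


# Two entries taken from the mesos-execute mountinfo in this report.
SAMPLE = """\
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
"""

nvidia = find_mount(parse_mountinfo(SAMPLE), "/usr/local/nvidia")
if nvidia is None:
    print("/usr/local/nvidia is not mounted")
else:
    print("/usr/local/nvidia is mounted from", nvidia.root)
```

Inside a running task, the same check could be applied to the contents of /proc/self/mountinfo instead of SAMPLE. For the Aurora dump above, find_mount would return None for /usr/local/nvidia.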