AURORA-1763: GPU drivers are missing when using a Docker image

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.16.0
    • Fix Version/s: None
    • Component/s: Executor
    • Labels: None

    Description

      When launching a GPU job that uses a Docker image with the unified containerizer, the Nvidia drivers are not mounted into the container. As an experiment, I launched a task with both mesos-execute and Aurora, using the same Docker image, and ran nvidia-smi in each. Under Aurora, the /usr/local/nvidia folder was not mounted. To confirm that this was the cause, I tar'ed up the drivers (/run/mesos/isolators/gpu/nvidia_352.39) and manually added them to the Docker image. With that change, the task launched correctly.
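
      The archiving step of that workaround could look roughly like the sketch below. It is only an illustration: the helper name archive_driver_dir, the archive name and the "nvidia" top-level folder inside the archive are assumptions, and how the archive was then added to the image is not recorded in this report.

```python
import tarfile
from pathlib import Path


def archive_driver_dir(driver_dir: str, archive_path: str, arcname: str = "nvidia") -> str:
    """Pack a driver directory into a gzip-compressed tar archive.

    driver_dir   -- directory to archive, e.g. the path quoted in this report
                    (/run/mesos/isolators/gpu/nvidia_352.39)
    archive_path -- where to write the .tar.gz file (hypothetical name)
    arcname      -- top-level folder name inside the archive (an assumption)
    """
    src = Path(driver_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"driver directory not found: {driver_dir}")

    # "w:gz" creates a new gzip-compressed archive; add() recurses into directories.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(src), arcname=arcname)
    return archive_path
```

      Calling archive_driver_dir("/run/mesos/isolators/gpu/nvidia_352.39", "nvidia-drivers.tar.gz") on the agent host would produce an archive that can be unpacked under /usr/local in the image, assuming the process has read access to the isolator directory.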

      Here is the resulting mountinfo for the mesos-execute task. Notice that /usr/local/nvidia is mounted from mesos/isolators/gpu/nvidia_352.39 on the host's /run tmpfs (device 0:15), i.e. /run/mesos/isolators/gpu/nvidia_352.39.

      140 102 8:17 /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
      141 140 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
      142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
      143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
      144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
      145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
      146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
      147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=666
      148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw

      Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is missing.

      72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
      73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev rw,size=10240k,nr_inodes=16521649,mode=755
      74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
      75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
      76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
      77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
      78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
      79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs rw,size=5120k
      80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
      82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
      83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
      84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs ro,mode=755
      85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
      86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - cgroup cgroup rw,cpuset
      87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpu,cpuacct
      88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,devices
      89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,freezer
      90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,net_cls,net_prio
      91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,blkio
      92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
      93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
      94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
      95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
      96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
      97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc binfmt_misc rw
      98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
      99 98 8:17 /mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
      100 99 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/sandbox /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs/mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
      67 78 0:33 / /run/user/1001 rw,nosuid,nodev,relatime master:26 - tmpfs tmpfs rw,size=13219080k,mode=700,uid=1001,gid=1001
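
      The comparison between the two listings can also be made mechanically. The following is a minimal sketch, not part of Aurora or Mesos: it parses text in the /proc/[pid]/mountinfo format and reports whether anything is mounted at /usr/local/nvidia. The names Mount, parse_mountinfo and gpu_driver_mount are hypothetical, and the two sample strings are copied from the listings above (abridged to a few lines each).

```python
from typing import List, NamedTuple, Optional


class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    device: str       # major:minor
    root: str         # path within the source filesystem that is mounted
    mount_point: str  # where it appears inside the mount namespace
    fs_type: str
    source: str


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse mountinfo text; lines that do not fit the format are skipped."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        # A lone "-" separates the variable-length optional fields
        # (e.g. "master:5") from the filesystem type and mount source.
        if "-" not in fields:
            continue
        sep = fields.index("-")
        if sep < 6 or len(fields) < sep + 3:
            continue
        mounts.append(Mount(int(fields[0]), int(fields[1]), fields[2],
                            fields[3], fields[4],
                            fields[sep + 1], fields[sep + 2]))
    return mounts


def gpu_driver_mount(text: str, mount_point: str = "/usr/local/nvidia") -> Optional[Mount]:
    """Return the mount at mount_point, or None if nothing is mounted there."""
    for mount in parse_mountinfo(text):
        if mount.mount_point == mount_point:
            return mount
    return None


# Abridged samples taken from the two listings in this report.
MESOS_EXECUTE_SAMPLE = """
142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
"""

AURORA_SAMPLE = """
72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
"""
```

      Inside a running task, the same check could be applied to the contents of /proc/self/mountinfo. For the samples, gpu_driver_mount returns the tmpfs entry for the mesos-execute listing and None for the Aurora listing.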

          People

            Assignee: Unassigned
            Reporter: Justin Pinkul (jpinkul)