  1. Mesos
  2. MESOS-6383

NvidiaGpuAllocator::resources cannot load symbol nvmlDeviceGetMinorNumber - can the device minor number be ascertained reliably using an older set of API calls?


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.0.1
    • None
    • None

    Description

      We're attempting to deploy Mesos on a cluster with 2 Nvidia GPUs per host. We are not in a position to upgrade the Nvidia drivers in the near future, and are currently at driver version 319.72.

      When attempting to launch an agent with the following command to take advantage of Nvidia GPU support (master address elided):

      ./bin/mesos-agent.sh --master=<masterIP>:<masterPort> --work_dir=/tmp/mesos --isolation="cgroups/devices,gpu/nvidia"

      I receive the following error message:

      Failed to create a containerizer: Failed call to NvidiaGpuAllocator::resources: Failed to nvml::initialize: Failed to load symbol 'nvmlDeviceGetMinorNumber': Error looking up symbol 'nvmlDeviceGetMinorNumber' in 'libnvidia-ml.so.1' : /usr/lib64/libnvidia-ml.so.1: undefined symbol: nvmlDeviceGetMinorNumber

      Based on the change log for the NVML module, it seems that nvmlDeviceGetMinorNumber is only available for driver versions 331 and later, as listed under the "Changes between NVML v5.319 Update and v331" heading in the NVML API reference.
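
      To confirm on a given host that the installed library really lacks the symbol, a probe along these lines may help. This is only a sketch: the helper name has_symbol is hypothetical, and it relies on Python's ctypes resolving symbols lazily on attribute access. It is not how Mesos itself loads NVML.

```python
import ctypes


def has_symbol(library: str, symbol: str) -> bool:
    """Return True if `library` can be loaded and exports `symbol`.

    Hypothetical helper for diagnosing the error above; not part of Mesos.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        # The library is missing or cannot be loaded on this host.
        return False
    # ctypes looks the symbol up when the attribute is accessed and raises
    # AttributeError if the lookup fails, which hasattr turns into False.
    return hasattr(lib, symbol)
```

      On the affected hosts, has_symbol("libnvidia-ml.so.1", "nvmlDeviceGetMinorNumber") would be expected to return False, matching the undefined symbol error above.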

      Is there an alternate method of obtaining this information at runtime to enable support for older versions of the Nvidia driver? Based on discussion in the design document, obtaining this information from the nvidia-smi command output appears to be a feasible alternative.

      I am willing to submit a PR that amends the behaviour of NvidiaGpuAllocator so that it first attempts to call nvmlDeviceGetMinorNumber via libnvidia-ml. If the symbol cannot be found, it would fall back on nvidia-smi, using the path from a --nvidia-smi="/path/to/nvidia-smi" option passed to mesos-agent if provided, or otherwise searching for nvidia-smi on the PATH, and parse its output to obtain this information. If none of these succeed, it would raise an error listing everything that was attempted.
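
      The fallback order proposed above could look roughly like the following. This is a sketch under assumptions, not the Mesos implementation: the function names are hypothetical, the NVML lookup and the nvidia-smi parser are passed in as callables, and flag_path stands in for the value of the proposed --nvidia-smi option.

```python
import shutil
from typing import Callable, List, Optional


def resolve_nvidia_smi(flag_path: Optional[str]) -> Optional[str]:
    """Prefer the operator-supplied path, otherwise search PATH."""
    if flag_path:
        return flag_path
    return shutil.which("nvidia-smi")


def minor_numbers_with_fallback(
    from_nvml: Callable[[], Optional[List[int]]],
    from_smi: Callable[[str], List[int]],
    flag_path: Optional[str] = None,
) -> List[int]:
    """Try NVML first, then nvidia-smi, and report every attempt on failure.

    `from_nvml` returns None when nvmlDeviceGetMinorNumber is unavailable.
    `from_smi` takes the nvidia-smi path and returns the parsed minor numbers.
    """
    attempted = ["NVML symbol nvmlDeviceGetMinorNumber"]
    minors = from_nvml()
    if minors is not None:
        return minors

    smi_path = resolve_nvidia_smi(flag_path)
    if smi_path is None:
        attempted.append("nvidia-smi lookup on PATH")
    else:
        try:
            return from_smi(smi_path)
        except Exception as error:
            # Record the failure so the final message explains what went wrong.
            attempted.append("parsing output of %s (%s)" % (smi_path, error))

    raise RuntimeError(
        "Could not determine GPU minor numbers; attempted: "
        + "; ".join(attempted)
    )
```

      Passing the two lookups in as callables keeps the ordering logic separate from the details of NVML and nvidia-smi, so each part can be exercised without GPU hardware.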

      Would a function or class for parsing nvidia-smi output be a useful contribution?
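
      As a rough idea of what such a parser might look like, the sketch below extracts minor numbers from nvidia-smi -q style text. It assumes the installed nvidia-smi prints one "Minor Number : N" line per GPU. That assumption is unverified for driver 319.72 and may not hold there, so the output format would need checking on the target hosts first. The function names are hypothetical.

```python
import re
import subprocess
from typing import List

# Assumed line format: "    Minor Number    : 0" (one per GPU).
_MINOR_NUMBER = re.compile(r"^\s*Minor Number\s*:\s*(\d+)\s*$", re.MULTILINE)


def parse_minor_numbers(query_output: str) -> List[int]:
    """Return the minor numbers found in the text, in order of appearance."""
    return [int(value) for value in _MINOR_NUMBER.findall(query_output)]


def query_minor_numbers(nvidia_smi: str = "nvidia-smi") -> List[int]:
    """Run `nvidia-smi -q` and parse its output.

    Raises subprocess.CalledProcessError if the command exits non-zero and
    OSError if the binary cannot be executed.
    """
    result = subprocess.run(
        [nvidia_smi, "-q"], capture_output=True, text=True, check=True
    )
    return parse_minor_numbers(result.stdout)
```

      An empty list from parse_minor_numbers would indicate that the output carries no such lines, in which case another source for the minor number would be needed.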


          People

            Assignee: Kevin Klues (klueska)
            Reporter: Dylan Bethune-Waddell (dylanht)
            Votes: 0
            Watchers: 6
