Affects Version/s: None
Fix Version/s: 3.3.0
During internal end-to-end testing, I found the following issue:
- GPU is enabled
- yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set to "/usr/bin/ls" (any existing, valid binary file will do)
- yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to "0:0,1:1,2:2", so auto-discovery is turned off.
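As a rough illustration of how a value like "0:0,1:1,2:2" differs from auto-discovery, here is a minimal Java sketch. It assumes that the literal "auto" is the value that enables auto-discovery and that each entry is an index:minor-number pair. The class AllowedGpuDevices and its methods are hypothetical and are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Hadoop class) that interprets an allowed-gpu-devices value. */
final class AllowedGpuDevices {
    // Assumption: this literal is the value that turns auto-discovery on.
    static final String AUTO = "auto";

    /** Returns true only when the configured value asks for auto-discovery. */
    static boolean isAutoDiscovery(String value) {
        return AUTO.equals(value.trim());
    }

    /** Parses "index:minor,index:minor,..." into {index, minor} pairs. */
    static List<int[]> parse(String value) {
        List<int[]> devices = new ArrayList<>();
        for (String entry : value.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
            devices.add(new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])});
        }
        return devices;
    }

    public static void main(String[] args) {
        String configured = "0:0,1:1,2:2";
        System.out.println("auto-discovery: " + isAutoDiscovery(configured));
        for (int[] device : parse(configured)) {
            System.out.println("index=" + device[0] + " minor=" + device[1]);
        }
    }
}
```

With an explicit device list like the one above, nothing in this sketch needs the discovery binary, which is the situation the rest of this report is about.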
If the REST endpoint http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu is called, the following exception is thrown in the NM:
Let's break this down:
1. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo just calls to the
2. In org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation, the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
3. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation finally throws the exception.
This only happens when the parameter called "pathOfGpuBinary" is null.
Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, which passes its field called "pathOfGpuBinary" as the only parameter, we can be sure that if this field is null, the exception is thrown.
4. The "pathOfGpuBinary" field can only be set through this call chain:
5. GpuDiscoverer#initialize contains this code:
, so org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary is set ONLY IF auto discovery is enabled.
Since our tests don't have auto discovery enabled, we get this exception. In this sense, the exception message is very misleading to me:
Related jira: https://issues.apache.org/jira/browse/YARN-9337
I think this exception message is very misleading and, of course, it makes no sense to try to execute the discovery binary when auto-discovery is turned off.
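The chain described in steps 1 to 5 can be sketched in a few lines of Java. This is a simplified stand-in written for illustration, not the Hadoop source: the classes SketchDiscoverer and SketchBinaryHelper, the use of IllegalStateException, the "auto" literal, and the placeholder exception text are all assumptions.

```java
/** Runs the scenario from this report against the simplified stand-ins below. */
final class NullPathDemo {
    public static void main(String[] args) {
        SketchDiscoverer discoverer = new SketchDiscoverer();
        // Explicit device list, so auto-discovery is off and the path is never stored.
        discoverer.initialize("0:0,1:1,2:2", "/usr/bin/ls");
        try {
            discoverer.getGpuDeviceInformation();
        } catch (IllegalStateException e) {
            System.out.println("Failed as described: " + e.getMessage());
        }
    }
}

/** Hypothetical stand-in for the discoverer; only the relevant field is modeled. */
final class SketchDiscoverer {
    private String pathOfGpuBinary; // stays null unless auto-discovery is enabled
    private final SketchBinaryHelper helper = new SketchBinaryHelper();

    void initialize(String allowedDevices, String configuredBinaryPath) {
        // Assumption: "auto" is the value that enables auto-discovery.
        if ("auto".equals(allowedDevices)) {
            pathOfGpuBinary = configuredBinaryPath;
        }
    }

    String getGpuDeviceInformation() {
        // The field is passed on unconditionally, even when it was never set.
        return helper.getGpuDeviceInformation(pathOfGpuBinary);
    }
}

/** Hypothetical stand-in for the binary helper. */
final class SketchBinaryHelper {
    String getGpuDeviceInformation(String pathOfGpuBinary) {
        if (pathOfGpuBinary == null) {
            // Placeholder text; the real exception message is not reproduced here.
            throw new IllegalStateException("placeholder: discovery binary path is null");
        }
        return "would execute " + pathOfGpuBinary;
    }
}
```

In this sketch the failure depends only on whether initialize stored the path, not on whether "/usr/bin/ls" exists, which matches the observation that a valid binary path in the configuration does not prevent the exception.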