Description
The NodeManager will not start in the following cases:
1. If auto-discovery is enabled:
- The nvidia-smi path is misconfigured or the file does not exist.
- No GPUs are found.
- The file exists but does not point to an nvidia-smi binary.
- The binary is valid but an IOException occurs.
2. If the manually configured GPU devices are misconfigured:
- Any failure in the index:minor number format causes a problem.
- Configuring 0 devices causes a problem.
- NumberFormatException is not handled.
A better option would be to log warnings about the configuration, set the number of available GPUs to 0, and let the node start and run non-GPU jobs.
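
The lenient behavior proposed here could look roughly like the sketch below for the manually configured case. It is not the actual NodeManager code: the class LenientGpuConfigParser, the GpuDevice holder, and the method parseAllowedDevices are hypothetical names, and the sketch assumes the devices are given as a comma-separated list of index:minor pairs. Any malformed entry, including one that would raise NumberFormatException, produces a warning and an empty device list instead of an exception.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper, not part of the real NodeManager code base.
public class LenientGpuConfigParser {
    private static final Logger LOG =
        Logger.getLogger(LenientGpuConfigParser.class.getName());

    // Hypothetical holder for one index:minor pair.
    public static final class GpuDevice {
        public final int index;
        public final int minorNumber;

        GpuDevice(int index, int minorNumber) {
            this.index = index;
            this.minorNumber = minorNumber;
        }
    }

    // Parses a value such as "0:0,1:1" (assumed format). On any problem it
    // logs a warning and returns an empty list, so the caller can continue
    // with 0 GPUs instead of aborting startup.
    public static List<GpuDevice> parseAllowedDevices(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            LOG.warning("No GPU devices configured; continuing with 0 GPUs.");
            return Collections.emptyList();
        }
        List<GpuDevice> devices = new ArrayList<>();
        for (String entry : configured.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 2) {
                LOG.warning("Invalid GPU entry '" + entry
                    + "', expected index:minor; continuing with 0 GPUs.");
                return Collections.emptyList();
            }
            try {
                int index = Integer.parseInt(parts[0].trim());
                int minor = Integer.parseInt(parts[1].trim());
                if (index < 0 || minor < 0) {
                    LOG.warning("Negative value in GPU entry '" + entry
                        + "'; continuing with 0 GPUs.");
                    return Collections.emptyList();
                }
                devices.add(new GpuDevice(index, minor));
            } catch (NumberFormatException e) {
                // Handle the exception here instead of letting it propagate.
                LOG.warning("Non-numeric GPU entry '" + entry
                    + "'; continuing with 0 GPUs.");
                return Collections.emptyList();
            }
        }
        return devices;
    }
}
```

In this sketch a single bad entry discards the whole list, which matches the suggestion to set 0 available GPUs. Keeping the valid entries and skipping only the bad ones would be an alternative design.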