Type: New Feature
Affects Version/s: None
Fix Version/s: 3.1.0
A growing variety of workloads is moving to YARN, including machine learning and deep learning workloads that can be sped up by GPU computation power. Workloads should be able to request GPUs from YARN as simply as they request CPU and memory.
To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: Either an admin configures GPU resources and architectures on each node or, in a more advanced setup, the NodeManager automatically discovers GPU resources and architectures and reports them to the ResourceManager.
2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.
3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.
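To illustrate piece 2, here is a minimal sketch of what treating GPU as just another resource type could look like. It is a toy model, not YARN code: the `Node` class, the `pick_node` helper, and the resource names `memory_mb`, `vcores`, and `gpu` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy model of a node; NOT a YARN class (hypothetical, for illustration)."""
    name: str
    capacity: Dict[str, int]                      # e.g. {"memory_mb": 8192, "vcores": 8, "gpu": 2}
    allocated: Dict[str, int] = field(default_factory=dict)

    def available(self) -> Dict[str, int]:
        # Remaining amount of each resource type the node reports.
        return {k: v - self.allocated.get(k, 0) for k, v in self.capacity.items()}

    def fits(self, request: Dict[str, int]) -> bool:
        # A resource type the node does not report counts as 0 available,
        # so a GPU request never fits on a node without GPUs.
        avail = self.available()
        return all(avail.get(k, 0) >= v for k, v in request.items())

    def allocate(self, request: Dict[str, int]) -> bool:
        # Reserve the request on this node if every resource type fits.
        if not self.fits(request):
            return False
        for k, v in request.items():
            self.allocated[k] = self.allocated.get(k, 0) + v
        return True


def pick_node(nodes: List[Node], request: Dict[str, int]) -> Optional[Node]:
    # First-fit placement: GPU is checked by the same loop as CPU and memory.
    for node in nodes:
        if node.allocate(request):
            return node
    return None


nodes = [
    Node("cpu-only", {"memory_mb": 8192, "vcores": 8}),
    Node("gpu-node", {"memory_mb": 8192, "vcores": 8, "gpu": 2}),
]
chosen = pick_node(nodes, {"memory_mb": 1024, "vcores": 1, "gpu": 2})
print(chosen.name if chosen else "no node fits")
```

The point of the sketch is that the placement logic has no GPU-specific branch: a request that names `gpu` is matched against node capacity the same way as one that names only memory and vcores.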
YARN-4122 and YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.
YARN-4122 proposed to use CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.