[YARN-6223] [Umbrella] Natively support GPU configuration/discovery/scheduling/isolation on YARN - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 3.1.0
Component/s: None
Labels: None

Description

A growing variety of workloads is moving to YARN, including machine learning and deep learning workloads that can be sped up by GPU computation. Workloads should be able to request GPUs from YARN as simply as they request CPU and memory.

To make a complete GPU story, we should support the following pieces:
1) GPU discovery/configuration: An admin can configure GPU resources and architectures on each node or, in the more advanced case, the NodeManager can automatically discover GPU resources and architectures and report them to the ResourceManager.

2) GPU scheduling: The YARN scheduler should account for GPU as a resource type, just like CPU and memory.

3) GPU isolation/monitoring: Once a task is launched with GPU resources, the NodeManager should properly isolate and monitor the task's resource usage.

For #2, YARN-3926 can support it natively. For #3, YARN-3611 has introduced an extensible framework that supports isolation for different resource types and different runtimes.
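To make the scheduling idea in #2 concrete, here is a minimal sketch of a scheduler that treats GPU as one more named entry in a resource vector rather than a special case. It is an illustration only, not YARN's scheduler code: the `Node` class, the `fits` and `allocate` helpers, the first-fit policy, and the node names are all hypothetical, and the resource-type name `yarn.io/gpu` is assumed here for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Resource-type name assumed for illustration.
GPU = "yarn.io/gpu"


@dataclass
class Node:
    """Hypothetical view of a node's remaining capacity, keyed by resource type."""
    name: str
    available: Dict[str, int] = field(default_factory=dict)


def fits(request: Dict[str, int], available: Dict[str, int]) -> bool:
    """True if every requested resource type is available in sufficient quantity.

    A resource type missing from the node counts as zero, so a GPU request
    never fits on a node that reports no GPUs.
    """
    return all(available.get(rtype, 0) >= amount for rtype, amount in request.items())


def allocate(request: Dict[str, int], nodes: List[Node]) -> Optional[Node]:
    """First-fit allocation: deduct the request from the first node that fits."""
    for node in nodes:
        if fits(request, node.available):
            for rtype, amount in request.items():
                node.available[rtype] = node.available.get(rtype, 0) - amount
            return node
    return None


nodes = [
    Node("cpu-only", {"memory-mb": 8192, "vcores": 8}),
    Node("gpu-node", {"memory-mb": 8192, "vcores": 8, GPU: 2}),
]
request = {"memory-mb": 4096, "vcores": 2, GPU: 2}
chosen = allocate(request, nodes)
print(chosen.name if chosen else "no node fits")
```

Because the fit check iterates over whatever resource types the request names, adding another device type (for example FPGA) would need no change to the matching logic, which is the appeal of an extensible resource model.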

Related JIRAs:
A couple of JIRAs (YARN-4122/YARN-5517) were filed with similar goals but different solutions:
For scheduling:

YARN-4122 and YARN-5517 both add a new GPU resource type to the Resource protocol instead of leveraging YARN-3926.

For isolation:

YARN-4122 proposed using CGroups for isolation, which cannot solve the problems listed at https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation#challenges, such as minor device number mapping, loading the nvidia_uvm module, and mismatched CUDA/driver versions.
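For context on the device-level part of isolation, the following sketch shows how deny entries for the cgroup v1 devices controller could be generated from GPU minor device numbers. It is a hedged illustration, not the NodeManager's implementation: the `deny_entries` helper and its inputs are hypothetical, and it assumes NVIDIA GPUs appear as character devices with major number 195. Each entry follows the `type major:minor access` form that the `devices.deny` file accepts.

```python
from typing import Iterable, List

# Assumption: NVIDIA GPU character devices (/dev/nvidia<N>) use major number 195.
NVIDIA_MAJOR = 195


def deny_entries(all_minors: Iterable[int], allowed_minors: Iterable[int]) -> List[str]:
    """Build devices.deny entries for every GPU the container was NOT assigned.

    Each entry has the form "c <major>:<minor> rwm": a character device,
    with read, write, and mknod access denied.
    """
    allowed = set(allowed_minors)
    return [
        f"c {NVIDIA_MAJOR}:{minor} rwm"
        for minor in sorted(set(all_minors))
        if minor not in allowed
    ]


# A node with four GPUs; the container was assigned only the GPU with minor number 1.
entries = deny_entries(all_minors=[0, 1, 2, 3], allowed_minors=[1])
for entry in entries:
    print(entry)
```

This only hides unassigned devices from a container. It does nothing about loading kernel modules or matching CUDA and driver versions, which is why a device cgroup alone leaves the challenges listed above open.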

Attachments

YARN-6223.wip.3.patch	14/Jul/17 22:49	128 kB	Wangda Tan
YARN-6223.wip.2.patch	28/Jun/17 02:38	69 kB	Wangda Tan
YARN-6223.wip.1.patch	01/Apr/17 03:40	31 kB	Wangda Tan
YARN-6223.Natively-support-GPU-on-YARN-v1.pdf	01/Apr/17 03:40	169 kB	Wangda Tan

Issue Links

is related to
YARN-8200	Backport resource types/GPU features to branch-3.0/branch-2	Resolved

relates to
YARN-4122	Add support for GPU as a resource	Resolved
YARN-5983	[Umbrella] Support for FPGA as a Resource in YARN	Resolved

requires
YARN-3926	[Umbrella] Extend the YARN resource model for easier resource-type management and profiles	Resolved

Sub-Tasks

1.	Add support for GPU as a resource	Resolved	Jun Gong
2.	Add support in NodeManager to isolate GPU devices by using CGroups	Resolved	Wangda Tan
3.	[YARN-6223] Native code changes to support isolate GPU devices by using CGroups	Resolved	Wangda Tan
4.	Document GPU isolation feature	Resolved	Wangda Tan
5.	Support GPU isolation for docker container	Resolved	Wangda Tan
6.	Add support to show GPU in UI including metrics	Resolved	Wangda Tan
7.	GPU Isolation: Incorrect minor device numbers written to devices.deny file	Resolved	Jonathan Hung
8.	Use "docker volume inspect" to make sure that volumes for GPU drivers/libs are properly mounted.	Resolved	Wangda Tan
9.	Ensure volume to include GPU base libraries after created by plugin	Resolved	Wangda Tan
10.	Gpu Information page could be empty for nodes without GPU	Resolved	Sunil G
11.	GPU volume creation command fails when work preserving is disabled at NM	Resolved	Zian Chen
12.	Document YARN Ambari Integration Guide for GPU	Resolved	Zian Chen

People

Assignee: Wangda Tan

Reporter: Wangda Tan

Votes: 4

Watchers: 55

Dates

Created: 23/Feb/17 00:46

Updated: 14/Nov/18 17:38

Resolved: 06/Apr/18 18:32
