[YARN-1857] CapacityScheduler headroom doesn't account for other AM's running - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.3.0
Fix Version/s: 2.6.0
Component/s: capacityscheduler
Labels: None
Target Version/s: 2.6.0
Hadoop Flags: Reviewed

Description

It's possible for an application to hang forever (or for a long time) in a cluster with multiple users. The headroom sent to the application is based on the user limit, but it doesn't account for other application masters using space in that queue. So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users.

For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application from user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but not their tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with reduces, but it still needs to finish a few maps. The headroom sent to this application is based only on the user limit (which is 100% of the cluster capacity). It is using, let's say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5% available, so it doesn't know that it should kill a reduce in order to run a map.
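The arithmetic in this scenario can be sketched as follows. This is only an illustration and not the CapacityScheduler code or the attached patch: the class `HeadroomSketch`, both method names, and the use of plain percentage units are assumptions made for the example. The second method shows one hypothetical way to make the headroom reflect what is actually free in the queue.

```java
// Illustrative sketch only; names and units are hypothetical, not Hadoop APIs.
final class HeadroomSketch {

    /** Headroom as described in the report: user limit minus what the user has consumed. */
    static long userLimitHeadroom(long userLimit, long userConsumed) {
        return Math.max(0, userLimit - userConsumed);
    }

    /**
     * Hypothetical alternative: additionally cap the headroom by the space
     * that is actually free in the queue, so usage by other users counts.
     */
    static long queueAwareHeadroom(long userLimit, long userConsumed,
                                   long queueCapacity, long queueUsed) {
        long queueFree = Math.max(0, queueCapacity - queueUsed);
        return Math.min(userLimitHeadroom(userLimit, userConsumed), queueFree);
    }

    public static void main(String[] args) {
        long queueCapacity = 100; // whole cluster, in percent
        long userLimit = 100;     // user limit is 100% of the cluster
        long userConsumed = 95;   // reduces of the large application
        long otherAMs = 5;        // application masters of other users
        long queueUsed = userConsumed + otherAMs;

        // Prints 5: the application believes it can still start maps.
        System.out.println(userLimitHeadroom(userLimit, userConsumed));
        // Prints 0: nothing is actually free.
        System.out.println(queueAwareHeadroom(userLimit, userConsumed, queueCapacity, queueUsed));
    }
}
```

With a headroom of 0, the MRAppMaster would have the signal it needs to preempt a reduce and make room for the remaining maps.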

This can happen in other scenarios as well. Generally, in a large cluster with multiple queues, this shouldn't cause a permanent hang, but it could cause the application to take much longer.

Attachments

YARN-1857.7.patch, 07/Oct/14 19:07, 11 kB, Craig Welch
YARN-1857.6.patch, 06/Oct/14 23:41, 11 kB, Craig Welch
YARN-1857.5.patch, 06/Oct/14 19:27, 10 kB, Craig Welch
YARN-1857.4.patch, 03/Oct/14 22:57, 10 kB, Craig Welch
YARN-1857.3.patch, 17/Sep/14 04:33, 10 kB, Craig Welch
YARN-1857.2.patch, 12/Sep/14 22:39, 10 kB, Craig Welch
YARN-1857.1.patch, 14/Aug/14 18:46, 9 kB, Craig Welch
YARN-1857.patch, 08/Jul/14 18:29, 10 kB, Chen He
YARN-1857.patch, 05/May/14 19:18, 10 kB, Chen He
YARN-1857.patch, 02/May/14 18:07, 10 kB, Chen He

People

Assignee: Chen He

Reporter: Thomas Graves

Votes: 0

Watchers: 13

Dates

Created: 19/Mar/14 20:33

Updated: 01/Dec/14 03:09

Resolved: 07/Oct/14 20:49