
YARN-3415: Non-AM containers can be counted towards amResourceUsage of a Fair Scheduler queue

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.6.0
    • Fix Version/s: 2.8.0, 3.0.0-alpha1
    • Component/s: fairscheduler
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      We encountered this problem while running a Spark cluster. The amResourceUsage for a queue became artificially high, and the cluster then deadlocked because the maxAMShare constraint kicked in and no new AM was admitted to the cluster.

      I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289

      In summary: the condition for adding a container's memory towards amResourceUsage is fragile, because it depends on the number of live containers belonging to the app. We saw the Spark AM go down without explicitly releasing its requested containers, and the memory of one of those containers was then counted towards amResourceUsage.
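
      For reference, a minimal sketch of the fragile pre-fix check, reconstructed from the FSAppAttempt snippets quoted in the comments below (illustrative, not the verbatim code):

          // Pre-fix accounting in FSAppAttempt (reconstructed from snippets in
          // this thread; illustrative only). A newly allocated container is
          // treated as the AM container whenever it is the app's only live
          // container, so if the AM dies without releasing its other
          // containers, a later non-AM allocation gets counted towards the
          // queue's amResourceUsage.
          if (getLiveContainers().size() == 1 && !getUnmanagedAM()) {
            getQueue().addAMResourceUsage(container.getResource());
          }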

      cc - Sandy Ryza

      1. YARN-3415.000.patch
        7 kB
        zhihai xu
      2. YARN-3415.001.patch
        12 kB
        zhihai xu
      3. YARN-3415.002.patch
        12 kB
        zhihai xu

        Activity

        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2102 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2102/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #153 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/153/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2084 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2084/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #143 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/143/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk #886 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/886/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/CHANGES.txt
        Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #152 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/152/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/CHANGES.txt
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        zhihai xu added a comment -

        Thanks Rohit Agarwal for the valuable feedback and for filing this issue. Thanks Sandy Ryza for the valuable feedback and for committing the patch! Greatly appreciated.

        Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #7497 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7497/)
        YARN-3415. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue (Zhihai Xu via Sandy Ryza) (sandy: rev 6a6a59db7f1bfda47c3c14fb49676a7b22d2eb06)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java
        • hadoop-yarn-project/CHANGES.txt
        zhihai xu added a comment -

        Sandy Ryza, thanks for the review. The latest patch, YARN-3415.002.patch, is rebased on the latest code base and passed the Jenkins test. Let me know whether you have more comments on the patch.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708850/YARN-3415.002.patch
        against trunk revision 4d14816.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7196//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7196//console

        This message is automatically generated.

        Rohit Agarwal added a comment -

        +1

        zhihai xu added a comment -

        Rohit Agarwal, thanks for the review. I uploaded a new patch, YARN-3415.002.patch, which addresses your comment.

        Rohit Agarwal added a comment -

        It looks good.

        I have one minor comment:

        +    // non-AM container should be allocated
        +    // check non-AM container allocation is not rejected
        +    // due to queue MaxAMShare limitation.
        +    assertEquals("Application5's AM should have 1 container",
        +        1, app5.getLiveContainers().size());
        

        The message in the assertEquals here should be "Application5 should have 1 container", because the AM has expired at this point.

        Sandy Ryza added a comment -

        Rohit Agarwal did you have any more comments before I commit this?

        Sandy Ryza added a comment -

        +1

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708400/YARN-3415.001.patch
        against trunk revision b5a22e9.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7170//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7170//console

        This message is automatically generated.

        zhihai xu added a comment -

        Sandy Ryza, that is a very good idea to move the call to setAMResource that's currently in FairScheduler next to the call to getQueue().addAMResourceUsage().
        The new patch YARN-3415.001.patch addresses this issue and also addresses your first two comments.

        Rohit Agarwal, thanks for the review.
        First, I want to clarify that the AM resource usage won't change when the AM container completes; it only changes when the application attempt is removed from the scheduler, which calls FSLeafQueue#removeApp.
        So currently "Check that AM resource usage becomes 0" is done after all application attempts are removed.

            assertEquals("Queue1's AM resource usage should be 0",
                0, queue1.getAmResourceUsage().getMemory());
        

        Add a non-AM container to app5. Handle the nodeUpdate event - check that the number of live containers is 2.

        The old code already had this test for app1; that test passes without the patch.

            // Still can run non-AM container
            createSchedulingRequestExistingApplication(1024, 1, attId1);
            scheduler.update();
            scheduler.handle(updateEvent);
            assertEquals("Application1 should have two running containers",
                2, app1.getLiveContainers().size());
        

        I think the issue you describe arises because the non-AM container allocation is delayed until after the AM container has finished, which leaves 0 live containers.
        My test simulates "complete AM container before non-AM container is allocated"; the old code will increase the AM resource usage when the non-AM container is allocated. So without the patch, the test will fail.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708381/YARN-3415.001.patch
        against trunk revision b5a22e9.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7167//console

        This message is automatically generated.

        Rohit Agarwal added a comment -

        I don't understand the newly added tests:

        +    // request non-AM container for app5
        +    createSchedulingRequestExistingApplication(1024, 1, attId5);
        +    assertEquals("Application5's AM should have 1 container",
        +        1, app5.getLiveContainers().size());
        +    // complete AM container before non-AM container is allocated.
        +    // spark application hit this situation.
        +    RMContainer amContainer5 = (RMContainer)app5.getLiveContainers().toArray()[0];
        +    ContainerExpiredSchedulerEvent containerExpired =
        +        new ContainerExpiredSchedulerEvent(amContainer5.getContainerId());
        +    scheduler.handle(containerExpired);
        +    assertEquals("Application5's AM should have 0 container",
        +        0, app5.getLiveContainers().size());
        +    assertEquals("Queue1's AM resource usage should be 2048 MB memory",
        +        2048, queue1.getAmResourceUsage().getMemory());
        +    scheduler.update();
        +    scheduler.handle(updateEvent);
        +    // non-AM container should be allocated
        +    // check non-AM container allocation is not rejected
        +    // due to queue MaxAMShare limitation.
        +    assertEquals("Application5's AM should have 1 container",
        +        1, app5.getLiveContainers().size());
        +    // check non-AM container allocation won't affect queue AmResourceUsage
        +    assertEquals("Queue1's AM resource usage should be 2048 MB memory",
        +        2048, queue1.getAmResourceUsage().getMemory());
        

        Just before this block, I can see that the AM for app5 is already running and is taking 2048 MB.
        So, in my opinion, the tests should be:

        • Add a non-AM container to app5. Handle the nodeUpdate event - check that the number of live containers is 2.
        • Kill the AM container. Handle the events. Check that the number of live containers becomes 1. Check that the AM resource usage becomes 0.
        Sandy Ryza added a comment -

        This looks mostly reasonable. A few comments:

        • In FSAppAttempt, can we change the "If this container is used to run AM" comment to "If not running unmanaged, the first container we allocate is always the AM. Update the leaf queue's AM usage"?
        • The four lines of comment in FSLeafQueue could be reduced to "If isAMRunning is true, we're not running an unmanaged AM."
        • Would it make sense to move the call to setAMResource that's currently in FairScheduler next to the call to getQueue().addAMResourceUsage() so that the queue and attempt resource usage get updated at the same time?
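
        A hypothetical sketch of what that consolidation might look like, using the method names quoted elsewhere in this thread (the committed patch may differ):

            // Hypothetical consolidation (not the verbatim patch): set the
            // attempt's AM resource and add the queue's AM usage in one place,
            // so the two pieces of bookkeeping cannot drift apart.
            if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) {
              setAMResource(container.getResource());                  // attempt-level
              getQueue().addAMResourceUsage(container.getResource());  // queue-level
              setAmRunning(true);
            }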
        zhihai xu added a comment -

        The test failures are due to HADOOP-11754:

        org.mortbay.jetty.webapp.WebAppContext@63eaebdd{/,jar:file:/home/jenkins/.m2/repository/org/apache/hadoop/hadoop-yarn-common/3.0.0-SNAPSHOT/hadoop-yarn-common-3.0.0-SNAPSHOT.jar!/webapps/cluster}
        javax.servlet.ServletException: java.lang.RuntimeException: Could not read signature secret file: /home/jenkins/hadoop-http-auth-signature-secret
        	at org.apache.hadoop.security.authentication.server.AuthenticationFilter.initializeSecretProvider(AuthenticationFilter.java:266)
        
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12708063/YARN-3415.000.patch
        against trunk revision 3d9132d.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 1 new or modified test files.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

        org.apache.hadoop.yarn.server.resourcemanager.TestRMHA
        org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService
        org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication
        org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
        org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore

        Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7143//testReport/
        Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7143//console

        This message is automatically generated.

        zhihai xu added a comment -

        I uploaded a patch YARN-3415.000.patch for review.
        The patch fixed two bugs and made 4 minor code optimizations.
        Bugs fixed:
        1. Check whether the AM is running before calling addAMResourceUsage.
        We should only call addAMResourceUsage when the AM is not already running.
        Without this fix, the test will fail because the queue's AmResourceUsage is changed by a non-AM container.
        2. Don't check non-AM containers against the queue MaxAMShare limitation.
        Without this fix, the test will fail because the non-AM container allocation is rejected due to the MaxAMShare limitation. (A sketch of both fixed checks follows at the end of this comment.)

        Code optimizations:
        1. Remove the redundant getLiveContainers().size() check when calling addAMResourceUsage in FSAppAttempt.
        2. Remove the redundant getLiveContainers().size() check when checking the queue MaxAMShare (canRunAppAM) in FSAppAttempt.
        3. Remove the redundant app.getAMResource() check in FSLeafQueue#removeApp.
        I didn't check app.getUnmanagedAM() here; instead I added a comment: amRunning is set to true only when getUnmanagedAM() is false.
        But checking app.getUnmanagedAM() is also fine with me.
        4. Check application.isAmRunning() instead of application.getLiveContainers().isEmpty() in FairScheduler#allocate,
        because application.getLiveContainers() consumes much more CPU than application.isAmRunning(),
        and FairScheduler#allocate is called very frequently.
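
        A minimal sketch of the two fixed checks described above. The first condition is quoted verbatim elsewhere in this thread; the second is an assumed shape of the fix, not the verbatim patch:

            // Bug 1 fix (FSAppAttempt): gate the AM accounting on isAmRunning()
            // instead of relying on the fragile live-container count alone.
            if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) {
              getQueue().addAMResourceUsage(container.getResource());
              setAmRunning(true);
            }

            // Bug 2 fix (FSAppAttempt#assignContainer, assumed shape): apply the
            // MaxAMShare check only while the AM container has yet to be
            // allocated, so non-AM containers are never rejected by it.
            if (!isAmRunning() && !getUnmanagedAM()) {
              if (!getQueue().canRunAppAM(getAMResource())) {
                return Resources.none();
              }
            }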

        zhihai xu added a comment -

        Rohit Agarwal, thanks for the comment.

        1. If the above approach is valid - why do we need the getLiveContainers() check at all?

        Totally agree. If we check !isAmRunning(), the getLiveContainers() check is redundant.

        2. I don't see any place where we are setting amRunning to false once it is set to true. Should we do that for completeness?

        We don't need to set it back to false, because each FSAppAttempt has only one AM; once the FSAppAttempt is removed, it will be garbage collected.

        3. Why is there no getUnmanagedAM() check in removeApp where we are subtracting from amResourceUsage. I think the conditions for adding and subtracting amResourceUsage should be similar as much as possible.

        Totally agree; it would be better to check getUnmanagedAM() for readability.
        It currently works because we check getUnmanagedAM() when we call setAMResource in FairScheduler#allocate, so if getUnmanagedAM() is true, app.getAMResource() will return Resources.none().
        We can also remove the app.getAMResource() != null check, because the following code guarantees it will never return null.

          private Resource _get(String label, ResourceType type) {
            try {
              readLock.lock();
              UsageByLabel usage = usages.get(label);
              if (null == usage) {
                return Resources.none();
              }
              return normalize(usage.resArr[type.idx]);
            } finally {
              readLock.unlock();
            }
          }
        

        About my previous comment:

        It looks like we should also check isAmRunning at FairScheduler#allocate

        Checking isAmRunning in FairScheduler#allocate is not necessary, because apart from the AM container, all other containers for an FSAppAttempt are requested by the AM; once the AM container is finished, FairScheduler#allocate will no longer be called.

        I will upload a patch with a test case for this issue.

        Sandy Ryza added a comment -

        Thanks for filing this, Rohit Agarwal, and for taking this up, zhihai xu. This seems like a fairly serious issue.

        Rohit Agarwal added a comment -

        > if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) {

        A few points:

        1. If the above approach is valid - why do we need the getLiveContainers() check at all?
        2. I don't see any place where we are setting amRunning to false once it is set to true. Should we do that for completeness?
        3. Why is there no getUnmanagedAM() check in removeApp, where we subtract from amResourceUsage? I think the conditions for adding and subtracting amResourceUsage should be as similar as possible.
        zhihai xu added a comment -

        It looks like we should also check isAmRunning in FairScheduler#allocate

           if (!application.getUnmanagedAM() && ask.size() == 1
                && application.getLiveContainers().isEmpty()) {
              application.setAMResource(ask.get(0).getCapability());
            }
        

        and FSAppAttempt#assignContainer

                if (getLiveContainers().size() == 0 && !getUnmanagedAM()) {
                  if (!getQueue().canRunAppAM(getAMResource())) {
                    return Resources.none();
                  }
                }
        
        zhihai xu added a comment -

        I can work on this issue. I read the problem description at https://github.com/apache/spark/pull/5233#issuecomment-87160289.
        It looks like the issue can be fixed by checking whether the AM is running before calling addAMResourceUsage.
        We should only call addAMResourceUsage when the AM is not running.

              if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM()) {
                getQueue().addAMResourceUsage(container.getResource());
                setAmRunning(true);
              }
        

          People

          • Assignee:
            zhihai xu
            Reporter:
            Rohit Agarwal
          • Votes:
            0
            Watchers:
            9

            Dates

            • Created:
              Updated:
              Resolved:
