[YARN-3809] Failed to launch new attempts because ApplicationMasterLauncher's threads all hang - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.8.0, 2.7.1, 3.0.0-alpha1
Component/s: resourcemanager
Labels: None
Hadoop Flags: Reviewed

Description

ApplicationMasterLauncher creates a thread pool of size 10 to handle AMLauncherEventType events (LAUNCH and CLEANUP).

In our cluster, there were many NMs with 10+ AMs running on them, and one of them shut down for some reason. After the RM found the NM LOST, it cleaned up the AMs running on it, so ApplicationMasterLauncher needed to handle these 10+ CLEANUP events. ApplicationMasterLauncher's thread pool filled up, and all of its threads hung in containerMgrProxy.stopContainers(stopRequest) because the NM was down and the default RPC timeout is 15 minutes. This means that for 15 minutes ApplicationMasterLauncher could not handle new events such as LAUNCH, so new attempts failed to launch because of timeouts.
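
The starvation described above can be reproduced in isolation with a small sketch. This is not the ResourceManager's code: the class name LauncherPoolStarvation and the method launchRunsWhileCleanupsHang are hypothetical, and a CountDownLatch stands in for the stopContainers call that blocks until the RPC times out. It only illustrates that once every thread of a fixed-size pool is blocked, a newly submitted task sits in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; names are not taken from Hadoop.
public class LauncherPoolStarvation {

    // Returns true if a "LAUNCH" task gets to run within waitMillis while
    // hungCleanups "CLEANUP" tasks are blocked in a pool of poolSize threads.
    static boolean launchRunsWhileCleanupsHang(int poolSize, int hungCleanups, long waitMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        // Stand-in for the dead NM: cleanup tasks block until this is released.
        CountDownLatch nmResponds = new CountDownLatch(1);
        try {
            for (int i = 0; i < hungCleanups; i++) {
                pool.execute(() -> {
                    try {
                        nmResponds.await(); // simulates the blocked stopContainers RPC
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The LAUNCH event arrives after the cleanups were submitted.
            Future<?> launch = pool.submit(() -> { });
            try {
                launch.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // still queued behind the hung cleanups
            }
        } finally {
            nmResponds.countDown(); // release the blocked threads
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("10 hung cleanups, launch ran: "
                + launchRunsWhileCleanupsHang(10, 10, 200));
        System.out.println("9 hung cleanups, launch ran: "
                + launchRunsWhileCleanupsHang(10, 9, 2000));
    }
}
```

With 10 hung cleanups in a pool of 10, the launch task does not run before the wait expires; with 9, one thread remains free and the launch task runs. In the reported case the wait would last until the RPC timeout rather than a few hundred milliseconds.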

Attachments

- YARN-3809.03.patch, 20/Jun/15 06:04, 6 kB, Jun Gong
- YARN-3809.02.patch, 19/Jun/15 14:08, 6 kB, Jun Gong
- YARN-3809.01.patch, 16/Jun/15 14:50, 3 kB, Jun Gong

Issue Links

is related to: MAPREDUCE-6659 Mapreduce App master waits long to kill containers on lost nodes. (Open)

People

Assignee: Jun Gong

Reporter: Jun Gong

Votes: 0

Watchers: 16

Dates

Created: 16/Jun/15 04:49

Updated: 06/Jan/17 01:41

Resolved: 24/Jun/15 16:29