  Hadoop YARN / YARN-270

RM scheduler event handler thread gets behind

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 0.23.5
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

      Description

      We had a couple of incidents on a 2800-node cluster where the RM scheduler event handler thread got behind processing events and basically became unusable. It was still processing apps, but taking a long time (1 hr 45 minutes) to accept new apps. This actually happened twice within 5 days.

      We are using the capacity scheduler, and at the time we had between 400 and 500 applications running. There were another 250 apps in the SUBMITTED state in the RM that the scheduler hadn't yet processed to put into the pending state. We had about 15 queues, none of them hierarchical. We also had plenty of space left on the cluster.

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          tgraves Thomas Graves added a comment -

          I was able to reproduce this on a smaller cluster by simulating 2800 nodes - we had 720 node managers and set the heartbeat interval to 250 ms (instead of 1 s). At about 400 applications the scheduler queue starts to grow. I am still in the process of investigating what exactly is taking the time.

          Note that for now we believe we have worked around it by increasing the nodemanager heartbeat interval from 1 second to 3 seconds.
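
          For reference, a minimal sketch of that workaround in code form. It assumes the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms property (the name used by YarnConfiguration in later releases) is the knob being changed for this deployment; in practice the value would be set in yarn-site.xml on the RM rather than programmatically.

{code:java}
// Sketch of the workaround described above: raise the NM heartbeat interval
// from the 1 s default to 3 s. The property name is an assumption for this
// release; operationally it belongs in yarn-site.xml on the ResourceManager.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HeartbeatIntervalWorkaround {
  public static Configuration threeSecondHeartbeat() {
    Configuration conf = new YarnConfiguration();
    conf.setLong("yarn.resourcemanager.nodemanagers.heartbeat-interval-ms", 3000L);
    return conf;
  }
}
{code}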

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Thanks for filing this, Thomas. IIRC, the event-handler's upper limit is about 0.6 million events; somehow we only focused on the number of nodes and never thought about the scaling issue with a large number of applications. There are multiple solutions for this, in order of importance (the dispatcher pattern in question is sketched after this list):

          • Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not.
          • Figure out any obvious bottlenecks in the scheduling code.
          • When all else fails, try to parallelize the scheduler dispatcher.
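
          For readers less familiar with the component being discussed, below is a minimal sketch (not the actual org.apache.hadoop.yarn.event.AsyncDispatcher source) of the single-consumer, unbounded-queue dispatcher pattern in question: producers enqueue without limit and a single thread drains, so the backlog grows whenever events arrive faster than the handler can process them. The three items above attack, respectively, the producers, the per-event cost, and the single-threadedness.

{code:java}
// Illustrative sketch only: a single-consumer dispatcher with an unbounded
// queue. Any sustained imbalance between the event arrival rate and the
// handler's processing rate makes the queue grow without bound.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class SingleThreadDispatcher<E> {
  private final BlockingQueue<E> queue = new LinkedBlockingQueue<>(); // unbounded
  private final Thread worker;

  SingleThreadDispatcher(Consumer<E> handler) {
    this.worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          handler.accept(queue.take()); // one event at a time, in arrival order
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // stop draining on interrupt
      }
    }, "event-handler");
    this.worker.setDaemon(true);
    this.worker.start();
  }

  void dispatch(E event) {
    queue.add(event); // never blocks the producer, so the backlog can pile up
  }
}
{code}
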
          vinodkv Vinod Kumar Vavilapalli added a comment -

          Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not.

          Filed YARN-275

          nroberts Nathan Roberts added a comment -

          Could we also add some additional flow control within the RM to prevent this work from getting into the event queues in the first place? Having the clients throttle on their end is important in the short term but in the long run we need a flow control strategy that can exert back pressure at all stages of the pipeline.
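
          One hedged reading of this suggestion, assuming event producers could tolerate blocking: bound the queue so that enqueueing itself slows producers down once the handler falls behind. A purely illustrative sketch, not a proposed patch to the dispatcher:

{code:java}
// Illustrative only: the simplest form of back pressure is a bounded queue
// whose put() blocks the producer once the dispatcher is saturated. Blocking
// RPC handler threads this way has its own risks, hence the discussion below.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedDispatchQueue<E> {
  private final BlockingQueue<E> queue;

  BoundedDispatchQueue(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  void dispatch(E event) throws InterruptedException {
    queue.put(event); // blocks the caller once the backlog reaches capacity
  }

  E next() throws InterruptedException {
    return queue.take(); // drained by the single handler thread
  }
}
{code}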

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Nathan, unfortunately, the dispatcher framework cannot exert back pressure in general; each event producer needs to control itself.

          OTOH, YARN-275 is indeed a long-term fix: NMs back off just like the TTs do in 1.*.

          revans2 Robert Joseph Evans added a comment -

          It cannot exert back pressure currently, but I don't see any reason to think that it could not be added in the future. Something as simple as setting a high-water mark on the number of pending events and throttling events from incoming connections until the congestion subsides might be enough.

          We have seen a similar issue in the IPC layer on the AM when too many reducers were trying to download the mapper locations. Granted, this is not the same code, but it was caused by asynchronously handling events and buffering up the data, so when we got behind we eventually got OOMs. I think we will continue to see more issues as we scale up until we solve it generally; otherwise every single client API call will eventually have to be updated to avoid overloading the system.
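
          A hedged sketch of the high-water-mark idea from the first paragraph, with hypothetical names throughout: a small gate in front of the event queue stops admitting client-facing work once the backlog crosses a high threshold and reopens only after it drains below a low threshold, so internal events keep flowing while external load is shed.

{code:java}
// Hypothetical illustration of high/low water mark throttling; not an actual
// RM change. Client-facing producers call tryAdmit() before enqueueing, and
// the dispatcher thread calls onHandled() after each event is processed.
import java.util.concurrent.atomic.AtomicInteger;

class WaterMarkThrottle {
  private final int highWaterMark;
  private final int lowWaterMark;
  private final AtomicInteger pending = new AtomicInteger();
  private volatile boolean throttling = false;

  WaterMarkThrottle(int highWaterMark, int lowWaterMark) {
    this.highWaterMark = highWaterMark;
    this.lowWaterMark = lowWaterMark;
  }

  /** Returns false if the request should be deferred or rejected because the
   *  dispatcher is congested. */
  boolean tryAdmit() {
    int size = pending.get();
    if (throttling) {
      if (size > lowWaterMark) {
        return false;            // still congested, keep shedding load
      }
      throttling = false;        // drained below the low mark, reopen the gate
    } else if (size > highWaterMark) {
      throttling = true;         // congestion detected, start shedding load
      return false;
    }
    pending.incrementAndGet();
    return true;
  }

  /** Called by the dispatcher thread after an event has been handled. */
  void onHandled() {
    pending.decrementAndGet();
  }
}
{code}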

          vinodkv Vinod Kumar Vavilapalli added a comment -

          I don't see it yet, but a comprehensive proposal can clarify what you have in mind, I suppose. Please file a sub-ticket and propose what you think. Tx.

          xgong Xuan Gong added a comment -

          Breaking the sub-task YARN-275 ("Make NodeManagers NOT blindly heartbeat irrespective of whether the previous heartbeat is processed or not") into smaller tasks:
          1. Make the RM provide the heartbeat interval to the NM (a sketch follows this list).
          2. RM changes to handle NM heartbeats during overload.
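
          A hedged sketch of item 1, using hypothetical type and method names rather than the real YARN protocol records: the RM returns the heartbeat interval it currently wants in every heartbeat response, and the NM honors it on the next round, which would let the RM stretch the interval while it is overloaded.

{code:java}
// Hypothetical illustration of item 1 above; class, interface and method
// names are made up for the sketch and are not the actual YARN records.
class HeartbeatResponse {
  private final long nextHeartbeatIntervalMs;

  HeartbeatResponse(long nextHeartbeatIntervalMs) {
    this.nextHeartbeatIntervalMs = nextHeartbeatIntervalMs;
  }

  long getNextHeartbeatIntervalMs() {
    return nextHeartbeatIntervalMs;
  }
}

interface HeartbeatService {
  // In the real protocol the request carries full node status; simplified here.
  HeartbeatResponse nodeHeartbeat(String nodeId);
}

class NodeHeartbeatLoop implements Runnable {
  private static final long DEFAULT_INTERVAL_MS = 1000L;
  private final HeartbeatService rm;
  private final String nodeId;

  NodeHeartbeatLoop(HeartbeatService rm, String nodeId) {
    this.rm = rm;
    this.nodeId = nodeId;
  }

  @Override
  public void run() {
    long interval = DEFAULT_INTERVAL_MS;
    try {
      while (!Thread.currentThread().isInterrupted()) {
        HeartbeatResponse response = rm.nodeHeartbeat(nodeId);
        // Honor whatever interval the RM asked for on the next round.
        interval = Math.max(DEFAULT_INTERVAL_MS, response.getNextHeartbeatIntervalMs());
        Thread.sleep(interval);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}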

          sharadag Sharad Agarwal added a comment -

          When all else fails, try to parallelize the scheduler dispatcher

          Long term, I think this should be the solution. We need ordering of events only for a given event type, so this should be doable and will give the next level of scalability for both the AM and the RM.
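
          A minimal sketch of that idea with made-up names (this is not the actual YARN dispatcher): route each event type to its own single-threaded executor, so events of one type stay ordered while different types are handled in parallel. The per-type executors could also be backed by a shared, bounded pool; a map of single-threaded executors is just the simplest way to show the ordering property.

{code:java}
// Illustrative only: ordering is preserved within an event type because each
// type is pinned to one single-threaded executor, while different types can
// be processed concurrently.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class PerTypeDispatcher<T, E> {
  private final ConcurrentMap<T, ExecutorService> executors = new ConcurrentHashMap<>();
  private final Consumer<E> handler;

  PerTypeDispatcher(Consumer<E> handler) {
    this.handler = handler;
  }

  void dispatch(T eventType, E event) {
    executors
        .computeIfAbsent(eventType, t -> Executors.newSingleThreadExecutor())
        .execute(() -> handler.accept(event));
  }

  void shutdown() {
    executors.values().forEach(ExecutorService::shutdown);
  }
}
{code}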

          revans2 Robert Joseph Evans added a comment -

          I agree that part of the fix needs to be making the scheduler parallel, but we also need a general way to apply back pressure; otherwise there will always be a way to accidentally bring down the system with a DOS. We recently saw what appears to be a very similar issue show up on an MRAppMaster. We still don't understand exactly what triggered it, but a job that would typically take 5 to 10 mins to complete was still running 17 hours later because the queue filled up, which caused the JVM to start garbage collecting like crazy, which in turn made it unable to process all of the events coming in, which made the queue fill up even more. We plan to address this in the short term by making the JVM OOM much sooner than the default, but that is still just a band-aid on the underlying problem: unless there is back pressure, there is always the possibility for incoming requests to overwhelm the system.

          tgraves Thomas Graves added a comment -

          Changing this to not be a blocker since we have worked around it and some of the subtasks are complete.


            People

            • Assignee: tgraves Thomas Graves
            • Reporter: tgraves Thomas Graves
            • Votes: 0
            • Watchers: 21
