Hadoop YARN / YARN-4741

RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.6.0
    • Fix Version/s: None
    • Component/s: resourcemanager
    • Labels: None

    Description

      We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue.

      In our setup, we had RM HA and stateful RM restart (RM recovery) disabled, but NM work-preserving restart enabled. Due to other issues, we did a cluster-wide NM restart.
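      For reference, this is roughly how that combination of settings reads through the standard YarnConfiguration keys; a minimal sketch, assuming stock property names (the surrounding class is illustrative only):

      import org.apache.hadoop.yarn.conf.YarnConfiguration;

      public class RestartSettingsCheck {
        public static void main(String[] args) {
          YarnConfiguration conf = new YarnConfiguration();

          // RM HA and RM stateful restart (recovery) were disabled in our setup.
          boolean rmHa = conf.getBoolean(
              YarnConfiguration.RM_HA_ENABLED,               // yarn.resourcemanager.ha.enabled
              YarnConfiguration.DEFAULT_RM_HA_ENABLED);
          boolean rmRecovery = conf.getBoolean(
              YarnConfiguration.RECOVERY_ENABLED,            // yarn.resourcemanager.recovery.enabled
              YarnConfiguration.DEFAULT_RM_RECOVERY_ENABLED);

          // NM work-preserving restart was enabled.
          boolean nmRecovery = conf.getBoolean(
              YarnConfiguration.NM_RECOVERY_ENABLED,         // yarn.nodemanager.recovery.enabled
              YarnConfiguration.DEFAULT_NM_RECOVERY_ENABLED);

          System.out.printf("RM HA: %b, RM recovery: %b, NM recovery: %b%n",
              rmHa, rmRecovery, nmRecovery);
        }
      }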

      Sometime during the restart (which took multiple hours), we started seeing the async dispatcher event queue build up. Normally the largest queue size the dispatcher logs is around 1,000; in this case it climbed all the way up to tens of millions of events.
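      That 1,000 figure comes from the dispatcher logging its queue size whenever it crosses a multiple of 1,000 events, which is how the backlog growth shows up in the RM log. A simplified, self-contained sketch of that enqueue-side pattern (not the actual org.apache.hadoop.yarn.event.AsyncDispatcher code):

      import java.util.concurrent.BlockingQueue;
      import java.util.concurrent.LinkedBlockingQueue;

      // Simplified stand-in for the dispatcher's enqueue path: the queue size
      // is reported whenever it crosses a multiple of 1,000.
      class QueueSizeLoggingDispatcher {
        private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();

        void enqueue(Object event) throws InterruptedException {
          int qSize = eventQueue.size();
          if (qSize != 0 && qSize % 1000 == 0) {
            System.err.println("Size of event-queue is " + qSize);
          }
          eventQueue.put(event);
        }
      }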

      When we looked at the RM log, it was full of the following messages:

      2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
      2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
      2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
      2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
      2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
      

      The node in question had been restarted a few minutes earlier.
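      Those two messages are emitted as a pair when the per-node state machine receives an event type that has no valid transition from the node's current state; the real handler is RMNodeImpl.handle(), which catches the state machine's invalid-transition exception. A minimal sketch of the pattern, with stand-in types rather than YARN's own:

      // Stand-in event type; the real one is RMNodeEventType.
      enum SimpleNodeEventType { STARTED, FINISHED_CONTAINERS_PULLED_BY_AM }

      class SimpleRMNode {
        private final String nodeId;

        SimpleRMNode(String nodeId) {
          this.nodeId = nodeId;
        }

        void handle(SimpleNodeEventType eventType) {
          try {
            doTransition(eventType);
          } catch (IllegalStateException e) {
            // Mirrors the paired ERROR lines in the RM log: the event is
            // logged and dropped rather than processed.
            System.err.println("Can't handle this event at current state");
            System.err.println("Invalid event " + eventType + " on Node  " + nodeId);
          }
        }

        private void doTransition(SimpleNodeEventType eventType) {
          // Stand-in for the state-machine lookup: assume the node's current
          // state has no transition registered for this event type.
          if (eventType == SimpleNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM) {
            throw new IllegalStateException("invalid transition");
          }
        }

        public static void main(String[] args) {
          new SimpleRMNode("worker-node-foo.bar.net:8041")
              .handle(SimpleNodeEventType.FINISHED_CONTAINERS_PULLED_BY_AM);
        }
      }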

      When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents.

      Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to help reduce the queue size, but the queue built back up to several million events and stayed there for an extended period. We had to restart the RM to resolve the problem.

      Attachments

        1. nm.log (4.15 MB, attached by Sangjin Lee)


            People

              Assignee: Unassigned
              Reporter: Sangjin Lee (sjlee0)
              Votes: 0
              Watchers: 22
