[YARN-6102] RMActiveService context to be updated with new RMContext on failover - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.8.0, 2.7.3
Fix Version/s: 2.9.0, 3.0.0-beta1
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in dispatcher thread
java.lang.Exception: No handler for registered for class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
        at java.lang.Thread.run(Thread.java:745)
2017-01-17 16:42:17,914 INFO  [AsyncDispatcher ShutDown handler] event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..

The same stack trace was also seen when TestResourceTrackerOnHA exits abnormally. After some analysis, I was able to reproduce it.

Once a node heartbeat reaches the RM, it is handled in org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest). If an RM failover is triggered before that method passes the event to the dispatcher through
this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent);
the dispatcher is reset. However, in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean) the new dispatcher is started first and the event handlers are registered only afterwards.

So the event order looks like this:
1. A node heartbeat is sent to ResourceTrackerService.
2. In ResourceTrackerService.nodeHeartbeat, before the event is passed to the dispatcher, an RM failover is triggered.
3. During the failover, the current active RM resets the dispatcher in reinitialize, i.e. ( resetDispatcher(); + createAndInitActiveServices(); )

Now, between resetDispatcher(); and createAndInitActiveServices();, ResourceTrackerService.nodeHeartbeat invokes the dispatcher.

This causes the above error: at the point when the STATUS_UPDATE event is handed to the dispatcher in ResourceTrackerService, the new dispatcher (created by the failover) may already be started but does not yet have its event handlers registered.
Using the same steps (pausing the JVM in a debugger), I was also able to reproduce this on a production cluster. It happens for the STATUS_UPDATE active service event when the service has yet to forward the event to the RM dispatcher, a failover is triggered, and the dispatcher reset is between resetDispatcher(); and createAndInitActiveServices();.
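
The window described above can be modelled with a small, self-contained sketch. It is only a toy model under stated assumptions: ToyDispatcher, ToyContext and NodeEventType are hypothetical stand-ins, not the real AsyncDispatcher, RMContext or RMNodeEventType, and the interleaving is replayed sequentially on one thread rather than by pausing a JVM. It assumes, as the report describes, that the reset publishes a started dispatcher before any handler is registered on it.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the failover window; none of these types are real YARN classes. */
public class FailoverRaceSketch {

    /** Hypothetical stand-in for an event type enum such as RMNodeEventType. */
    enum NodeEventType { STATUS_UPDATE }

    /** Hypothetical synchronous dispatcher keyed by the event type's enum class. */
    static class ToyDispatcher {
        private final Map<Class<?>, Runnable> handlers = new HashMap<>();
        private boolean started;

        void start() { started = true; }

        void register(Class<?> eventTypeClass, Runnable handler) {
            handlers.put(eventTypeClass, handler);
        }

        void dispatch(Enum<?> eventType) {
            if (!started) {
                throw new IllegalStateException("dispatcher not started");
            }
            Runnable handler = handlers.get(eventType.getDeclaringClass());
            if (handler == null) {
                throw new IllegalStateException(
                    "No handler registered for " + eventType.getDeclaringClass());
            }
            handler.run();
        }
    }

    /** Hypothetical stand-in for the shared context that callers read the dispatcher from. */
    static class ToyContext {
        volatile ToyDispatcher dispatcher;
    }

    /** Replays the interleaving and reports whether the heartbeat dispatch failed. */
    static boolean heartbeatFailsInsideWindow() {
        ToyContext ctx = new ToyContext();

        // Normal active state: handlers registered, dispatcher started.
        ctx.dispatcher = new ToyDispatcher();
        ctx.dispatcher.register(NodeEventType.class, () -> { });
        ctx.dispatcher.start();

        // Failover, first half (modelling resetDispatcher()):
        // a fresh, started dispatcher with no handlers is published.
        ToyDispatcher fresh = new ToyDispatcher();
        fresh.start();
        ctx.dispatcher = fresh;

        // The in-flight heartbeat now reads the dispatcher from the context.
        boolean failed = false;
        try {
            ctx.dispatcher.dispatch(NodeEventType.STATUS_UPDATE);
        } catch (IllegalStateException e) {
            failed = true;
        }

        // Failover, second half (modelling createAndInitActiveServices()):
        // handlers are registered, but too late for the heartbeat above.
        fresh.register(NodeEventType.class, () -> { });
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("heartbeat failed inside window: " + heartbeatFailsInsideWindow());
    }
}
```

In this model the heartbeat dispatch fails only because it lands between the two halves of the failover; the same dispatch succeeds once the handler is registered.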

Attachments

eventOrder.JPG, 17/Jan/17 10:49, 30 kB, Ajith S
YARN-6102.01.patch, 10/Jul/17 10:43, 23 kB, Rohith Sharma K S
YARN-6102.02.patch, 11/Jul/17 12:24, 43 kB, Rohith Sharma K S
YARN-6102.03.patch, 11/Jul/17 17:03, 44 kB, Rohith Sharma K S
YARN-6102.04.patch, 12/Jul/17 12:38, 47 kB, Rohith Sharma K S
YARN-6102.05.patch, 12/Jul/17 14:11, 48 kB, Rohith Sharma K S
YARN-6102.06.patch, 20/Jul/17 06:07, 57 kB, Rohith Sharma K S
YARN-6102.07.patch, 20/Jul/17 08:55, 57 kB, Rohith Sharma K S
YARN-6102-branch-2.001.patch, 24/Jul/17 06:48, 46 kB, Rohith Sharma K S
YARN-6102-branch-2.002.patch, 24/Jul/17 09:07, 46 kB, Rohith Sharma K S
YARN-6102-branch-2.002-addednum.patch, 26/Jul/17 09:26, 1.0 kB, Rohith Sharma K S
YARN-6102-branch-2.003-addendum.patch, 10/Nov/17 15:53, 9 kB, Subramaniam Krishnan
YARN-6102-YARN-5355-branch-2.addendum.patch, 01/Aug/17 19:04, 7 kB, Varun Saxena

Issue Links

is duplicated by: YARN-6847 [ATSv2] NPE in RM while starting timeline collector on recovery after explicit failover (Resolved)
is related to: YARN-2398 TestResourceTrackerOnHA crashes (Resolved)

People

Assignee: Rohith Sharma K S
Reporter: Ajith S
Votes: 0
Watchers: 15

Dates

Created: 17/Jan/17 09:36
Updated: 10/Nov/17 21:49
Resolved: 10/Nov/17 21:49