Details
Bug
Status: Resolved
Critical
Resolution: Fixed
2.8.0, 2.7.3
None
None
Reviewed
Description
2017-01-17 16:42:17,911 FATAL [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(200)) - Error in dispatcher thread
java.lang.Exception: No handler for registered for class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:196)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:120)
    at java.lang.Thread.run(Thread.java:745)
2017-01-17 16:42:17,914 INFO [AsyncDispatcher ShutDown handler] event.AsyncDispatcher (AsyncDispatcher.java:run(303)) - Exiting, bbye..
The same stack trace was also seen when TestResourceTrackerOnHA exits abnormally. After some analysis, I was able to reproduce it.
Once the nodeHeartBeat is sent to the RM, it is processed inside org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.nodeHeartbeat(NodeHeartbeatRequest). If an RM failover is triggered before the event is passed to the dispatcher through
this.rmContext.getDispatcher().getEventHandler().handle(nodeStatusEvent); then the dispatcher is reset.
However, the new dispatcher is started first, and only afterwards are the event handlers registered, in org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(boolean).
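To illustrate why "started first, registered later" is a problem, here is a minimal sketch. ToyDispatcher, StartBeforeRegisterSketch and NodeEventType are hypothetical stand-ins written for this example; they are not the real AsyncDispatcher or RMNodeEventType, and the error text only imitates the log above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Toy stand-in for a dispatcher (hypothetical, NOT the real
// org.apache.hadoop.yarn.event.AsyncDispatcher).
class ToyDispatcher {
    private final Map<Class<?>, Consumer<Enum<?>>> handlers = new HashMap<>();
    private boolean started;

    void start() { started = true; }
    boolean isStarted() { return started; }

    void register(Class<? extends Enum<?>> eventType, Consumer<Enum<?>> handler) {
        handlers.put(eventType, handler);
    }

    // Looks up the handler by the event's enum class; fails if none is registered.
    void dispatch(Enum<?> event) {
        Consumer<Enum<?>> handler = handlers.get(event.getDeclaringClass());
        if (handler == null) {
            throw new IllegalStateException(
                "No handler registered for " + event.getDeclaringClass().getSimpleName());
        }
        handler.accept(event);
    }
}

class StartBeforeRegisterSketch {
    // Hypothetical event type standing in for RMNodeEventType.
    enum NodeEventType { STATUS_UPDATE }

    static String deliver(ToyDispatcher d) {
        try {
            d.dispatch(NodeEventType.STATUS_UPDATE);
            return "handled";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    static String[] run() {
        ToyDispatcher d = new ToyDispatcher();
        d.start();                                  // started first ...
        String before = deliver(d);                 // ... an event arriving now has no handler
        d.register(NodeEventType.class, e -> { });  // ... handlers registered only afterwards
        String after = deliver(d);
        return new String[] { before, after };
    }

    public static void main(String[] args) {
        for (String s : run()) {
            System.out.println(s);
        }
    }
}
```

In this toy model the first delivery fails and the second succeeds, so any event that reaches the dispatcher between start() and register() is lost in the same way.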
So the sequence of events looks like this:
1. A node heartbeat is sent to ResourceTrackerService.
2. In ResourceTrackerService.nodeHeartbeat, before the event is passed to the dispatcher, an RM failover is triggered.
3. During the RM failover, the current active RM resets the dispatcher in reinitialize, i.e. resetDispatcher(); followed by createAndInitActiveServices();
Now, between resetDispatcher(); and createAndInitActiveServices();, ResourceTrackerService.nodeHeartbeat invokes the dispatcher.
This causes the above error: at the point when the STATUS_UPDATE event is handed to the dispatcher in ResourceTrackerService, the new dispatcher (from the failover) may be started but has not yet registered its event handlers.
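The interleaving above can be forced deterministically in a small model. This is only a sketch under stated assumptions: FailoverRaceSketch, MiniDispatcher and the AtomicReference used as a stand-in for rmContext are all hypothetical, and the latches play the role of pausing the JVM in a debugger.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

class FailoverRaceSketch {
    // Hypothetical event type standing in for RMNodeEventType.
    enum NodeEventType { STATUS_UPDATE }

    // Toy dispatcher (hypothetical, not the real AsyncDispatcher).
    static class MiniDispatcher {
        private final Map<Class<?>, Runnable> handlers = new ConcurrentHashMap<>();

        void register(Class<?> type, Runnable handler) { handlers.put(type, handler); }

        String dispatch(Enum<?> event) {
            Runnable handler = handlers.get(event.getDeclaringClass());
            if (handler == null) {
                return "No handler registered for " + event.getDeclaringClass().getSimpleName();
            }
            handler.run();
            return "handled";
        }
    }

    static String run() {
        // The old active dispatcher has its handler registered.
        MiniDispatcher old = new MiniDispatcher();
        old.register(NodeEventType.class, () -> { });
        // Stand-in for rmContext.getDispatcher().
        AtomicReference<MiniDispatcher> context = new AtomicReference<>(old);

        CountDownLatch dispatcherReset = new CountDownLatch(1);
        CountDownLatch heartbeatDone = new CountDownLatch(1);
        AtomicReference<String> outcome = new AtomicReference<>();

        // Steps 1 and 2: the heartbeat is in flight but has not yet reached the dispatcher.
        Thread heartbeat = new Thread(() -> {
            try {
                dispatcherReset.await();
                outcome.set(context.get().dispatch(NodeEventType.STATUS_UPDATE));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                heartbeatDone.countDown();
            }
        });
        heartbeat.start();

        try {
            // Step 3: failover. First the dispatcher is replaced (models resetDispatcher()) ...
            MiniDispatcher fresh = new MiniDispatcher();
            context.set(fresh);
            dispatcherReset.countDown();
            heartbeatDone.await();  // the heartbeat lands inside the window
            // ... and only then are handlers registered (models createAndInitActiveServices()).
            fresh.register(NodeEventType.class, () -> { });
            heartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the heartbeat always picks up the fresh dispatcher before its handler is registered, so run() reports the missing handler rather than "handled".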
Using the same steps (pausing the JVM in a debugger), I was able to reproduce this on a production cluster as well. It happens for the STATUS_UPDATE active-service event when the service has yet to forward the event to the RM dispatcher, a failover is triggered, and the dispatcher is used between resetDispatcher(); and createAndInitActiveServices();