Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
2.3.0, 2.5.0, 2.7.1
-
None
-
Reviewed
Description
Under some circumstances the node is removed before an expired container event is processed causing the RM to exit:
2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1436927988321_1307950_01_000012 Container Transitioned from ACQUIRED to EXPIRED 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: Completed container: container_1436927988321_1307950_01_000012 in state: EXPIRED event:EXPIRE 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1436927988321_1307950 CONTAINERID=container_1436927988321_1307950_01_000012 2015-10-04 21:14:01,063 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_EXPIRED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) at java.lang.Thread.run(Thread.java:745) 2015-10-04 21:14:01,063 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 and 2.6.0 by different customers.