Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
-
Reviewed
Description
We observed a race condition inside YARN with the following stack trace:
18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher java.util.NoSuchElementException at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036) at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132) at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66) at java.lang.Thread.run(Thread.java:748)
This is basically the same as the one described in YARN-7382, but the root cause is different.
When we create an application attempt, we create an FSAppAttempt object. This contains an AppSchedulingInfo which contains a set of SchedulerRequestKey. Initially, this set is empty and only initialized a bit later on a separate thread during a state transition:
2019-05-07 15:58:02,659 INFO [RM StateStore dispatcher] recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for app: application_1557237478804_0001 2019-05-07 15:58:02,684 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED 2019-05-07 15:58:02,690 INFO [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted application application_1557237478804_0001 from user: bacskop, in queue: root.bacskop, currently num of applications: 1 2019-05-07 15:58:02,698 INFO [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED 2019-05-07 15:58:02,731 INFO [RM Event dispatcher] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app attempt : appattempt_1557237478804_0001_000001 2019-05-07 15:58:02,732 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from NEW to SUBMITTED on event = START 2019-05-07 15:58:02,746 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of SchedulerApplicationAttempt 2019-05-07 15:58:02,747 INFO [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of appSchedulingInfo: [] 2019-05-07 15:58:02,752 INFO [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added Application Attempt appattempt_1557237478804_0001_000001 to scheduler from user: bacskop 2019-05-07 15:58:02,756 INFO [RM Event dispatcher] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler key: SchedulerRequestKey{priority=0, allocationRequestId=-1, containerToUpdate=null} for attempt: appattempt_1557237478804_0001_000001 2019-05-07 15:58:02,759 INFO [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED 2019-05-07 15:58:02,892 INFO [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(310)) - Submitted application application_1557237478804_0001
(some extra lines are printed with ***).
So at 15:58:02,747 the set is empty and populated with a single element at 15:58:02,756 on "RM Event dispatcher". This means there's a tiny time window during which a NODE_UPDATE can cause a NoSuchElementException.
Attachments
Attachments
Issue Links
- is related to
-
YARN-7382 NoSuchElementException in FairScheduler after failover causes RM crash
- Resolved