Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-9552

FairScheduler: NODE_UPDATE can cause NoSuchElementException

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 3.3.0, 3.2.2, 3.1.4
    • fairscheduler
    • None
    • Reviewed

    Description

      We observed a race condition inside YARN with the following stack trace:

      18/11/07 06:45:09.559 SchedulerEventDispatcher:Event Processor ERROR EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
      java.util.NoSuchElementException
              at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
              at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:373)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:941)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1373)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:353)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:204)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1094)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:961)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1183)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:132)
              at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
              at java.lang.Thread.run(Thread.java:748)
      

      This is basically the same as the one described in YARN-7382, but the root cause is different.

      When we create an application attempt, we create an FSAppAttempt object. This contains an AppSchedulingInfo which contains a set of SchedulerRequestKey. Initially, this set is empty and only initialized a bit later on a separate thread during a state transition:

      2019-05-07 15:58:02,659 INFO  [RM StateStore dispatcher] recovery.RMStateStore (RMStateStore.java:transition(239)) - Storing info for app: application_1557237478804_0001
      2019-05-07 15:58:02,684 INFO  [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from NEW_SAVING to SUBMITTED on event = APP_NEW_SAVED
      2019-05-07 15:58:02,690 INFO  [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplication(490)) - Accepted application application_1557237478804_0001 from user: bacskop, in queue: root.bacskop, currently num of applications: 1
      2019-05-07 15:58:02,698 INFO  [RM Event dispatcher] rmapp.RMAppImpl (RMAppImpl.java:handle(903)) - application_1557237478804_0001 State change from SUBMITTED to ACCEPTED on event = APP_ACCEPTED
      2019-05-07 15:58:02,731 INFO  [RM Event dispatcher] resourcemanager.ApplicationMasterService (ApplicationMasterService.java:registerAppAttempt(434)) - Registering app attempt : appattempt_1557237478804_0001_000001
      2019-05-07 15:58:02,732 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from NEW to SUBMITTED on event = START
      2019-05-07 15:58:02,746 INFO  [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(207)) - *** In the constructor of SchedulerApplicationAttempt
      2019-05-07 15:58:02,747 INFO  [SchedulerEventDispatcher:Event Processor] scheduler.SchedulerApplicationAttempt (SchedulerApplicationAttempt.java:<init>(230)) - *** Contents of appSchedulingInfo: []
      2019-05-07 15:58:02,752 INFO  [SchedulerEventDispatcher:Event Processor] fair.FairScheduler (FairScheduler.java:addApplicationAttempt(546)) - Added Application Attempt appattempt_1557237478804_0001_000001 to scheduler from user: bacskop
      2019-05-07 15:58:02,756 INFO  [RM Event dispatcher] scheduler.AppSchedulingInfo (AppSchedulingInfo.java:updatePendingResources(257)) - *** Adding scheduler key: SchedulerRequestKey{priority=0, allocationRequestId=-1, containerToUpdate=null}  for attempt: appattempt_1557237478804_0001_000001
      2019-05-07 15:58:02,759 INFO  [RM Event dispatcher] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(920)) - appattempt_1557237478804_0001_000001 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED
      2019-05-07 15:58:02,892 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(310)) - Submitted application application_1557237478804_0001
      

      (some extra lines are printed with ***).

      So at 15:58:02,747 the set is empty and populated with a single element at 15:58:02,756 on "RM Event dispatcher". This means there's a tiny time window during which a NODE_UPDATE can cause a NoSuchElementException.

      Attachments

        1. YARN-9552-001.patch
          3 kB
          Peter Bacsko
        2. YARN-9552-002.patch
          6 kB
          Peter Bacsko
        3. YARN-9552-003.patch
          6 kB
          Peter Bacsko
        4. YARN-9552-004.patch
          6 kB
          Peter Bacsko
        5. YARN-9552-branch-3.2.001.patch
          6 kB
          Peter Bacsko
        6. YARN-9552-branch-3.1.001.patch
          6 kB
          Peter Bacsko
        7. YARN-9552-branch-3.2.002.patch
          6 kB
          Peter Bacsko
        8. YARN-9552-branch-3.1.002.patch
          6 kB
          Peter Bacsko
        9. YARN-9552-branch-3.2.003.patch
          6 kB
          Peter Bacsko

        Issue Links

          Activity

            People

              pbacsko Peter Bacsko
              pbacsko Peter Bacsko
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: