  1. Hadoop YARN
  2. YARN-7382

NoSuchElementException in FairScheduler after failover causes RM crash

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.9.0, 3.0.0
    • Fix Version/s: 2.9.0, 3.0.0
    • Component/s: fairscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      If an RM failover occurs while an MR job (e.g. sleep) is running, then once the maps reach 100%, the now-active RM crashes with:

      2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1508361403235_0001_01_000002 Container Transitioned from RUNNING to COMPLETED
      2017-10-18 15:02:05,347 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest  OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1508361403235_0001    CONTAINERID=container_1508361403235_0001_01_000002      RESOURCE=<memory:1024, vCores:1>
      2017-10-18 15:02:05,349 FATAL org.apache.hadoop.yarn.event.EventDispatcher: Error in handling event type NODE_UPDATE to the Event Dispatcher
      java.util.NoSuchElementException
              at java.util.concurrent.ConcurrentSkipListMap.firstKey(ConcurrentSkipListMap.java:2036)
              at java.util.concurrent.ConcurrentSkipListSet.first(ConcurrentSkipListSet.java:396)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.getNextPendingAsk(AppSchedulingInfo.java:371)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.isOverAMShareLimit(FSAppAttempt.java:901)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.assignContainer(FSAppAttempt.java:1326)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:371)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:221)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1019)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:887)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1104)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:128)
              at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
              at java.lang.Thread.run(Thread.java:748)
      2017-10-18 15:02:05,360 INFO org.apache.hadoop.yarn.event.EventDispatcher: Exiting, bbye..
      

      This leaves the cluster with no RMs!
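
      The top two frames of the stack trace are plain JDK behavior: ConcurrentSkipListSet.first() throws NoSuchElementException when the set is empty. The standalone sketch below reproduces only that behavior and shows one possible defensive pattern. It is not the attached patch and does not use any YARN classes; the class name EmptySkipListDemo and the helper firstOrNull are hypothetical names introduced for illustration.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical demo class; not part of Hadoop.
class EmptySkipListDemo {

    // Hypothetical helper: returns the smallest element, or null if the set is empty.
    // It catches the exception instead of calling isEmpty() first, because another
    // thread could remove the last element between the check and the call.
    static <T> T firstOrNull(ConcurrentSkipListSet<T> keys) {
        try {
            return keys.first();
        } catch (NoSuchElementException e) {
            return null;
        }
    }

    // Returns true if first() on an empty set throws NoSuchElementException,
    // which is the exception seen at the top of the stack trace.
    static boolean firstThrowsOnEmpty() {
        try {
            new ConcurrentSkipListSet<Integer>().first();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("first() throws on empty set: " + firstThrowsOnEmpty());

        ConcurrentSkipListSet<Integer> keys = new ConcurrentSkipListSet<>();
        System.out.println("firstOrNull on empty set: " + firstOrNull(keys));

        keys.add(7);
        keys.add(3);
        System.out.println("firstOrNull after adds: " + firstOrNull(keys));
    }
}
```

      Whether a caller should tolerate an empty set or avoid reaching this code path at all depends on the scheduler's state after recovery; the sketch only illustrates the failure mode.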

      Attachments

        1. YARN-7382.001.patch
          6 kB
          Robert Kanter

            People

              Assignee: Robert Kanter (rkanter)
              Reporter: Robert Kanter (rkanter)