  1. Hadoop YARN
  2. YARN-2273

NPE in ContinuousScheduling thread when we lose a node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.1
    • Fix Version/s: 2.6.0
    • Labels:
      None
    • Environment:

      cdh5.0.2 wheezy

    • Hadoop Flags:
      Reviewed

      Description

      One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this:

      2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_000001 released container container_1404858438119_4352_01_000004 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=<memory:8192, vCores:8> used=<memory:0, vCores:0> with event: KILL
      2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: <memory:335872, vCores:328>
      2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception.
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
      	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
      	at java.util.TimSort.sort(TimSort.java:203)
      	at java.util.TimSort.sort(TimSort.java:173)
      	at java.util.Arrays.sort(Arrays.java:659)
      	at java.util.Collections.sort(Collections.java:217)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
      	at java.lang.Thread.run(Thread.java:744)
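
The trace above ends inside a comparator called from `Collections.sort`, right after the log reports the node being removed. One reading, which the trace suggests but does not show, is that the comparator looks up each node id in a shared map and dereferences the result, so an id snapshotted before the removal resolves to `null` during the sort. The sketch below replays that ordering with hypothetical names (`NodeSortSketch`, a `Map<String, Integer>` of available memory per node); it is not the FairScheduler code and not the attached patch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: node id -> available memory in MB.
final class NodeSortSketch {
    // Unsafe: dereferences each lookup, so a node removed after the id
    // snapshot makes nodes.get(...) return null and the comparison throws NPE.
    static Comparator<String> unsafeByAvailable(Map<String, Integer> nodes) {
        return (a, b) -> nodes.get(b).compareTo(nodes.get(a));
    }

    // Null-tolerant: a missing node sorts last; the rest sort by available
    // memory, largest first.
    static Comparator<String> nullSafeByAvailable(Map<String, Integer> nodes) {
        return (a, b) -> {
            Integer ra = nodes.get(a);
            Integer rb = nodes.get(b);
            if (ra == null && rb == null) return 0;
            if (ra == null) return 1;
            if (rb == null) return -1;
            return rb.compareTo(ra);
        };
    }

    // Snapshots the ids, then removes one node before sorting, replaying
    // "node lost between snapshot and sort" deterministically.
    static List<String> sortAfterLosing(Map<String, Integer> nodes, String lost,
                                        Comparator<String> cmp) {
        List<String> ids = new ArrayList<>(nodes.keySet());
        nodes.remove(lost);
        ids.sort(cmp);
        return ids;
    }
}
```

A null check alone does not keep the ordering consistent if the map keeps changing while the sort runs; sorting under the same lock that guards node removal, or sorting a snapshot of the values, would be needed for that.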
      

      A few cycles later, YARN was crippled: the RM was running and jobs could be submitted, but no containers were assigned and no progress was made. Restarting the RM resolved it.
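
The stall described above would be consistent with the uncaught exception ending the `ContinuousScheduling` thread, although the report does not confirm that this is what happened. As a minimal sketch under that assumption, the hypothetical `SchedulingLoopSketch` below runs a loop in a thread whose second pass throws, and counts how many passes are attempted with and without a per-pass `try`/`catch`.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class SchedulingLoopSketch {
    // Runs `rounds` passes in a background thread and returns how many were
    // attempted. The second pass always throws, as if a node had vanished.
    static int passesAttempted(boolean guarded, int rounds) {
        AtomicInteger attempts = new AtomicInteger();
        Runnable pass = () -> {
            if (attempts.incrementAndGet() == 2) {
                throw new NullPointerException("node removed");
            }
        };
        Thread t = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                if (guarded) {
                    try {
                        pass.run();
                    } catch (RuntimeException e) {
                        // A real loop would log here, then move to the next pass.
                    }
                } else {
                    pass.run(); // an exception here ends the thread
                }
            }
        }, "ContinuousSchedulingSketch");
        // Swallow the uncaught exception so the demo prints no stack trace.
        t.setUncaughtExceptionHandler((thread, error) -> { });
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return attempts.get();
    }
}
```

With five rounds, the unguarded loop stops at the failing second pass while the guarded loop attempts all five. Catching per pass keeps the thread alive but does not remove the underlying race.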

        Attachments

        1. YARN-2273-5.patch
          7 kB
          Wei Yan
        2. YARN-2273.patch
          7 kB
          Wei Yan
        3. YARN-2273.patch
          7 kB
          Wei Yan
        4. YARN-2273-replayException.patch
          5 kB
          Wei Yan
        5. YARN-2273.patch
          2 kB
          Wei Yan
        6. YARN-2273.patch
          1 kB
          Wei Yan

              People

              • Assignee: Wei Yan (ywskycn)
              • Reporter: Andy Skelton (skeltoac)
              • Votes: 0
              • Watchers: 14
