Hadoop YARN / YARN-2273

NPE in ContinuousScheduling thread when we lose a node

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.1
    • Fix Version/s: 2.6.0
    • None
    • Environment: cdh5.0.2 wheezy
    • Hadoop Flags: Reviewed

    Description

      One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this:

      2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_000001 released container container_1404858438119_4352_01_000004 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=<memory:8192, vCores:8> used=<memory:0, vCores:0> with event: KILL
      2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: <memory:335872, vCores:328>
      2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception.
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040)
      	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329)
      	at java.util.TimSort.sort(TimSort.java:203)
      	at java.util.TimSort.sort(TimSort.java:173)
      	at java.util.Arrays.sort(Arrays.java:659)
      	at java.util.Collections.sort(Collections.java:217)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306)
      	at java.lang.Thread.run(Thread.java:744)
      

      A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it.
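
      The trace above suggests that the comparator dereferenced a node that had just been removed while the ContinuousScheduling thread was sorting node IDs. The following is only a minimal sketch of that failure mode and of one defensive pattern (sort against a snapshot and skip IDs that are no longer present). It is not the FairScheduler code and not the attached patch; the class NodeSortSketch, the map "available", and the method sortSafely are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a scheduler's node bookkeeping (assumption, not YARN code).
class NodeSortSketch {
    // Node ID -> available memory in MB. Another thread may remove entries at any time.
    static final Map<String, Integer> available = new ConcurrentHashMap<>();

    // Fragile comparator: if either ID was removed, get() returns null and
    // unboxing it throws NullPointerException in the middle of the sort.
    static final Comparator<String> UNSAFE =
        (a, b) -> Integer.compare(available.get(b), available.get(a));

    // Defensive variant: copy the map once, drop IDs that have vanished,
    // then sort by available memory (largest first) against the copy only.
    static List<String> sortSafely(List<String> nodeIds) {
        Map<String, Integer> snapshot = new HashMap<>(available);
        List<String> live = new ArrayList<>();
        for (String id : nodeIds) {
            if (snapshot.containsKey(id)) {
                live.add(id);
            }
        }
        live.sort((a, b) -> Integer.compare(snapshot.get(b), snapshot.get(a)));
        return live;
    }

    public static void main(String[] args) {
        available.put("node-1", 8192);
        available.put("node-2", 4096);
        available.put("node-3", 6144);

        // The scheduling thread captures the ID list...
        List<String> ids = new ArrayList<>(available.keySet());
        // ...and then a node is lost before the sort runs.
        available.remove("node-2");

        try {
            new ArrayList<>(ids).sort(UNSAFE);
            System.out.println("unsafe sort finished without an exception");
        } catch (NullPointerException e) {
            System.out.println("unsafe sort threw NullPointerException");
        }

        System.out.println("safe order: " + sortSafely(ids));
    }
}
```

      An uncaught exception of this kind ends the thread that threw it, which would be consistent with the RM staying up while no containers were assigned. The snapshot approach trades a copy of the map per sort for not having to hold a lock during the sort; synchronizing removal and sorting on a common lock is another option.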

      Attachments

        1. YARN-2273.patch (1 kB, Wei Yan)
        2. YARN-2273.patch (2 kB, Wei Yan)
        3. YARN-2273-replayException.patch (5 kB, Wei Yan)
        4. YARN-2273.patch (7 kB, Wei Yan)
        5. YARN-2273.patch (7 kB, Wei Yan)
        6. YARN-2273-5.patch (7 kB, Wei Yan)

            People

              Assignee: Wei Yan (ywskycn)
              Reporter: Andy Skelton (skeltoac)