  Hadoop YARN / YARN-10295

CapacityScheduler NPE can cause apps to get stuck without resources


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.2.0
    • Fix Version/s: 3.2.2
    • Component/s: capacityscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When CapacityScheduler asynchronous scheduling is enabled and the log level is set to DEBUG, there is an edge case where a NullPointerException can cause the scheduler thread to exit, leaving applications stuck without allocated resources. Consider the following log:

      2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
      2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved  on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
      2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
      2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
      

      A container gets reserved on a host because the host doesn't have enough memory for the allocation, and after a short while it gets unreserved. However, because the scheduler thread runs asynchronously, it may already have entered the following if block (CapacityScheduler.java#L1602) while node.getReservedContainer() was still non-null. Calling it a second time to obtain the ApplicationAttemptId then throws an NPE, because the container was unreserved in the meantime.

      // Do not schedule if there are any reservations to fulfill on the node
      if (node.getReservedContainer() != null) {
          if (LOG.isDebugEnabled()) {
              LOG.debug("Skipping scheduling since node " + node.getNodeID()
                      + " is reserved by application " + node.getReservedContainer()
                      .getContainerId().getApplicationAttemptId());
          }
          return null;
      }
      

      A fix would be to store the reserved container object in a local variable before the if block and use that for both the null check and the log message.
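      A minimal sketch of that idea (the local variable name is illustrative, not necessarily the one used in the committed patch):

      // Do not schedule if there are any reservations to fulfill on the node
      RMContainer reservedContainer = node.getReservedContainer();
      if (reservedContainer != null) {
          if (LOG.isDebugEnabled()) {
              // Reuse the stored reference instead of calling
              // node.getReservedContainer() again, which may return null by now
              LOG.debug("Skipping scheduling since node " + node.getNodeID()
                      + " is reserved by application " + reservedContainer
                      .getContainerId().getApplicationAttemptId());
          }
          return null;
      }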

      Only branch-3.1 and branch-3.2 are affected; the newer branches contain YARN-9664, which indirectly fixed this.
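
      For reference, the preconditions described above (asynchronous scheduling plus DEBUG logging for the scheduler) correspond to settings along the following lines; shown only as an illustration of the setup, not copied from the affected cluster:

      <!-- capacity-scheduler.xml: enable asynchronous scheduling -->
      <property>
        <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
        <value>true</value>
      </property>

      # log4j.properties: DEBUG logging for the CapacityScheduler
      log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler=DEBUG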

      Attachments

        1. YARN-10295.001.branch-3.1.patch
          1 kB
          Benjamin Teke
        2. YARN-10295.001.branch-3.2.patch
          1 kB
          Benjamin Teke
        3. YARN-10295.002.branch-3.1.patch
          2 kB
          Benjamin Teke
        4. YARN-10295.002.branch-3.2.patch
          2 kB
          Benjamin Teke


          People

            Assignee: Benjamin Teke (bteke)
            Reporter: Benjamin Teke (bteke)
            Votes: 0
            Watchers: 4
