  Hadoop YARN / YARN-10295

CapacityScheduler NPE can cause apps to get stuck without resources


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0, 3.2.0
    • Fix Version/s: 3.2.2
    • Component/s: capacityscheduler
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When CapacityScheduler asynchronous scheduling is enabled and the log level is set to DEBUG, there is an edge case where a NullPointerException can cause the scheduler thread to exit, leaving applications stuck without allocated resources. Consider the following log:

      2020-05-27 10:13:49,106 INFO  fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(681)) - Reserved container=container_e10_1590502305306_0660_01_000115, on node=host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14> with resource=<memory:4096, vCores:1>
      2020-05-27 10:13:49,134 INFO  fica.FiCaSchedulerApp (FiCaSchedulerApp.java:internalUnreserve(743)) - Application application_1590502305306_0660 unreserved  on node host: ctr-e148-1588963324989-31443-01-000002.hwx.site:25454 #containers=14 available=<memory:2048, vCores:11> used=<memory:182272, vCores:14>, currently has 0 at priority 11; currentReservation <memory:0, vCores:0> on node-label=
      2020-05-27 10:13:49,134 INFO  capacity.CapacityScheduler (CapacityScheduler.java:tryCommit(3042)) - Allocation proposal accepted
      2020-05-27 10:13:49,163 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[Thread-4953,5,main] threw an Exception.
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1580)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1767)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1505)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:546)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:593)
      

      A container gets reserved on a host because the host doesn't have enough memory for the allocation, and after a short while it gets unreserved. However, because the scheduler thread runs asynchronously, it may already have entered the following if block (CapacityScheduler.java#L1602) while node.getReservedContainer() was still non-null. Calling it a second time to obtain the ApplicationAttemptId then throws an NPE, because the container was unreserved in the meantime.

      // Do not schedule if there are any reservations to fulfill on the node
      if (node.getReservedContainer() != null) {
          if (LOG.isDebugEnabled()) {
              LOG.debug("Skipping scheduling since node " + node.getNodeID()
                      + " is reserved by application " + node.getReservedContainer()
                      .getContainerId().getApplicationAttemptId());
          }
          return null;
      }
      

      A fix would be to store the reserved container object in a local variable before the if block and use that for both the null check and the log message.
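      A minimal sketch of that idea (the local variable name is illustrative, not necessarily the one used in the committed patch):

      // Do not schedule if there are any reservations to fulfill on the node
      RMContainer reservedContainer = node.getReservedContainer();
      if (reservedContainer != null) {
          if (LOG.isDebugEnabled()) {
              // Reuse the stored reference instead of calling
              // node.getReservedContainer() again, which may return null by now
              LOG.debug("Skipping scheduling since node " + node.getNodeID()
                      + " is reserved by application " + reservedContainer
                      .getContainerId().getApplicationAttemptId());
          }
          return null;
      }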

      Only branch-3.1 and branch-3.2 are affected; the newer branches contain YARN-9664, which indirectly fixed this.
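
      For reference, the preconditions described above (asynchronous scheduling plus DEBUG logging for the scheduler) correspond to settings along the following lines; shown only as an illustration of the setup, not copied from the affected cluster:

      <!-- capacity-scheduler.xml: enable asynchronous scheduling -->
      <property>
        <name>yarn.scheduler.capacity.schedule-asynchronously.enable</name>
        <value>true</value>
      </property>

      # log4j.properties: DEBUG logging for the CapacityScheduler
      log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler=DEBUG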

      Attachments

        1. YARN-10295.001.branch-3.1.patch
          1 kB
          Benjamin Teke
        2. YARN-10295.001.branch-3.2.patch
          1 kB
          Benjamin Teke
        3. YARN-10295.002.branch-3.1.patch
          2 kB
          Benjamin Teke
        4. YARN-10295.002.branch-3.2.patch
          2 kB
          Benjamin Teke


          People

            Assignee: Benjamin Teke (bteke)
            Reporter: Benjamin Teke (bteke)
            Votes: 0
            Watchers: 4
