[YARN-10178] Global Scheduler async thread crash caused by 'Comparison method violates its general contract - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.1
Fix Version/s: 3.4.0, 2.10.2, 3.3.2, 3.2.4
Component/s: capacity scheduler
Labels:
None

Target Version/s:

3.4.0

Description

Stack trace:

ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!                                                                     at java.util.TimSort.mergeHi(TimSort.java:899)
        at java.util.TimSort.mergeAt(TimSort.java:516)
        at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
        at java.util.TimSort.sort(TimSort.java:254)
        at java.util.Arrays.sort(Arrays.java:1512)
        at java.util.ArrayList.sort(ArrayList.java:1462)
        at java.util.Collections.sort(Collections.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)

In JDK 8, Arrays.sort by default is using the timsort algorithm, and timsort has a few requirements:

1.x.compareTo(y) != y.compareTo(x)
2.x>y,y>z --> x > z
3.x=y, x.compareTo(z) == y.compareTo(z)

If the Array / List does not satisfy any of these requirements, TimSort will throw a java.lang.IllegalArgumentException.

1. If we take a look into PriorityUtilizationQueueOrderingPolicy.compare method, we can see that Capacity Scheduler these queue fields in order to compare resource usage:

AbsoluteUsedCapacity
UsedCapacity
ConfiguredMinResource
AbsoluteCapacity

2. In CS, during the execution of AsyncScheduleThread while the queues are being sorted in PriorityUtilizationQueueOrderingPolicy, for choosing the queue to assign the container to this IllegalArgumentException is thrown.

3. If we take a look into the ResourceCommitterService method, it tries to commit a CSAssignment coming from the ResourceCommitRequest, look tryCommit function, the queue resource usage is being updated.

public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
    boolean updatePending) {
  long commitStart = System.nanoTime();
  ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
      (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
 
  ...
  boolean isSuccess = false;
  if (attemptId != null) {
    FiCaSchedulerApp app = getApplicationAttempt(attemptId);
    // Required sanity check for attemptId - when async-scheduling enabled,
    // proposal might be outdated if AM failover just finished
    // and proposal queue was not be consumed in time
    if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
      if (app.accept(cluster, request, updatePending)
          && app.apply(cluster, request, updatePending)) { // apply this resource
        ...
        }
    }
  }
  return isSuccess;
}
}

public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
    FiCaSchedulerNode> request, boolean updatePending) {
...
    if (!reReservation) {
        getCSLeafQueue().apply(cluster, request); 
    }
...
}

4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply invokes org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource:

void allocateResource(Resource clusterResource,
    Resource resource, String nodePartition) {
  try {
    writeLock.lock(); // only lock leaf queue lock
    queueUsage.incUsed(nodePartition, resource);
 
    ++numContainers;
 
    CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
        this, labelManager, nodePartition); // there will update queue statistics
  } finally {
    writeLock.unlock();
  }
}

5. We can see that ResourceCommitterService will only lock the Leaf Queue to update the queue statistics, but the AsyncScheduleThread do only lock the Root Queue (in ParentQueue#sortAndGetChildrenAllocationIterator)

private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
      String partition) {
    try {
      readLock.lock();
      return queueOrderingPolicy.getAssignmentIterator(partition);
    } finally {
      readLock.unlock();
    }
  }

so if multiple threads are comparing queue usage statistics and ResourceCommitterService applies Leaf Queue changes in statistics in a concurrent manner, it will break the TimSort algorithm's requirements, causing a thread crash.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

YARN-10178.001.patch
23/Jan/21 07:05
5 kB
Qi Zhu
YARN-10178.002.patch
23/Jan/21 11:21
5 kB
Qi Zhu
YARN-10178.003.patch
29/Jan/21 06:03
17 kB
Qi Zhu
YARN-10178.004.patch
29/Jan/21 09:19
18 kB
Qi Zhu
YARN-10178.005.patch
05/Feb/21 08:53
18 kB
Qi Zhu
YARN-10178.006.patch
15/Dec/21 18:50
7 kB
Andras Gyori
YARN-10178.branch-2.10.001.patch
17/Dec/21 12:47
5 kB
Andras Gyori

Issue Links

fixes

YARN-8737 Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile

Resolved

is related to

YARN-10058 Capacity Scheduler dispatcher hang when async thread crash

Resolved

YARN-11639 ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

Resolved

YARN-11684 PriorityQueueComparator violates general contract

Resolved

Activity

People

Assignee:: Andras Gyori

Reporter:: tuyu

Votes:: 0 Vote for this issue

Watchers:: 16 Start watching this issue

Dates

Created:: 03/Mar/20 06:22

Updated:: 18/Nov/24 13:24

Resolved:: 21/Dec/21 19:56