Details
Description
Stack trace:
ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeHi(TimSort.java:899)
at java.util.TimSort.mergeAt(TimSort.java:516)
at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
at java.util.TimSort.sort(TimSort.java:254)
at java.util.Arrays.sort(Arrays.java:1512)
at java.util.ArrayList.sort(ArrayList.java:1462)
at java.util.Collections.sort(Collections.java:177)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
In JDK 8, Arrays.sort by default is using the timsort algorithm, and timsort has a few requirements:
1.x.compareTo(y) != y.compareTo(x) 2.x>y,y>z --> x > z 3.x=y, x.compareTo(z) == y.compareTo(z)
If the Array / List does not satisfy any of these requirements, TimSort will throw a java.lang.IllegalArgumentException.
1. If we take a look into PriorityUtilizationQueueOrderingPolicy.compare method, we can see that Capacity Scheduler these queue fields in order to compare resource usage:
AbsoluteUsedCapacity UsedCapacity ConfiguredMinResource AbsoluteCapacity
2. In CS, during the execution of AsyncScheduleThread while the queues are being sorted in PriorityUtilizationQueueOrderingPolicy, for choosing the queue to assign the container to this IllegalArgumentException is thrown.
3. If we take a look into the ResourceCommitterService method, it tries to commit a CSAssignment coming from the ResourceCommitRequest, look tryCommit function, the queue resource usage is being updated.
public boolean tryCommit(Resource cluster, ResourceCommitRequest r, boolean updatePending) { long commitStart = System.nanoTime(); ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request = (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r; ... boolean isSuccess = false; if (attemptId != null) { FiCaSchedulerApp app = getApplicationAttempt(attemptId); // Required sanity check for attemptId - when async-scheduling enabled, // proposal might be outdated if AM failover just finished // and proposal queue was not be consumed in time if (app != null && attemptId.equals(app.getApplicationAttemptId())) { if (app.accept(cluster, request, updatePending) && app.apply(cluster, request, updatePending)) { // apply this resource ... } } } return isSuccess; } }
public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request, boolean updatePending) { ... if (!reReservation) { getCSLeafQueue().apply(cluster, request); } ... }
4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#apply invokes org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue#allocateResource:
void allocateResource(Resource clusterResource, Resource resource, String nodePartition) { try { writeLock.lock(); // only lock leaf queue lock queueUsage.incUsed(nodePartition, resource); ++numContainers; CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource, this, labelManager, nodePartition); // there will update queue statistics } finally { writeLock.unlock(); } }
5. We can see that ResourceCommitterService will only lock the Leaf Queue to update the queue statistics, but the AsyncScheduleThread do only lock the Root Queue (in ParentQueue#sortAndGetChildrenAllocationIterator)
private Iterator<CSQueue> sortAndGetChildrenAllocationIterator( String partition) { try { readLock.lock(); return queueOrderingPolicy.getAssignmentIterator(partition); } finally { readLock.unlock(); } }
so if multiple threads are comparing queue usage statistics and ResourceCommitterService applies Leaf Queue changes in statistics in a concurrent manner, it will break the TimSort algorithm's requirements, causing a thread crash.
Attachments
Attachments
Issue Links
- fixes
-
YARN-8737 Race condition in ParentQueue when reinitializing and sorting child queues in the meanwhile
- Resolved
- is related to
-
YARN-10058 Capacity Scheduler dispatcher hang when async thread crash
- Patch Available