Details
- Bug
- Status: Patch Available
- Major
- Resolution: Unresolved
- 3.1.2
- None
Description
On our cluster with a large number of NMs, the preemption monitor thread consistently threw java.util.ConcurrentModificationException when specific conditions were met, and as a result preemption did not work.
We found the following conditions; all four must be met:
- There are at least two non-exclusive partitions besides the default partition (call them partitions X and Y).
- app1, running in a queue that belongs to the default partition (call it the 'dev' queue), has borrowed resources from both partitions X and Y.
- app2 and app3, submitted to queues belonging to partitions X and Y respectively, are 'PENDING' because app1 is consuming the resources.
- Preempting a container of app1 lets the preemption monitor clear the resources borrowed from X or Y.
The main problem is that FifoCandidatesSelector.selectCandidates tries to remove a HashMap key (the partition name) while iterating over that HashMap.
Logically this is correct, because the same partition is not traversed again within this selectCandidates call. However, HashMap does not allow structural modification while it is being iterated.
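To illustrate the failure mode outside of YARN, here is a minimal sketch. It is not the code from FifoCandidatesSelector or from the attached patch: the class name, the method names, and the use of Integer in place of a resource object are all assumptions made for the example. It shows that removing a key inside a for-each over a HashMap's key set raises ConcurrentModificationException, and that removing through the iterator (one possible approach; iterating over a copy of the key set is another) does not.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical stand-alone sketch; names are illustrative, not YARN APIs.
public class PartitionIterationSketch {

  // Stand-in for a per-partition "resource still to obtain" map.
  static Map<String, Integer> sampleToObtain() {
    Map<String, Integer> toObtain = new HashMap<>();
    toObtain.put("", 4);  // default partition
    toObtain.put("X", 2);
    toObtain.put("Y", 2);
    return toObtain;
  }

  // Unsafe: removes a key from the map inside a for-each over its key set.
  // Returns true if ConcurrentModificationException was raised.
  static boolean unsafeClear(Map<String, Integer> toObtain) {
    try {
      for (String partition : toObtain.keySet()) {
        // Structural modification while the key iterator is still active.
        toObtain.remove(partition);
      }
      return false;
    } catch (ConcurrentModificationException e) {
      return true;
    }
  }

  // Safe: deducts the preempted amount and removes satisfied partitions
  // through the iterator, which keeps the iterator's state consistent.
  static void safeClear(Map<String, Integer> toObtain, int preempted) {
    Iterator<Map.Entry<String, Integer>> it = toObtain.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Integer> entry = it.next();
      int remaining = entry.getValue() - preempted;
      if (remaining <= 0) {
        it.remove();               // allowed during iteration
      } else {
        entry.setValue(remaining); // not a structural modification
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("unsafe removal threw CME: "
        + unsafeClear(sampleToObtain()));
    Map<String, Integer> toObtain = sampleToObtain();
    safeClear(toObtain, 2);
    System.out.println("partitions left after safe removal: "
        + toObtain.keySet());
  }
}
```

Note that the exception in the unsafe version is only raised when the iterator advances after the removal, so removing the last key visited would go unnoticed. This may be why the error appears only when at least two partitions besides the default one are involved.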
I wrote a test case that reproduces the error (testResourceTypesInterQueuePreemptionWithThreePartitions).
We found the problem and patched our cluster on 3.1.2, but trunk still seems to have the same problem.
I have attached a patch based on trunk.
Thanks!
{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)}}