Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
3.2.0
None
Description
After some queues were deleted, the preemption checker in SchedulingMonitor was skipped on every run because of a YarnRuntimeException.
Error logs:
ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't happen, cannot find TempQueuePerPartition for queueName=1535075839208
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
    at java.lang.Thread.run(Thread.java:834)
I think there is a problem with the partitionToUnderServedQueues field in ProportionalCapacityPreemptionPolicy. Items can be added to partitionToUnderServedQueues but are never removed unless the policy is rebuilt. For example, once the under-served queue "a" is added to this structure, it stays there indefinitely. The intra-queue preemption checker looks up queue info for every entry of partitionToUnderServedQueues in IntraQueueCandidatesSelector#selectCandidates and throws YarnRuntimeException if a queue is not found. As a result, after queue "a" is deleted from the queue structure, the preemption checker fails on every run.
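The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not the Hadoop code: PreemptionPolicySketch, its editSchedule signature, the clearEachRun flag, and the use of IllegalStateException in place of YarnRuntimeException are all assumptions made for illustration. Only the field name partitionToUnderServedQueues and the error message text come from the report. Clearing the map at the start of each run is shown as one possible remedy, not necessarily the fix that was committed.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical stand-in for the preemption policy; not the Hadoop class.
class PreemptionPolicySketch {
  // Mirrors the field named in the report: partition -> under-served queue names.
  private final Map<String, Set<String>> partitionToUnderServedQueues = new HashMap<>();
  // Hypothetical switch: when true, stale entries are dropped before every run.
  private final boolean clearEachRun;

  PreemptionPolicySketch(boolean clearEachRun) {
    this.clearEachRun = clearEachRun;
  }

  // One checker run. liveQueues: queues that currently exist;
  // underServedNow: queues found under-served during this run.
  void editSchedule(Collection<String> liveQueues, Collection<String> underServedNow) {
    if (clearEachRun) {
      partitionToUnderServedQueues.clear();
    }
    Set<String> live = new HashSet<>(liveQueues);
    // Entries are only ever added here (default partition "" for simplicity).
    partitionToUnderServedQueues
        .computeIfAbsent("", p -> new HashSet<>())
        .addAll(underServedNow);
    // Stand-in for the intra-queue selection step: every remembered queue must still exist.
    for (Set<String> queues : partitionToUnderServedQueues.values()) {
      for (String queue : queues) {
        if (!live.contains(queue)) {
          throw new IllegalStateException(
              "cannot find TempQueuePerPartition for queueName=" + queue);
        }
      }
    }
  }

  public static void main(String[] args) {
    for (boolean clear : new boolean[] {false, true}) {
      PreemptionPolicySketch policy = new PreemptionPolicySketch(clear);
      // Run 1: queues "a" and "b" exist, "a" is under-served.
      policy.editSchedule(Arrays.asList("a", "b"), Arrays.asList("a"));
      try {
        // Run 2: queue "a" has been deleted.
        policy.editSchedule(Arrays.asList("b"), Collections.<String>emptyList());
        System.out.println("clearEachRun=" + clear + ": run succeeded");
      } catch (IllegalStateException e) {
        System.out.println("clearEachRun=" + clear + ": " + e.getMessage());
      }
    }
  }
}
```

With clearEachRun set to false, the second run throws because "a" is still remembered as under-served; with it set to true, the stale entry is gone before the lookup and the run completes.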