[YARN-8709] CS preemption monitor always fails since one under-served queue was deleted - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 2.10.0, 3.2.0, 2.9.2, 3.0.4, 3.1.2
Component/s: capacityscheduler, scheduler preemption
Labels:
None

Description

After some queues were deleted, the preemption checker in SchedulingMonitor was skipped on every run because of a YarnRuntimeException.

Error logs:

ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't happen, cannot find TempQueuePerPartition for queueName=1535075839208
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
        at java.lang.Thread.run(Thread.java:834)

I think there is something wrong with the partitionToUnderServedQueues field in ProportionalCapacityPreemptionPolicy. Items can be added to partitionToUnderServedQueues but are never removed unless the policy is rebuilt. For example, once under-served queue "a" is added to this structure, it stays there permanently. The intra-queue preemption checker tries to get queue info for every entry of partitionToUnderServedQueues in IntraQueueCandidatesSelector#selectCandidates and throws YarnRuntimeException if a queue is not found. As a result, once queue "a" is deleted from the queue structure, the preemption checker always fails.
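The failure mode described above can be reproduced in a toy model. The sketch below is not the real ProportionalCapacityPreemptionPolicy code: the class UnderServedQueueSketch, its run method, and the clearEachRun flag are hypothetical names introduced for illustration, and IllegalStateException stands in for YarnRuntimeException. Clearing the map at the start of each run is shown as one conceivable remedy; the attached patches may take a different approach.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: class, method, and parameter names are hypothetical,
// not the actual ProportionalCapacityPreemptionPolicy code.
class UnderServedQueueSketch {
    // Plays the role of partitionToUnderServedQueues: partition -> queue names.
    private final Map<String, Set<String>> partitionToUnderServedQueues = new HashMap<>();
    private final boolean clearEachRun;

    UnderServedQueueSketch(boolean clearEachRun) {
        this.clearEachRun = clearEachRun;
    }

    // One checker run. liveQueues is the queue set of the current snapshot;
    // underServedNow are the queues found under-served in this run.
    // Returns how many recorded under-served queues were looked up successfully.
    int run(Set<String> liveQueues, String partition, Set<String> underServedNow) {
        if (clearEachRun) {
            // Assumed remedy: drop entries recorded by earlier runs.
            partitionToUnderServedQueues.clear();
        }
        partitionToUnderServedQueues
            .computeIfAbsent(partition, p -> new HashSet<>())
            .addAll(underServedNow);

        int checked = 0;
        for (String queue : partitionToUnderServedQueues.get(partition)) {
            if (!liveQueues.contains(queue)) {
                // Stand-in for the YarnRuntimeException in the error log.
                throw new IllegalStateException(
                    "cannot find TempQueuePerPartition for queueName=" + queue);
            }
            checked++;
        }
        return checked;
    }

    public static void main(String[] args) {
        Set<String> before = new HashSet<>(Arrays.asList("a", "b"));
        Set<String> after = new HashSet<>(Arrays.asList("b")); // queue "a" deleted
        Set<String> onlyA = new HashSet<>(Arrays.asList("a"));
        Set<String> none = new HashSet<>();

        // Entries are never removed: the second run trips over deleted queue "a".
        UnderServedQueueSketch stale = new UnderServedQueueSketch(false);
        stale.run(before, "", onlyA);
        try {
            stale.run(after, "", none);
        } catch (IllegalStateException e) {
            System.out.println("stale map: " + e.getMessage());
        }

        // Entries are cleared per run: the deleted queue is forgotten.
        UnderServedQueueSketch cleared = new UnderServedQueueSketch(true);
        cleared.run(before, "", onlyA);
        System.out.println("cleared map: checked " + cleared.run(after, "", none));
    }
}
```

In this model the stale variant fails on every run after the deletion, because nothing ever removes "a" from the map, which matches the repeated "skip this run" errors in the log. An alternative under the same assumptions would be to skip queues that no longer exist instead of throwing.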

Attachments

- YARN-8709.001.patch, 7 kB, 29/Aug/18 11:33, Tao Yang
- YARN-8709.002.patch, 8 kB, 08/Sep/18 00:49, Tao Yang

People

Assignee: Tao Yang
Reporter: Tao Yang
Votes: 0
Watchers: 6

Dates

Created: 24/Aug/18 11:52
Updated: 14/Oct/19 15:38
Resolved: 10/Sep/18 21:07