Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-8709

CS preemption monitor always fails since one under-served queue was deleted

    XMLWordPrintableJSON

Details

    Description

      After some queues deleted, the preemption checker in SchedulingMonitor was always skipped because of YarnRuntimeException for every run.

      Error logs:

      ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't happen, cannot find TempQueuePerPartition for queueName=1535075839208
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
              at java.lang.Thread.run(Thread.java:834)
      

      I think there is something wrong with partitionToUnderServedQueues field in ProportionalCapacityPreemptionPolicy. Items of partitionToUnderServedQueues can be add but never be removed, except rebuilding this policy. For example, once under-served queue "a" is added into this structure, it will always be there and never be removed, intra-queue preemption checker will try to get all queues info for partitionToUnderServedQueues in IntraQueueCandidatesSelector#selectCandidates and will throw YarnRuntimeException if not found. So that after queue "a" is deleted from queue structure, the preemption checker will always fail.

      Attachments

        1. YARN-8709.001.patch
          7 kB
          Tao Yang
        2. YARN-8709.002.patch
          8 kB
          Tao Yang

        Activity

          People

            Tao Yang Tao Yang
            Tao Yang Tao Yang
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: