Hadoop YARN / YARN-8709

CS preemption monitor always fails since one under-served queue was deleted


Details

    Description

      After some queues were deleted, every run of the preemption checker in SchedulingMonitor was skipped because of a YarnRuntimeException.

      Error logs:

      ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor: Exception raised while executing preemption checker, skip this run..., exception=
      org.apache.hadoop.yarn.exceptions.YarnRuntimeException: This shouldn't happen, cannot find TempQueuePerPartition for queueName=1535075839208
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getQueueByPartition(ProportionalCapacityPreemptionPolicy.java:701)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.computeIntraQueuePreemptionDemand(IntraQueueCandidatesSelector.java:302)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.IntraQueueCandidatesSelector.selectCandidates(IntraQueueCandidatesSelector.java:128)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:514)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:348)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
              at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:186)
              at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:300)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1147)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
              at java.lang.Thread.run(Thread.java:834)
      

      I think there is something wrong with the partitionToUnderServedQueues field in ProportionalCapacityPreemptionPolicy. Items can be added to partitionToUnderServedQueues but are never removed unless the policy is rebuilt. For example, once under-served queue "a" is added to this structure, it stays there permanently. The intra-queue preemption checker tries to get queue info for every entry of partitionToUnderServedQueues in IntraQueueCandidatesSelector#selectCandidates and throws YarnRuntimeException if a queue is not found. As a result, after queue "a" is deleted from the queue structure, the preemption checker always fails.
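To make the failure mode concrete, here is a minimal, self-contained sketch of the stale-entry problem described above. It is a simplified model, not the actual Hadoop code: the class UnderServedQueueSketch, its methods, the liveQueues set, and the use of IllegalStateException in place of YarnRuntimeException are all hypothetical names chosen for illustration. The cleanup shown in runWithCleanup is only one possible mitigation and is not necessarily what the attached patches do.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the bug; not the real YARN classes.
final class UnderServedQueueSketch {
    // partition -> queues that were once recorded as under-served
    private final Map<String, Set<String>> partitionToUnderServedQueues = new HashMap<>();
    // queues that currently exist in the queue structure (assumption for this sketch)
    private final Set<String> liveQueues = new LinkedHashSet<>();

    void addQueue(String queue) {
        liveQueues.add(queue);
    }

    // Deleting a queue does not touch partitionToUnderServedQueues,
    // which is the behaviour the report describes.
    void deleteQueue(String queue) {
        liveQueues.remove(queue);
    }

    void markUnderServed(String partition, String queue) {
        partitionToUnderServedQueues
            .computeIfAbsent(partition, k -> new LinkedHashSet<>())
            .add(queue);
    }

    // Stand-in for the failing lookup: throws when the queue is gone.
    private String getQueueByPartition(String queue, String partition) {
        if (!liveQueues.contains(queue)) {
            throw new IllegalStateException(
                "cannot find TempQueuePerPartition for queueName=" + queue);
        }
        return queue;
    }

    // One checker run with no cleanup: fails on any stale entry.
    int runWithoutCleanup(String partition) {
        int processed = 0;
        for (String queue : partitionToUnderServedQueues
                .getOrDefault(partition, Collections.emptySet())) {
            getQueueByPartition(queue, partition);
            processed++;
        }
        return processed;
    }

    // One checker run that first drops entries for deleted queues.
    int runWithCleanup(String partition) {
        Set<String> underServed = partitionToUnderServedQueues.get(partition);
        if (underServed != null) {
            underServed.retainAll(liveQueues);
        }
        return runWithoutCleanup(partition);
    }
}
```

In this model, marking queues "a" and "b" as under-served and then deleting "a" makes every later call to runWithoutCleanup throw, mirroring the repeated "skip this run" errors in the log. A single call to runWithCleanup removes the stale entry, after which both methods succeed again.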

      Attachments

        1. YARN-8709.001.patch
          7 kB
          Tao Yang
        2. YARN-8709.002.patch
          8 kB
          Tao Yang


          People

            Tao Yang