Hadoop YARN / YARN-5540

scheduler spends too much time looking at empty priorities

Details

    • Reviewed

    Description

      We're starting to see the capacity scheduler run out of scheduling horsepower when running 500-1000 applications on clusters with 4K nodes or so.

      This seems to be amplified by Tez applications. Tez applications have many more priorities (sometimes in the hundreds) than typical MR applications, so the scheduler loop that examines every priority within every running application starts to become a hotspot. The priorities appear to stay around forever, even when no resource request remains at that priority, so we spend a lot of time looking at nothing.

      jstack snippet:

      "ResourceManager Event Processor" #28 prio=5 os_prio=0 tid=0x00007fc2d453e800 nid=0x22f3 runnable [0x00007fc2a8be2000]
         java.lang.Thread.State: RUNNABLE
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequest(SchedulerApplicationAttempt.java:210)
              - eliminated <0x00000005e73e5dc0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:852)
              - locked <0x00000005e73e5dc0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
              - locked <0x00000003006fcf60> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:527)
              - locked <0x00000003001b22f8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:415)
              - locked <0x00000003001b22f8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1224)
              - locked <0x0000000300041e40> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
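
To make the cost described above concrete, here is a minimal, self-contained sketch. It is not the scheduler's code and not necessarily what the attached patches do: `PendingRequests`, `EmptyPriorityDemo`, and every method name below are hypothetical stand-ins. It models one application's outstanding requests as a map from priority to remaining container count, and shows how many priorities a scheduling pass still has to visit after all requests are satisfied, with and without dropping a priority once its count reaches zero.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for one application's outstanding requests, keyed by priority. */
class PendingRequests {
    // priority -> number of containers still wanted at that priority
    private final TreeMap<Integer, Integer> pending = new TreeMap<>();
    private final boolean pruneEmpty;

    PendingRequests(boolean pruneEmpty) {
        this.pruneEmpty = pruneEmpty;
    }

    /** Adds a request for the given number of containers at a priority. */
    void request(int priority, int containers) {
        pending.merge(priority, containers, Integer::sum);
    }

    /** Records one allocated container; optionally forgets the priority once nothing is left. */
    void allocated(int priority) {
        Integer left = pending.get(priority);
        if (left == null || left == 0) {
            return; // nothing outstanding at this priority
        }
        if (left == 1 && pruneEmpty) {
            pending.remove(priority); // assumption: drop the empty priority entirely
        } else {
            pending.put(priority, left - 1); // may leave a stale entry with count 0
        }
    }

    /** Number of priorities a scheduling pass would have to examine. */
    int prioritiesToScan() {
        return pending.size();
    }

    /** Mimics one pass: walk every known priority and return the first with work, or null. */
    Integer firstPriorityWithWork() {
        for (Map.Entry<Integer, Integer> e : pending.entrySet()) {
            if (e.getValue() > 0) {
                return e.getKey();
            }
        }
        return null;
    }
}

class EmptyPriorityDemo {
    /** Requests one container at each priority, satisfies them all, and reports the scan size. */
    static int scanCostAfterDrain(boolean pruneEmpty, int priorities) {
        PendingRequests app = new PendingRequests(pruneEmpty);
        for (int p = 0; p < priorities; p++) {
            app.request(p, 1);
        }
        for (int p = 0; p < priorities; p++) {
            app.allocated(p);
        }
        return app.prioritiesToScan();
    }

    public static void main(String[] args) {
        System.out.println("without pruning: " + scanCostAfterDrain(false, 200));
        System.out.println("with pruning:    " + scanCostAfterDrain(true, 200));
    }
}
```

In this toy model an application that once asked at 200 priorities keeps all 200 entries after they are satisfied unless they are pruned, so each pass over each such application walks 200 empty entries. Multiplied across hundreds of applications on every node heartbeat, that matches the "looking at nothing" behavior described above.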
      

      Attachments

        1. YARN-5540.001.patch
          13 kB
          Jason Darrell Lowe
        2. YARN-5540.002.patch
          13 kB
          Jason Darrell Lowe
        3. YARN-5540.003.patch
          13 kB
          Jason Darrell Lowe
        4. YARN-5540.004.patch
          13 kB
          Jason Darrell Lowe
        5. YARN-5540-branch-2.7.004.patch
          8 kB
          Jason Darrell Lowe
        6. YARN-5540-branch-2.8.004.patch
          10 kB
          Jason Darrell Lowe
        7. YARN-5540-branch-2.8.004.patch
          10 kB
          Jason Darrell Lowe

          People

            jlowe Jason Darrell Lowe
            nroberts Nathan Roberts
            Votes: 0
            Watchers: 14
