Uploaded image for project: 'Hadoop YARN'
  1. Hadoop YARN
  2. YARN-11639

ConcurrentModificationException and NPE in PriorityUtilizationQueueOrderingPolicy

    XMLWordPrintableJSON

Details

    Description

      When dynamic queue creation is enabled in weight mode and the deletion policy coincides with the PriorityQueueResourcesForSorting, RM stops assigning resources because of either ConcurrentModificationException or NPE in PriorityUtilizationQueueOrderingPolicy.

      Reproduced the NPE issue in Java8 and Java11 environment:

      ... INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Removing queue: root.dyn.PmvkMgrEBQppu
      2024-01-02 17:00:59,399 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[Thread-11,5,main] threw an Exception.
      java.lang.NullPointerException
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy$PriorityQueueResourcesForSorting.<init>(PriorityUtilizationQueueOrderingPolicy.java:225)
      	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
      	at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
      	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
      	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
      	at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
      	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
      	at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:260)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:1100)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1111)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:1124)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:942)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1724)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1659)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1816)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1562)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:558)
      	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:605)
      

      Observed the ConcurrentModificationException in Java8 environment, but could not reproduce yet:

      2023-10-27 02:50:37,584 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler:Thread Thread[Thread-15,5, main] threw an Exception.
      java.util.ConcurrentModificationException
      at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1388)
      at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
      at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
      at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
      at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
      at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
      at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtiliza
      ueOrderingPolicy.Java:260)
      

      The immediate (temporary) remedy to keep the cluster going is to restart the RM.
      The workaround is to disable the deletion of dynamically created child queues.

      Attachments

        Issue Links

          Activity

            People

              bender Ferenc Erdelyi
              bender Ferenc Erdelyi
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: