Hadoop YARN / YARN-4546

ResourceManager crash due to scheduling opportunity overflow

    Details

      Description

      If a resource request remains unsatisfied for long enough, the scheduling opportunities count for that request can overflow and crash the RM.

        Activity

        jlowe Jason Lowe added a comment -

        When the overflow occurs the RM crashes with a stacktrace like this:

        2015-12-26 20:18:39,731 [ResourceManager Event Processor] FATAL resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
        java.lang.IllegalArgumentException: count cannot be negative: -2147483648
                at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
                at com.google.common.collect.Multisets.checkNonnegative(Multisets.java:943)
                at com.google.common.collect.AbstractMapBasedMultiset.setCount(AbstractMapBasedMultiset.java:277)
                at com.google.common.collect.HashMultiset.setCount(HashMultiset.java:34)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addSchedulingOpportunity(SchedulerApplicationAttempt.java:485)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:872)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1019)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1061)
                at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:115)
                at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:682)
                at java.lang.Thread.run(Thread.java:745)
        2015-12-26 20:18:39,732 [ResourceManager Event Processor] INFO resourcemanager.ResourceManager: Exiting, bbye..
        

        In this particular case the resource request went unsatisfied for a long time because the application used node labels and had blacklisted every node with the requested label. At that point no node in the cluster could satisfy the request: each node either lacked the label or was blacklisted. The resource request therefore accumulated scheduling opportunities until the count eventually overflowed.
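
        The failure mode can be reproduced in isolation. The sketch below is not the YARN code: NonNegativeCounter is a hypothetical stand-in for the Guava multiset seen in the stack trace, and the names addOpportunityUnchecked and "priority-1" are made up for illustration. It only shows that incrementing an int past Integer.MAX_VALUE wraps to a negative value, which a non-negative check then rejects.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a multiset that rejects negative counts. */
class NonNegativeCounter<K> {
    private final Map<K, Integer> counts = new HashMap<>();

    int count(K key) {
        return counts.getOrDefault(key, 0);
    }

    void setCount(K key, int count) {
        // Mirrors the kind of precondition that fails in the stack trace.
        if (count < 0) {
            throw new IllegalArgumentException("count cannot be negative: " + count);
        }
        counts.put(key, count);
    }
}

class OverflowDemo {
    /** Unchecked increment: count + 1 wraps silently at Integer.MAX_VALUE. */
    static <K> void addOpportunityUnchecked(NonNegativeCounter<K> counter, K key) {
        counter.setCount(key, counter.count(key) + 1);
    }

    public static void main(String[] args) {
        NonNegativeCounter<String> counter = new NonNegativeCounter<>();
        // Start at the limit instead of looping 2^31 - 1 times.
        counter.setCount("priority-1", Integer.MAX_VALUE);
        try {
            addOpportunityUnchecked(counter, "priority-1");
        } catch (IllegalArgumentException e) {
            // prints "count cannot be negative: -2147483648"
            System.out.println(e.getMessage());
        }
    }
}
```

        In the RM the equivalent exception escapes the scheduler event handler, which is why a single stuck request is enough to bring the process down.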

        jlowe Jason Lowe added a comment -

        Patch to cap scheduling opportunities so they don't overflow.
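
        One way to cap a counter like this is to stop incrementing once it reaches Integer.MAX_VALUE. The following is a minimal sketch under that assumption and is not taken from the attached patch; CappedOpportunityCounter and its method names are hypothetical, and a plain HashMap stands in for whatever structure the scheduler actually uses.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter whose increments saturate at Integer.MAX_VALUE. */
class CappedOpportunityCounter<K> {
    private final Map<K, Integer> counts = new HashMap<>();

    int count(K key) {
        return counts.getOrDefault(key, 0);
    }

    /** For the example only: lets a caller start close to the limit. */
    void setCount(K key, int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count cannot be negative: " + count);
        }
        counts.put(key, count);
    }

    /** Increment, but never past Integer.MAX_VALUE, so the value cannot wrap. */
    void addOpportunity(K key) {
        int current = count(key);
        if (current < Integer.MAX_VALUE) {
            counts.put(key, current + 1);
        }
    }
}

class CappedCounterDemo {
    public static void main(String[] args) {
        CappedOpportunityCounter<String> counter = new CappedOpportunityCounter<>();
        counter.setCount("priority-1", Integer.MAX_VALUE - 1);
        counter.addOpportunity("priority-1"); // reaches the cap
        counter.addOpportunity("priority-1"); // no-op at the cap
        System.out.println(counter.count("priority-1") == Integer.MAX_VALUE); // prints "true"
    }
}
```

        Saturating loses nothing useful here: a request that has already seen about two billion scheduling opportunities is treated the same whether or not the count keeps growing.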

        djp Junping Du added a comment -

        Nice catch, Jason Lowe! +1 pending the Jenkins result.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote  Subsystem   Runtime   Comment
        0     reexec      0m 0s     Docker mode activated.
        +1    @author     0m 0s     The patch does not contain any @author tags.
        +1    test4tests  0m 0s     The patch appears to include 1 new or modified test files.
        +1    mvninstall  7m 33s    trunk passed
        +1    compile     0m 25s    trunk passed with JDK v1.8.0_66
        +1    compile     0m 30s    trunk passed with JDK v1.7.0_91
        +1    checkstyle  0m 14s    trunk passed
        +1    mvnsite     0m 35s    trunk passed
        +1    mvneclipse  0m 15s    trunk passed
        +1    findbugs    1m 12s    trunk passed
        +1    javadoc     0m 22s    trunk passed with JDK v1.8.0_66
        +1    javadoc     0m 26s    trunk passed with JDK v1.7.0_91
        +1    mvninstall  0m 30s    the patch passed
        +1    compile     0m 26s    the patch passed with JDK v1.8.0_66
        +1    javac       0m 26s    the patch passed
        +1    compile     0m 28s    the patch passed with JDK v1.7.0_91
        +1    javac       0m 28s    the patch passed
        +1    checkstyle  0m 13s    the patch passed
        +1    mvnsite     0m 34s    the patch passed
        +1    mvneclipse  0m 12s    the patch passed
        +1    whitespace  0m 0s     Patch has no whitespace issues.
        +1    findbugs    1m 16s    the patch passed
        +1    javadoc     0m 19s    the patch passed with JDK v1.8.0_66
        +1    javadoc     0m 24s    the patch passed with JDK v1.7.0_91
        -1    unit        60m 9s    hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66.
        -1    unit        60m 15s   hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_91.
        +1    asflicense  0m 19s    Patch does not generate ASF License warnings.
                          137m 43s



        Reason                            Tests
        JDK v1.8.0_66 Failed junit tests  hadoop.yarn.server.resourcemanager.TestClientRMTokens
                                          hadoop.yarn.server.resourcemanager.TestAMAuthorization
        JDK v1.7.0_91 Failed junit tests  hadoop.yarn.server.resourcemanager.TestClientRMTokens
                                          hadoop.yarn.server.resourcemanager.TestAMAuthorization
                                          hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher



        Subsystem                   Report/Notes
        Docker                      Image:yetus/hadoop:0ca8df7
        JIRA Patch URL              https://issues.apache.org/jira/secure/attachment/12780634/YARN-4546.001.patch
        JIRA Issue                  YARN-4546
        Optional Tests              asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname                       Linux 7605fefd7e13 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool                  maven
        Personality                 /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision                trunk / dec8fed
        Default Java                1.7.0_91
        Multi-JDK versions          /usr/lib/jvm/java-8-oracle:1.8.0_66 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_91
        findbugs                    v3.0.0
        unit                        https://builds.apache.org/job/PreCommit-YARN-Build/10163/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt
        unit                        https://builds.apache.org/job/PreCommit-YARN-Build/10163/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
        unit test logs              https://builds.apache.org/job/PreCommit-YARN-Build/10163/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.8.0_66.txt https://builds.apache.org/job/PreCommit-YARN-Build/10163/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_91.txt
        JDK v1.7.0_91 Test Results  https://builds.apache.org/job/PreCommit-YARN-Build/10163/testReport/
        modules                     C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Max memory used             75MB
        Powered by                  Apache Yetus 0.2.0-SNAPSHOT http://yetus.apache.org
        Console output              https://builds.apache.org/job/PreCommit-YARN-Build/10163/console

        This message was automatically generated.

        djp Junping Du added a comment -

        The test failures are not related, and I believe there are several JIRAs tracking them now. Committing the patch.

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #9056 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9056/)
        YARN-4546. ResourceManager crash due to scheduling opportunity overflow. (junping_du: rev c1462a67ff7bb632df50e1c52de971cced56c6a3)

        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java
        • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/TestSchedulerApplicationAttempt.java
        • hadoop-yarn-project/CHANGES.txt
        djp Junping Du added a comment -

        I have committed the patch to trunk, branch-2, branch-2.6, branch-2.7 and branch-2.8. Thanks Jason Lowe for contributing the patch!

        vinodkv Vinod Kumar Vavilapalli added a comment -

        Closing the JIRA as part of 2.7.3 release.


          People

          • Assignee:
            jlowe Jason Lowe
            Reporter:
            jlowe Jason Lowe