Hadoop YARN / YARN-5545

Fix issues related to Max App in capacity scheduler

    Details

    • Hadoop Flags:
      Reviewed

      Description

      Issues related to max applications in the Capacity Scheduler:
      1. Cap the total number of applications across the queue hierarchy based on the existing max-app calculation
      2. Introduce a new configuration that sets a default max applications per queue, irrespective of the queue capacity configuration
      3. When a queue's capacity for the default partition is ZERO but the queue has capacity for another partition, application submission fails even though the application targets the other partition

      Steps to reproduce issue 3:

      Configure capacity scheduler
      yarn.scheduler.capacity.root.default.capacity=0
      yarn.scheduler.capacity.root.queue1.accessible-node-labels.labelx.capacity=50
      yarn.scheduler.capacity.root.default.accessible-node-labels.labelx.capacity=50

      Submit an application as below:

      ./yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-alpha2-SNAPSHOT-tests.jar sleep -Dmapreduce.job.node-label-expression=labelx -Dmapreduce.job.queuename=default -m 1 -r 1 -mt 10000000 -rt 1

      2016-08-21 18:21:31,375 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/root/.staging/job_1471670113386_0001
      java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1471670113386_0001 to YARN : org.apache.hadoop.security.AccessControlException: Queue root.default already has 0 applications, cannot accept submission of application: application_1471670113386_0001
      	at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:316)
      	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:255)
      	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
      	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
      ...
      Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1471670113386_0001 to YARN : org.apache.hadoop.security.AccessControlException: Queue root.default already has 0 applications, cannot accept submission of application: application_1471670113386_0001
      	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:286)
      	at org.apache.hadoop.mapred.ResourceMgrDelegate.submitApplication(ResourceMgrDelegate.java:296)
      	at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:301)
      	... 25 more
      
      1. YARN-5545.0008.patch
        15 kB
        Bibin A Chundatt
      2. YARN-5545.0007.patch
        15 kB
        Bibin A Chundatt
      3. YARN-5545.0006.patch
        14 kB
        Bibin A Chundatt
      4. YARN-5545.0005.patch
        14 kB
        Bibin A Chundatt
      5. YARN-5545.004.patch
        15 kB
        Bibin A Chundatt
      6. YARN-5545.0003.patch
        22 kB
        Bibin A Chundatt
      7. YARN-5545.0002.patch
        22 kB
        Bibin A Chundatt
      8. YARN-5545.0001.patch
        21 kB
        Bibin A Chundatt
      9. capacity-scheduler.xml
        4 kB
        Bibin A Chundatt

        Activity

        Naganarasimha G R added a comment -

        It was a simple test-case fix to make the local variable mgr final, so I have fixed it and committed the patch. Thanks for the contribution Bibin A Chundatt, and for the additional reviews from Sunil G & Jason Lowe. Committed to trunk and branch-2.

        Naganarasimha G R added a comment -

        Thanks Bibin A Chundatt. The latest patch fails to compile on branch-2. Can you please check and provide a patch for branch-2?

        [ERROR] COMPILATION ERROR : 
        [INFO] -------------------------------------------------------------
        [ERROR] /opt/git/commit/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java:[790,16] local variable mgr is accessed from within inner class; needs to be declared final
        [INFO] 1 error
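
        As an aside, a minimal, self-contained sketch (not the actual TestApplicationLimits code) of the Java 7 rule that trips up branch-2: a local variable captured by an anonymous inner class must be declared final, whereas the Java 8 compiler used on trunk also accepts effectively-final locals.

            // Hypothetical example only. On Java 7 (branch-2) this compiles because
            // 'mgr' is declared final; without the modifier the compiler reports
            // "local variable mgr is accessed from within inner class; needs to be
            // declared final".
            public class FinalCaptureExample {
              public static void main(String[] args) {
                final StringBuilder mgr = new StringBuilder("node-labels");
                Runnable task = new Runnable() {
                  @Override
                  public void run() {
                    System.out.println(mgr); // 'mgr' is captured by the anonymous class
                  }
                };
                task.run();
              }
            }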
        
        Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10820 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10820/)
        YARN-5545. Fix issues related to Max App in capacity scheduler. (naganarasimha_gr: rev 503e73e849cbdd1194cc0d16b4969c60929aca11)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestApplicationLimits.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
        Sunil G added a comment -

        +1

        Naganarasimha G R added a comment -

        Thanks Bibin A Chundatt. +1, the latest patch looks good to me; if there are no further comments I will commit it later today.

        Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 18s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 46s trunk passed
        +1 compile 0m 32s trunk passed
        +1 checkstyle 0m 24s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 1m 0s trunk passed
        +1 javadoc 0m 24s trunk passed
        +1 mvninstall 0m 41s the patch passed
        +1 compile 0m 38s the patch passed
        +1 javac 0m 38s the patch passed
        +1 checkstyle 0m 24s the patch passed
        +1 mvnsite 0m 42s the patch passed
        +1 mvneclipse 0m 17s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 22s the patch passed
        +1 javadoc 0m 24s the patch passed
        +1 unit 42m 39s hadoop-yarn-server-resourcemanager in the patch passed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        59m 2s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:e809691
        JIRA Issue YARN-5545
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12838310/YARN-5545.0008.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 3e8ae235e1f0 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / c8bc7a8
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13854/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13854/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Bibin A Chundatt added a comment -

        Thank you Sunil G for the review.
        Attaching a patch that handles both the test-case fix and the condition check.

        Sunil G added a comment -

        Bibin A Chundatt
        In isSystemAppsLimitReached, a less-than-or-equal check is used, so I suspect one extra application can get submitted. Could you please confirm?
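
        To make the concern concrete, a small illustrative sketch (assumed method names, not the patch code) of how a less-than-or-equal guard admits one application beyond the configured maximum:

            // Illustration only: with "<=" a submission that arrives when the running
            // count already equals the maximum is still admitted.
            public class LimitCheckExample {
              static boolean canSubmitLenient(int numApps, int maxApps) {
                return numApps <= maxApps; // off-by-one: still true at the limit
              }
              static boolean canSubmitStrict(int numApps, int maxApps) {
                return numApps < maxApps;  // rejects once the limit is reached
              }
              public static void main(String[] args) {
                int maxApps = 2;
                System.out.println(canSubmitLenient(2, maxApps)); // true  -> admits a third app
                System.out.println(canSubmitStrict(2, maxApps));  // false -> limit enforced
              }
            }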

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 19s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 7m 11s trunk passed
        +1 compile 0m 34s trunk passed
        +1 checkstyle 0m 24s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 0m 59s trunk passed
        +1 javadoc 0m 22s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 31s the patch passed
        +1 javac 0m 31s the patch passed
        +1 checkstyle 0m 20s the patch passed
        +1 mvnsite 0m 36s the patch passed
        +1 mvneclipse 0m 15s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 5s the patch passed
        +1 javadoc 0m 19s the patch passed
        -1 unit 41m 20s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        57m 18s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:e809691
        JIRA Issue YARN-5545
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12838291/YARN-5545.0007.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 7b3d10941fc9 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 71adf44
        Default Java 1.8.0_101
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13850/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13850/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13850/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Bibin A Chundatt added a comment -

        Attaching a patch that addresses Naga's comments.

        Naganarasimha G R added a comment -

        Any updates on this?

        Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 8m 23s trunk passed
        +1 compile 0m 39s trunk passed
        +1 checkstyle 0m 24s trunk passed
        +1 mvnsite 0m 46s trunk passed
        +1 mvneclipse 0m 18s trunk passed
        +1 findbugs 1m 8s trunk passed
        +1 javadoc 0m 25s trunk passed
        +1 mvninstall 0m 40s the patch passed
        +1 compile 0m 39s the patch passed
        +1 javac 0m 39s the patch passed
        +1 checkstyle 0m 23s the patch passed
        +1 mvnsite 0m 39s the patch passed
        +1 mvneclipse 0m 15s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 8s the patch passed
        +1 javadoc 0m 19s the patch passed
        +1 unit 41m 14s hadoop-yarn-server-resourcemanager in the patch passed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        59m 12s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:e809691
        JIRA Issue YARN-5545
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12837810/YARN-5545.0006.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 8ac4d3e0f03b 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / acd509d
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13809/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13809/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Naganarasimha G R added a comment -

        Thanks Bibin A Chundatt.
        Overall the approach seems fine; just two small nits:

        1. In TestApplicationLimits.testApplicationLimitSubmit we do not pass failure messages to the asserts
        2. Maybe we can use static imports so that we can call assertEquals directly instead of Assert.assertEquals (see the sketch below)
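
        A short sketch of what the two nits amount to (illustrative names and values, not the actual test):

            // JUnit 4 style: the static import lets the test call assertEquals directly,
            // and the message argument explains the failure when the assertion trips.
            import static org.junit.Assert.assertEquals;

            public class AssertStyleExample {
              public void verifyActiveApps(int activeApps) {
                assertEquals("unexpected number of active applications in the queue",
                    1, activeApps);
              }
            }
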
        Bibin A Chundatt added a comment -

        Attaching a patch after moving the system-level max application limit into the scheduler.

        Naganarasimha G R added a comment -

        Thanks for sharing your thoughts and discussing this, Sunil G & Bibin A Chundatt.
        I had earlier thought that we typecast the scheduler and check whether apps can be accepted in multiple places, hence the suggestion to add it to the interface.
        Since that does not seem to be the case, I am fine with option 1: keep it simple, and if we need it later we can add something along the lines of option 2.

        Sunil G added a comment -

        Thanks Bibin A Chundatt

        Option 2 would introduce a new API, and such an API could later grab a lock in the scheduler (even though this patch does not intend to do so), which would not be good for a direct API call from the client end. I think such a provision needs more discussion and more visibility, and in this patch that important change would land only as a sub-part. Let's move that change to another ticket and discuss it there; the current label-related fix can go in here, since most of us already have consensus on it. Thoughts?

        Bibin A Chundatt added a comment -

        Is there any potential problem or feasibility issue with Option1?

        Both solutions are okay, but if we implement the second we avoid sending the app-add event to the scheduler, and later other schedulers can implement and reuse the same check.
        We just have to decide whether the scheduler and the history should know that an application was submitted and then rejected because of the application limit.

        Sunil G added a comment -

        Bibin A Chundatt
        Is there any potential problem or feasibility issue with Option1?

        Bibin A Chundatt added a comment - edited

        Sunil G
        Will update patch with solution 2.

        Sunil G added a comment -

        Hi Naganarasimha Garla and Bibin A Chundatt,

        I have a similar opinion to Naganarasimha Garla's: we are introducing a lot of instanceof checks, which is not very clean.

        A couple of options:

        • We could put this check inside CS#addApplication and raise an APP_REJECTED event back if the limit is met.
        • As suggested by Naga, we could add an interface in YarnScheduler and a dummy implementation in AbstractYarnScheduler; then CS could carry the checks as in the patch.

        I feel option 1 is slightly simpler if we can achieve the same result. Thoughts?

        Bibin A Chundatt added a comment - edited

        Naganarasimha G R / Sunil G

        We will add an interface in YarnScheduler to check whether an app can be submitted, so that each scheduler can implement it as per its needs.
        We can probably also move the access check in RMAppManager into the same interface.
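
        A rough, purely hypothetical sketch of the kind of interface being discussed (the name and signature are illustrative only, not an actual YARN API):

            // Hypothetical shape only; name and signature are illustrative, not YARN APIs.
            public interface ApplicationAdmissionCheck {
              /**
               * @return true if the scheduler can accept one more application for the
               *         given queue and user (e.g. queue/system max-app limits, ACLs).
               */
              boolean canAcceptApplication(String queueName, String user);
            }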

        Naganarasimha G R added a comment - edited

        Sunil G & Bibin A Chundatt,
        Was wondering whether to type cast would be the right approach or to introduce an api in YarnScheduler to validate whether application can be accepted or event better to do it in CapacityScheduler.addApplication which call leafQueue.submitApplication(which currently does the queue level validation for max apps) ? As in future there can be similar checks for other schedulers too and not good to have specific scheduler checks in the main RM flow

        Naganarasimha G R added a comment -

        Bibin A Chundatt,

        IIUC the finished application never gets to scheduler. From NEW state to FINISHED the transition will be complete. But for pending cases might cause a problem. ... If we add the check in current location we have the additional benefit of not creating apps and attempts when not necessary.

        I could not follow you completely, but additionally adding the !isRecovery check you mentioned should be sufficient. I had not seen this argument earlier; I just wanted to say that a finished app also goes through this call and then moves to the FINISHED state, so the recovery flow would fail.
        Sunil G, any other comments on the latest patch? If the above issue is fixed, would that be sufficient to go on?

        Bibin A Chundatt added a comment -

        Thank you Naganarasimha Garla for the comments. IIUC the finished application never gets to the scheduler; the transition from NEW to FINISHED completes directly. But the pending cases might cause a problem.
        Adding handling for isRecovery should be enough. If we add the check in the current location, we have the additional benefit of not creating apps and attempts when they are not necessary.

            // Check system level max application limit is reached
            if (!isRecovery && scheduler instanceof CapacityScheduler) {
              if (((CapacityScheduler) scheduler).isSystemAppsLimitReached()) {
                String message =
                    "Cluster level application limit reached,rejecting application";
                throw new YarnException(message);
              }
            }
        
        Naganarasimha G R added a comment -

        Thanks for the patch Bibin A Chundatt.
        createAndPopulateNewRMApp is used in the recovery flow and will be called for the finished apps too, so I think this would not be the right location for the check. Other than that, the rest of the patch is fine.

        Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 18s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 44s trunk passed
        +1 compile 0m 32s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 37s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 0m 57s trunk passed
        +1 javadoc 0m 21s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 30s the patch passed
        +1 javac 0m 30s the patch passed
        +1 checkstyle 0m 21s the patch passed
        +1 mvnsite 0m 36s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 3s the patch passed
        +1 javadoc 0m 18s the patch passed
        +1 unit 38m 34s hadoop-yarn-server-resourcemanager in the patch passed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        53m 48s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Issue YARN-5545
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12837644/YARN-5545.0005.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 3eeeb13a7305 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / d8bab3d
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13798/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13798/console
        Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Bibin A Chundatt added a comment -

        Attaching a patch that handles the system-level limits too.

        Sunil G added a comment -

        Hi Bibin A Chundatt and NGarla_Unused,

        The current approach in the patch looks fine. I also think a cluster-level max check can be added to protect the system from overshooting max-applications. I have not looked at the patch in detail; I will do that today.

        Naganarasimha G R added a comment -

        Hi Bibin A Chundatt,
        It seems there are no thoughts from others on this yet, so I think we can go ahead with the existing approach. Its drawback is that we cannot set the global default max for only a few queues (those whose default-partition capacity is 0); it gets enforced for all of them. If required we can introduce a new config later.
        Additionally, as we were discussing earlier, can you put in a check to ensure that the total number of applications does not exceed the cluster maximum applications?

        Bibin A Chundatt added a comment -

        Sunil G
        Could you please review the current implementation?

        Naganarasimha G R added a comment -

        Everything else is fine except for the caveat that you mentioned.

        But one case to point out here: is there any use case where a customer expects to divide max-apps by capacity (no queue override) and does not want a default global max-apps? If so, we can add a few more tuning configs to forcefully enable the capacity-based division of max-apps at each queue level over the default global max-apps. Does this make sense?

        So, based on the consensus from Jason Lowe, Sunil G & Tan, Wangda, maybe we can conclude to go ahead with the same approach for now, and add additional configs in the future if required.

        Sunil G added a comment - edited

        Extremely sorry for the comment; I mistyped it in the wrong JIRA. Please discard the comment below.

        .....
        Currently we are trying to invoke activateApplications while recovering each application. Yes, as of now nodes are getting registered later in the flow. But for scheduler, we need not have to consider such timing cases from RMAppManager/RM end. Being said that, its important to separate 2 issues out here
        ......

        Bibin A Chundatt added a comment -

        The test-case failure is not related to the attached patch; YARN-5548 already tracks it.
        Varun Saxena, can you have a look at YARN-5548 too?

        Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 15s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
        +1 mvninstall 6m 38s trunk passed
        +1 compile 0m 32s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 37s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 0m 56s trunk passed
        +1 javadoc 0m 20s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 29s the patch passed
        +1 javac 0m 29s the patch passed
        +1 checkstyle 0m 20s the patch passed
        +1 mvnsite 0m 35s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 2s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 36m 26s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        50m 43s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12835090/YARN-5545.004.patch
        JIRA Issue YARN-5545
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 6de9ebf193ce 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / dbd2057
        Default Java 1.8.0_101
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13498/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13498/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13498/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13498/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Bibin A Chundatt added a comment -

        Thank you Sunil G / Naganarasimha Garla / Jason Lowe for the discussion and comments.

        Attaching a patch based on the discussion.

        1. Added the new configuration property yarn.scheduler.capacity.global-queue-max-application (see the example below)
        2. Added a test case for application submission that covers the user limit and the queue application limit
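
        For reference, in the same key=value style used in the reproduction steps above, the new property could be set as follows (the value 1000 is only an illustrative example):

            yarn.scheduler.capacity.global-queue-max-application=1000

        A queue's own maximum-applications override, where configured, would still take precedence over this default, as discussed in the comments below.
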
        Sunil G added a comment -

        Thanks Naga for the points.
        Basically we give precedence to any queue-specific max-apps config (the override). This is existing behavior, so we are only looking at cases where this queue-specific max-apps config is not present.

        if set then there is no need for maxSystemApps * queueCapacities.getAbsoluteCapacity() as it will never reach

        There is no default for the global max-apps config. Hence, if the admin does not set it explicitly, we fall back to the existing way of calculating a queue's max-apps from the system-level max-apps w.r.t. the capacity of the queue.

        So that code remains reachable when the user does not specify the global max-apps config, which keeps backward compatibility.

        Problem started with maxSystemApps * queueCapacities.getAbsoluteCapacity(), which partition's absolute capacity needs to be considered when for a given queue is not overriding max applications and default capacity of the queue is zero.

        I think it is a choice on the admin's side. For scenarios where the default-label capacity is not configured and there is no queue-level override for max-apps, a possible solution is to configure the global max-apps config. If any queue has its own override, that will still be honored. This can solve the problem here.
        But one case to point out here: is there any use case where a customer expects to divide max-apps by capacity (no queue override) and does not want a default global max-apps? If so, we can add a few more tuning configs to forcefully enable the capacity-based division of max-apps at each queue level over the default global max-apps. Does this make sense?

        I feel that enforce strict checking should have been implicit requirement

        As mentioned in an earlier comment, we can add a strict check of max-apps against the system-wide max-apps limit. It can be implicit, and we can reject apps once it is hit. As pointed out by Bibin, I do not feel we need a config for that.
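
        To summarize the precedence being discussed, a small illustrative sketch (assumed names; not the patch code, and the real logic lives in the CapacityScheduler configuration and queue classes):

            public class QueueMaxAppsResolver {
              // Fallback order for a queue's maximum applications: per-queue override,
              // then the global default (if configured), then the legacy
              // capacity-proportional share of the system-wide maximum.
              static int resolveQueueMaxApps(Integer perQueueOverride, Integer globalDefault,
                  int maxSystemApps, float absoluteCapacity) {
                if (perQueueOverride != null) {
                  return perQueueOverride;                       // explicit queue setting wins
                }
                if (globalDefault != null) {
                  return globalDefault;                          // global default, if set
                }
                return (int) (maxSystemApps * absoluteCapacity); // existing behavior
              }

              public static void main(String[] args) {
                // A queue with zero default-partition capacity: the legacy formula yields 0,
                // while a configured global default keeps the queue usable.
                System.out.println(resolveQueueMaxApps(null, null, 10000, 0.0f)); // 0
                System.out.println(resolveQueueMaxApps(null, 1000, 10000, 0.0f)); // 1000
              }
            }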

        Naganarasimha G R added a comment -

        Thanks Sunil G, Tan, Wangda & Jason Lowe for taking the discussion forward.
        I still have a few queries:

        1. GlobalMaximumApplicationsPerQueue doesn't have any default set, right? If it is set, there is no need for maxSystemApps * queueCapacities.getAbsoluteCapacity(), as that code will never be reached.
        2. IMO the approach Sunil captured in his earlier comment does not completely solve the base problem. The problem started with maxSystemApps * queueCapacities.getAbsoluteCapacity(): which partition's absolute capacity should be considered when a given queue does not override max applications and its default-partition capacity is zero? With this approach, the only way to avoid it is to set GlobalMaximumApplicationsPerQueue, which implies that this value is taken for all queues and the earlier maxSystemApps * queueCapacities.getAbsoluteCapacity() approach is never considered.
        3. I feel that enforcing the strict check should be an implicit requirement, under the assumption that the admin has not configured queue max apps to exceed the system max apps. We need not validate in the configuration that every queue's max apps stays below the system max apps; we just need to validate at submission time that first the system-level max apps is not violated and then the queue-level max apps is not violated.
          Thoughts?
        Bibin A Chundatt added a comment -

        Sunil G

        However I think we do not need another config to enforce strict checking. It can be done in today's form.

        To keep the old behavior, we can keep the value as false by default.

        sunilg Sunil G added a comment -

        Thanks Jason Lowe for the valuable thoughts and suggestions.

        Thanks Wangda Tan. It makes sense to me. Bibin A Chundatt, however I think we do not need another config to enforce strict checking. It can be done in today's form.

        I will file a follow-up JIRA for the same. In that, we can check and reject app submission to any queue if the system-wide limit is met. Thoughts?

        bibinchundatt Bibin A Chundatt added a comment -

        Since maximum-applications is mainly used to cap the memory consumed by apps in the RM, I think at least in a follow-up JIRA, system-level maximum applications should be enforced.

        +1 for the same. Similar to cgroups, we can add a configuration to enable strict mode.

        leftnoteasy Wangda Tan added a comment -

        Thanks Jason Lowe, Sunil G for suggestions.

        I generally agree with approach at https://issues.apache.org/jira/browse/YARN-5545?focusedCommentId=15494147&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15494147.

        Since maximum-applications is mainly used to cap the memory consumed by apps in the RM, I think at least in a follow-up JIRA, system-level maximum applications should be enforced. We should not allow the number of pending + running apps to go beyond the system-level maximum applications. Without this, it is going to be hard to estimate how many apps are in the RM.

        Thoughts?

        jlowe Jason Lowe added a comment -

        Yes, that's essentially the idea. Users can work around the issue initially reported today by setting a queue-specific max apps setting. All the new global queue max apps setting does is allow users to easily specify a default max apps value for all queues that don't have a specific setting rather than manually set it themselves on each one.

        sunilg Sunil G added a comment -

        Thank you very much Jason Lowe for sharing use case and detailed analysis.

        I think I now understand the intent here. We will stick with the existing configuration set and introduce a much more flexible global queue max-apps. So queues which are not configured at the per-queue level and do not have any capacity configured (in the case of node labels, the problem mentioned in this JIRA) will fall back to this new config (global queue max-apps).

        So I think, more or less, we could have the below pseudocode to represent this behavior.

            maxApplications = conf.getMaximumApplicationsPerQueue(getQueuePath());
            if (maxApplications < 0) {
              int maxGlobalPerQueueApps = conf.getGlobalMaximumApplicationsPerQueue();
              if(maxGlobalPerQueueApps > 0) {
                 maxApplications = maxGlobalPerQueueApps;
              } else  {
                int maxSystemApps = conf.getMaximumSystemApplications();
                maxApplications =
                  (int) (maxSystemApps * queueCapacities.getAbsoluteCapacity());
              }
            }
        

        So in cases where no capacity is configured for some labels in a queue, we could make use of the global queue max-apps configuration.

        jlowe Jason Lowe added a comment -

        This could be configured to set max-apps per queue at the cluster level (a queue won't override this).

        A queue-level max-app setting should always override the system-level setting. If a user explicitly sets the max-apps setting for a particular queue then we cannot ignore that. We already have setups today where max-apps is being tuned at the queue-level for some queues.

        Today if users set a queue-level max app limit then it overrides any system-level limit. That means even today users are allowed to configure RMs that can accept over the system-level app limit by explicitly overriding the derived queue limits with specific limits that are larger. Therefore I'm tempted to have the global queue config completely override the old system-level max-apps config because it's akin to setting the max-apps level for each queue explicitly. That means we operate in one of two modes: if global queue max-apps is not set then we do what we do today and derive the max-apps based on relative capacities. Queues that override max-apps at their level continue to behave as they do today and get the override setting. If the global queue max-apps is set then yarn.scheduler.capacity.maximum-applications is completely ignored. Queues that override max-apps at their level continue to behave as they do today and get the override setting. Queues that do not override get the global queue setting as their max apps setting.

        This preserves existing behavior if the queue is not set and is likely the least surprising behavior when the new setting is used, especially if we document for both the old system max-apps and global queue max-apps configs that the latter always overrides the former when set.

        sunilg Sunil G added a comment -

        Thank you very much Jason Lowe for pitching in and sharing thoughts. Makes sense to me overall.

        if we go down this route then I think we should have a separate top-level config that, when set, specifies the default max-apps per queue explicitly rather than having them try to derive it based on relative capacities. We can then debate whether that also overrides the system-wide setting or if we still respect the system-wide limit.

        IIUC, the existing yarn.scheduler.capacity.maximum-applications can be used as the system-level max limit for apps, and the proposal is a new config like yarn.scheduler.capacity.global.queue-level-maximum-applications. This could be configured to set max-apps per queue at the cluster level (a queue won't override this). So if I set this config as 10k, then any queue could at most submit 10k apps. And this will also work along with the system-wide app limit. Hence if we configure the system-wide app limit as 50k, and assuming we have 10 queues (10k limit each), we will not end up having 100k apps in the cluster; rather, we will hit the system-wide limit of 50k.

        As more queues are added to the system, the admin can decrease the global queue max-app limit for better fine-tuning if needed. If we intend to use the global queue max-app limit as a relaxed boundary, then strict actions (rejecting an app) can be taken based on the system-wide limit. But if we configure this limit more judiciously, we can think of making the per-queue max-app limit also a strict limit for rejecting apps in a queue.
        I see only one problem now. If we do not make Q * X ~ Y (where Q is the number of queues, X is the global per-queue limit and Y is the system-wide max-app limit) a strict rule, then we have two possibilities: Q * X > Y and Q * X < Y. I think most admins would prefer the former, where the system-wide limit is the stricter one and the per-queue limit is relaxed. But if we use the latter, then we may reject apps even though the system-wide limit has not been met. This may or may not be fine. I think with more discussion we can come to a common consensus here.
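
        A small sketch of the numbers above. The new property name is the one proposed in this comment and should be treated as illustrative (the final key may differ); the effective cap shown assumes the system-wide limit is actually enforced at submission, which is the point under discussion:

            import java.util.Properties;

            public class GlobalQueueMaxAppsExample {
              public static void main(String[] args) {
                // Property names as proposed in this thread; illustrative only.
                Properties conf = new Properties();
                conf.setProperty("yarn.scheduler.capacity.maximum-applications", "50000");
                conf.setProperty(
                    "yarn.scheduler.capacity.global.queue-level-maximum-applications", "10000");

                int systemLimit = Integer.parseInt(
                    conf.getProperty("yarn.scheduler.capacity.maximum-applications"));
                int perQueueLimit = Integer.parseInt(conf.getProperty(
                    "yarn.scheduler.capacity.global.queue-level-maximum-applications"));
                int queues = 10;

                // Q * X = 10 * 10000 = 100000 > Y = 50000, so the system-wide limit is the
                // effective cap, assuming it is enforced when apps are submitted.
                int effectiveClusterCap = Math.min(queues * perQueueLimit, systemLimit);
                System.out.println("Effective cluster-wide cap: " + effectiveClusterCap); // 50000
              }
            }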

        jlowe Jason Lowe added a comment -

        The problem with changing queues to use the max apps conf directly is that it becomes more difficult for admins to control the overall memory pressure on the RM from pending apps. IIUC after that change each queue would be able to hold up to the system max-apps number of apps. So each time an admin adds a queue it piles on another potential max-apps amount of apps the RM could be tracking in total. Or if the admin increases the max-apps number by X it actually increases the total RM app storage by Q*X, where Q is the number of leaf queues.

        That's quite different than what happens today and is a significant behavior change. If we go down this route then I think we should have a separate top-level config that, when set, specifies the default max-apps per queue explicitly rather than having them try to derive it based on relative capacities. We can then debate whether that also overrides the system-wide setting or if we still respect the system-wide limit (i.e.: queue may reject an app submission not because it hit the queue's max apps limit but because the RM hit the system-wide apps limit). Going with a separate, new config means we can preserve backwards compatibility for those who have become accustomed to the existing behavior and no surprises when admins use their old configs on the new software.

        I think max-am-resource-percent is a red herring with respect to the max apps discussion. max-am-resource-percent only controls how many active applications there are in a queue, and max apps is controlling the total number of apps in the queue. In fact I wouldn't be surprised if the code doesn't check and an admin could configure the RM to allow more active apps than the total number of apps the queue is allowed to have at any time.

        bibinchundatt Bibin A Chundatt added a comment -

        we could keep only yarn.scheduler.capacity.maximum-applications at the system level. We could avoid configuring maximum-applications per queue.

        We should not remove maximum-applications per queue. Only when max applications per queue is not configured does the system-level limit have an impact, as per the current implementation; there is no need to change that.

        sunilg Sunil G added a comment -

        Bibin A Chundatt, NGarla_Unused

        Had an offline discussion with Wangda Tan on this. Summary is as follows

        If we opt for Approach 1 (consider the average of the absolute percentage across all partitions, but not the average of the absolute percentage per partition), we may have issues like the ones Bibin A Chundatt came across,

        • If there are no nodes in cluster
        • During RM restart

        and apps will be rejected. We could come up with a workaround here, but the code will not be that clean in a critical code path of the scheduler.

        So one of the suggestions is:
        we could keep only yarn.scheduler.capacity.maximum-applications at the system level. We could avoid configuring maximum-applications per queue. Yes, it's a behavioral change. Still, in current use cases this per-queue configuration is not a very strict one; it is usually configured with a very big value, and max-am-resource-percent plays the crucial role in limiting the applications (max AM containers) running in a queue.

        Current code in LeafQueue:

            maxApplications = conf.getMaximumApplicationsPerQueue(getQueuePath());
            if (maxApplications < 0) {
              int maxSystemApps = conf.getMaximumSystemApplications();
              maxApplications =
                  (int) (maxSystemApps * queueCapacities.getAbsoluteCapacity());
            }
        

        So this could be changed to maxApplications = conf.getMaximumSystemApplications();. This value could be configured higher in cases where more labels are available in the system. Thoughts?

        Looping in Jason Lowe. Please share your thoughts.

        Naganarasimha Naganarasimha G R added a comment -

        Bibin A Chundatt, Tan, Wangda,
        Bibin's last approach seems very complicated for an admin to understand the intent of; IMO Bibin's initial approach, which matches Wangda's last comment, is probably the best approach, and it's just that we need to solve for this one scenario. My thoughts on that scenario:

        1. Do not allow any apps to be submitted. I feel there is no harm in it; either it is a momentary situation (in case of failover) or something is seriously wrong that the admin needs to take care of.
        2. Allow a certain number of apps, which should also consider already running apps; say, 10% of the max running apps can be considered.
          Thoughts?
        bibinchundatt Bibin A Chundatt added a comment -

        Thank you Wangda Tan for looking into the issue.

        So queues will split the maximum-application number according to the ratio of their total configured resources across partitions

        Approach
        Consider the average of the absolute percentage across all partitions, not the average of the absolute percentage per partition; e.g., label1 can be 10% of 20 GB and the default partition can be 50% of 100 GB.

            Get the percentage capacity of the queue as [ sum of resources of queue A across all partitions (X) / total cluster resource (Y) ] = absolute percentage over the whole cluster (Z).
            max applications of queue = Z * maxclusterapplication
            The max applications always have to be updated on NODE registration and removal.

        This was the initial approach we thought about; during discussion we came across the scenario where the RM is restarted and NMs are not yet registered, so apps might get rejected.

        Any thoughts on how we should handle the above scenario?

        leftnoteasy Wangda Tan added a comment -

        Bibin A Chundatt, Sunil G, Naganarasimha G R.

        Thanks for discussion,

        I think for this issue, what we should do:

        • Don't split the maximum-application number per partition; as we already have am-resource-percent per partition, adding more per-partition configuration will confuse users
        • Also, you cannot say one app belongs to one partition; you can only say one AM belongs to one partition
        • So queues will split the maximum-application number according to the ratio of their total configured resources across partitions. For example,
          Cluster maximum-application = 100, 
          queueA configured partitionX = 10G, partitionY = 20G; 
          queueB configured partitionX = 20G, partitionY = 50G;
          

          So queueA's maximum-applications is 100 * (10 + 20) / (10 + 20 + 20 + 50) = 30
          And queueB's maximum-applications is 100 * (20 + 50) / (10 + 20 + 20 + 50) = 70 (a small worked sketch of this split follows after this comment)

        • Please note that the maximum-applications of queues will be updated when the CS configuration is updated (refresh queues) and when the cluster resource is updated, so we need to update it inside CSQueue#updateClusterResource.

        Thoughts?
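
        A self-contained sketch of that split using the numbers above; queue and partition names are illustrative and this is not code from any attached patch:

            import java.util.LinkedHashMap;
            import java.util.Map;

            public class MaxAppsByConfiguredResource {

              /** maxApps(queue) = clusterMaxApps * (queue's configured resource across all
               *  partitions) / (total configured resource of all queues and partitions). */
              static Map<String, Integer> split(int clusterMaxApps,
                  Map<String, int[]> configuredGbPerQueue) {
                int total = 0;
                for (int[] partitions : configuredGbPerQueue.values()) {
                  for (int gb : partitions) {
                    total += gb;
                  }
                }
                Map<String, Integer> maxApps = new LinkedHashMap<>();
                for (Map.Entry<String, int[]> e : configuredGbPerQueue.entrySet()) {
                  int queueTotal = 0;
                  for (int gb : e.getValue()) {
                    queueTotal += gb;
                  }
                  maxApps.put(e.getKey(), clusterMaxApps * queueTotal / total);
                }
                return maxApps;
              }

              public static void main(String[] args) {
                Map<String, int[]> configured = new LinkedHashMap<>();
                configured.put("queueA", new int[] {10, 20}); // partitionX, partitionY in GB
                configured.put("queueB", new int[] {20, 50});
                // Prints {queueA=30, queueB=70}, matching the example above.
                System.out.println(split(100, configured));
              }
            }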

        bibinchundatt Bibin A Chundatt added a comment -

        NGarla_Unused and Sunil G

        Also consider the cases when the accessibility is * and new partitions are added without refreshing; this configuration will be wrong as it is static.

        Thank you for pointing that out; I will check the same. But NGarla_Unused, whenever we reconfigure the capacity-scheduler XML these limits will also get refreshed.

        Would it be better to set the default value of yarn.scheduler.capacity.maximum-applications.accessible-node-labels.<label> to that of yarn.scheduler.capacity.maximum-applications

        Will use yarn.scheduler.capacity.maximum-applications itself.

        IIUC you seem to adopt an approach a little different from what you mentioned in your comment; though we have a per-partition-level max-app limit, we just sum up the max limits of all partitions under a queue and check against ApplicationLimit.getAllMaxApplication()

        This was added since, IIUC, we cannot consider applications per partition for the app limit; we have to check the max apps for the queue across all partitions.

        Documentation will be added for the same.

        sunilg Sunil G added a comment -

        NGarla_Unused and Bibin A Chundatt

        I earlier suggested having "maximum-applications" per label. And as mentioned by Naga in the last summary, it is one of the options to control apps for labels.
        However it may be an added hurdle for admins to set it correctly per label. Also, I had discussed this with Wangda Tan earlier; I think it's better if we have maximum-applications per <label> cluster-wide (as mentioned in option 2, with a slight difference), similar to yarn.scheduler.capacity.maximum-applications. Maybe we need not expose this as a new config; rather, we can adopt it from yarn.scheduler.capacity.maximum-applications itself. It could be documented to explain this. Thoughts?

        Naganarasimha Naganarasimha G R added a comment -

        Thanks Bibin A Chundatt for the patch.
        A few points to discuss on the approach:

        1. Would it be good to have a separate queue-partition-based max application limit, similar to yarn.scheduler.capacity.<queue-path>.maximum-applications, so that there is finer control on logical partitions similar to the default partition?
        2. Would it be better to set the default value of yarn.scheduler.capacity.maximum-applications.accessible-node-labels.<label> to that of yarn.scheduler.capacity.maximum-applications? It would make the admin's work much easier. Similarly, we can decide the same for the previous point if we plan to adopt it.
        3. IIUC you seem to adopt an approach a little different from what you mentioned in your comment; though we have a per-partition-level max-app limit, we just sum up the max limits of all partitions under a queue and check against ApplicationLimit.getAllMaxApplication(). If we are not actually going to validate against each queue's partition-level max apps, then why come up with a new configuration? Also consider the cases when the accessibility is * and new partitions are added without refreshing; this configuration will be wrong as it is static.
        4. Need to take care of documentation, which I think is missing for MaximumAMResourcePercentPerPartition too. Maybe it can be handled in a different JIRA.
        hadoopqa Hadoop QA added a comment -
        +1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 16s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 8m 0s trunk passed
        +1 compile 0m 39s trunk passed
        +1 checkstyle 0m 25s trunk passed
        +1 mvnsite 0m 40s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 0m 57s trunk passed
        +1 javadoc 0m 20s trunk passed
        +1 mvninstall 0m 30s the patch passed
        +1 compile 0m 34s the patch passed
        +1 javac 0m 34s the patch passed
        +1 checkstyle 0m 21s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 216 unchanged - 2 fixed = 216 total (was 218)
        +1 mvnsite 0m 43s the patch passed
        +1 mvneclipse 0m 16s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 18s the patch passed
        +1 javadoc 0m 19s the patch passed
        +1 unit 40m 45s hadoop-yarn-server-resourcemanager in the patch passed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        57m 18s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827648/YARN-5545.0003.patch
        JIRA Issue YARN-5545
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 43d78b37e973 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / b6d839a
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13053/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13053/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        bibinchundatt Bibin A Chundatt added a comment -

        JIRA YARN-5548 exists for the test case failure.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 19s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 6m 47s trunk passed
        +1 compile 0m 33s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 38s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 0m 56s trunk passed
        +1 javadoc 0m 21s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 29s the patch passed
        +1 javac 0m 29s the patch passed
        -1 checkstyle 0m 19s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 216 unchanged - 2 fixed = 218 total (was 218)
        +1 mvnsite 0m 35s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 2s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 39m 26s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        54m 5s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12827633/YARN-5545.0002.patch
        JIRA Issue YARN-5545
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux bbc76c507adb 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 401db4f
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13050/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13050/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13050/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13050/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13050/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        bibinchundatt Bibin A Chundatt added a comment -

        Attaching a patch after handling checkstyle and fixing the test case failure.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 18s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 2 new or modified test files.
        +1 mvninstall 7m 34s trunk passed
        +1 compile 0m 34s trunk passed
        +1 checkstyle 0m 22s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 1m 4s trunk passed
        +1 javadoc 0m 23s trunk passed
        +1 mvninstall 0m 33s the patch passed
        +1 compile 0m 32s the patch passed
        +1 javac 0m 32s the patch passed
        -1 checkstyle 0m 20s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 6 new + 123 unchanged - 2 fixed = 129 total (was 125)
        +1 mvnsite 0m 38s the patch passed
        +1 mvneclipse 0m 15s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 8s the patch passed
        +1 javadoc 0m 19s the patch passed
        -1 unit 38m 39s hadoop-yarn-server-resourcemanager in the patch failed.
        -1 asflicense 0m 16s The patch generated 1 ASF License warnings.
        54m 32s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationLimits



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12826489/YARN-5545.0001.patch
        JIRA Issue YARN-5545
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 3628e2179fc0 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 01721dd
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/12973/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/12973/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/12973/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/12973/testReport/
        asflicense https://builds.apache.org/job/PreCommit-YARN-Build/12973/artifact/patchprocess/patch-asflicense-problems.txt
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/12973/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        bibinchundatt Bibin A Chundatt added a comment - edited

        Sunil G/ NGarla_Unused

        1. The solution based on resource usage has an issue during startup when none of the NodeManagers are registered: the resource will be zero and applications can get rejected.
        2. Attaching a patch based on yarn.scheduler.capacity.maximum-applications applied per partition for all queues. For each label we can configure yarn.scheduler.capacity.maximum-applications.accessible-node-labels.<label>.
        3. During the application limit check, the applications that the leaf queue runs across the complete cluster (all partitions) are considered.
        4. When the property is not configured, the default value of yarn.scheduler.capacity.maximum-applications is considered for partition queues.

        If max-applications for a queue is configured, then yarn.scheduler.capacity.maximum-applications will not be considered.

        Attaching the first patch for the same; a rough sketch of the per-label lookup is shown below.
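
        A rough sketch of the lookup described in points 2 and 4, using the property names from this comment; the helper and fallback default below are assumptions for illustration, not necessarily what the attached patch implements:

            import java.util.Properties;

            public class PerLabelMaxAppsLookup {

              /** Resolve max-applications for a partition: per-label value if present,
               *  otherwise fall back to yarn.scheduler.capacity.maximum-applications. */
              static int maxAppsForLabel(Properties conf, String label) {
                String perLabelKey =
                    "yarn.scheduler.capacity.maximum-applications.accessible-node-labels." + label;
                String value = conf.getProperty(perLabelKey,
                    conf.getProperty("yarn.scheduler.capacity.maximum-applications", "10000"));
                return Integer.parseInt(value);
              }

              public static void main(String[] args) {
                Properties conf = new Properties();
                conf.setProperty("yarn.scheduler.capacity.maximum-applications", "10000");
                conf.setProperty(
                    "yarn.scheduler.capacity.maximum-applications.accessible-node-labels.labelx",
                    "4000");
                System.out.println(maxAppsForLabel(conf, "labelx"));  // 4000 (per-label override)
                System.out.println(maxAppsForLabel(conf, "labely"));  // 10000 (fallback)
              }
            }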

        bibinchundatt Bibin A Chundatt added a comment -

        Wangda Tan, could you please share your thoughts?

        bibinchundatt Bibin A Chundatt added a comment -

        Thank you Sunil G and NGarla_Unused for the feedback.
        As already mentioned, the issue is when queue-level max apps are not configured.
        Then int maxSystemApps = conf.getMaximumSystemApplications(); is considered as the app limit, and apps are distributed to each queue by absolute percentage.

        The configuration (maxclusterapplication) is the cluster-level application limit, and when we partition it down to a lower level it should be based on the queue.

        The system-level app limit should be considered.

        If we implement as per our approach:

        From Sunil G's example, queueA will have 53% * 10000 = 5300 as the app limit and queueB will have 47% * 10000 = 4700 as the limit, when yarn.scheduler.capacity.maximum-applications is 10000 as the cluster limit (5300 + 4700 = 10000). And when labels are not available, we get the same behavior as before.

        User-level calculations are already based on the max applications in a queue. The user level should be based only on the overall queue and should not consider labels; that is my understanding.

            maxApplicationsPerUser = Math.min(maxApplications,
                (int)(maxApplications * (userLimit / 100.0f) * userLimitFactor));
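
        For concreteness, plugging the numbers above into that formula (the userLimit and userLimitFactor values below are illustrative defaults, not taken from any particular cluster):

            public class MaxAppsPerUserExample {
              public static void main(String[] args) {
                int maxApplications = 5300;   // queueA's limit from the example above
                float userLimit = 100f;       // illustrative: minimum user-limit percent
                float userLimitFactor = 1f;   // illustrative: user-limit factor

                // Same formula as quoted above.
                int maxApplicationsPerUser = Math.min(maxApplications,
                    (int) (maxApplications * (userLimit / 100.0f) * userLimitFactor));
                System.out.println(maxApplicationsPerUser);  // 5300
              }
            }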
         
        Naganarasimha Naganarasimha G R added a comment -

        Thanks for your feedback Sunil G. The idea of considering the overall resource is that we do not want to limit the number of applications based on partition, as an AM can be submitted in a partition with more resources and request resources of partitions which are still limited, so we thought limiting the overall number of apps per leaf queue is better.

        I think, with above table, if we have max_apps_labelX=100, and similarly, will it make it simple? But i agree that we need to consider user level calculations also under label if we do this.

        Yes it makes it simple but IMO doesnt solve anything as explained earlier.

        still more apps can run in queueA. So if we directly go with clusterwise apps and share it with per label, could it affect some queue's which has more apps to run.

        Anyway max number of apps per queue can also be defined yarn.scheduler.capacity.<queue-path>.maximum-applications which can still behave in the same way i.e. max number of apps per queue = max number of apps per queue across any partition.
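
        For example, using the per-queue property mentioned above (queue names and values are illustrative):

            yarn.scheduler.capacity.root.queueA.maximum-applications=5000
            yarn.scheduler.capacity.root.queueB.maximum-applications=5000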

        sunilg Sunil G added a comment -

        Assume max_apps = 1000, labelX=100GB, labelY=50GB, default=20GB

                      queueA    queueB
        labelX        30%       70%
        labelY        100%      0%
        default       50%       50%

        For queueA, is Z calculated like (0.3 * 100GB + 1.0 * 50GB + 0.5 * 20GB) / (100GB + 50GB + 20GB) = 0.53? Please let me know whether I understood the calculation. If we use this, I think we are collapsing the resources of the different labels in the queue into one single percentage, so the per-label weightage is not considered. I think, with the above table, if we have max_apps_labelX=100, and similarly for the other labels, would it make it simpler? But I agree that we need to consider user-level calculations under each label too if we do this.

        sunilg Sunil G added a comment -

        I am still not completely sure about the issue with a per-label, per-queue max-applications limit.

        With the current approach, I can see a small problem. Since queues can host heterogeneous apps in terms of their resource consumption, I will try to show a scenario:
        queueA can run 1000 apps each taking 1GB of memory, while queueB can run maybe 10 apps which take 100GB each. If queueA and queueB are each given 50% capacity, still more apps can run in queueA. So if we directly take the cluster-wide max apps and share it per label, could it affect queues which have more apps to run? This is not a very likely case, but I would still like to point it out.

        Naganarasimha Naganarasimha G R added a comment - - edited

        Hi Ying Zhang,
        The difference between this jira and YARN-3216 is that one is about limiting the number of applications, while the other is about limiting the amount of cluster resources available for AMs. The latter is the more obvious case and is easily reproduced!

        Ying Zhang Ying Zhang added a comment - - edited

        Hi Bibin A Chundatt, we tried to use NodeLabels a while ago and have gone through the major related JIRAs. I thought this issue had already been solved by YARN-3216. Would you please elaborate on what the difference is here and why it isn't covered by YARN-3216? Thanks very much.

        Naganarasimha Naganarasimha G R added a comment -

        Thanks Bibin A Chundatt for looking into the issue.

        Get percentage capacity of queue as [ sum of resource of queue A all partition (X) / Total cluster resource in cluster (Y) ] = absolute percentage overall cluster (Z).

        +1 for the above approach, but please ensure that the calculations are optimized and don't happen too frequently...

        bibinchundatt Bibin A Chundatt added a comment -

        Thank you Sunil G for looking into the issue. Had an offline discussion with NGarla_Unused as well.

        It is always better to handle the application limits on the overall set of partitions:

        1. A submitted application can ask for AM resources from one partition and for other resources from another partition, so the limit should be at queue level.
        2. The user/tenant-level limit for applications should be based on the queue.
        3. The configuration (maxclusterapplication) is the cluster-wide application limit, and when we partition it down to lower levels it should be based on the queue.

        Approach
        Consider the combined absolute percentage across all partitions, not the average of the absolute percentage per partition, since label1 can be 10% of 20GB while the default partition can be 50% of 100GB. A sketch of the calculation follows the steps below.

        1. Get percentage capacity of queue as [ sum of resource of queue A all partition (X) / Total cluster resource in cluster (Y) ] = absolute percentage overall cluster (Z).
        2. Max applications of queue = Z * maxclusterapplication.
        3. The max application limit has to be updated on every NODE registration and removal.
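
        A minimal sketch of this calculation under assumed partition sizes and queue capacities (not the actual CapacityScheduler code):

            // Illustrative calculation of the queue-wide limit: Z = X / Y, then Z * maxclusterapplication.
            public class QueueMaxAppsSketch {
              public static void main(String[] args) {
                // Assumed cluster: default partition 100GB, label1 partition 20GB.
                double defaultPartitionGb = 100, label1PartitionGb = 20;
                double totalClusterGb = defaultPartitionGb + label1PartitionGb;   // Y = 120GB

                // Assumed queue capacities: 50% of default partition, 10% of label1.
                double queueResourceGb =
                    0.50 * defaultPartitionGb + 0.10 * label1PartitionGb;         // X = 52GB

                double z = queueResourceGb / totalClusterGb;                      // Z ~ 0.433
                int maxClusterApplications = 10000;                               // cluster-wide limit
                int queueMaxApplications = (int) (z * maxClusterApplications);    // 4333

                // Y (and hence Z) changes on node registration/removal, so this value
                // has to be recomputed at those points.
                System.out.println(queueMaxApplications);
              }
            }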
        sunilg Sunil G added a comment -

        Thanks Bibin A Chundatt for reporting this.

        We need a config for maximum applications per queue per label if we want to solve this problem cleanly. For the long term, this may be better. With this, we might also need to revisit metrics, UI etc. too.
        Otherwise we would need to introduce a few hacks for when the default-partition capacity is not configured.
        I prefer the first option. Thoughts?
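
        A sketch of what such a per-queue, per-label setting could look like; the property below is hypothetical and does not exist in CapacityScheduler today, it only illustrates the suggestion:

            yarn.scheduler.capacity.root.queueA.accessible-node-labels.labelx.maximum-applications=2000
            yarn.scheduler.capacity.root.queueA.accessible-node-labels.labely.maximum-applications=500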

        bibinchundatt Bibin A Chundatt added a comment -

        Application submission is handled as below; the maximum-application check is based on the queue's capacity in the default partition.
        LeafQueue#submitApplication

              // Check submission limits for queues
              if (getNumApplications() >= getMaxApplications()) {
                String msg = "Queue " + getQueuePath() + 
                " already has " + getNumApplications() + " applications," +
                " cannot accept submission of application: " + applicationId;
                LOG.info(msg);
                throw new AccessControlException(msg);
              }
        

        In LeafQueue#setupQueueConfigs, maxApplications is derived from the default partition's absolute capacity when the per-queue maximum-applications is not set.

            maxApplications = conf.getMaximumApplicationsPerQueue(getQueuePath());
            if (maxApplications < 0) {
              int maxSystemApps = conf.getMaximumSystemApplications();
              maxApplications =
                  (int) (maxSystemApps * queueCapacities.getAbsoluteCapacity());
            }
        

        We should consider the maximum absolute capacity across all partitions in this case.

        Any thoughts?
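
        A minimal sketch of that suggestion, with the per-partition absolute capacities modeled as a plain map (illustrative values; not the real QueueCapacities API):

            // Fall back to the maximum absolute capacity across partitions when the per-queue
            // limit is not configured, so a 0% default partition no longer forces a 0-app limit.
            import java.util.Map;

            public class MaxAcrossPartitionsSketch {
              public static void main(String[] args) {
                int maxSystemApps = 10000;          // yarn.scheduler.capacity.maximum-applications
                int maxApplicationsPerQueue = -1;   // per-queue limit not configured

                // Absolute capacity of the queue per partition (illustrative):
                // default partition 0%, partition "labelx" 50%.
                Map<String, Float> absoluteCapacityByPartition = Map.of("", 0.0f, "labelx", 0.5f);

                int maxApplications = maxApplicationsPerQueue;
                if (maxApplications < 0) {
                  float maxAbsoluteCapacity = 0f;
                  for (float capacity : absoluteCapacityByPartition.values()) {
                    maxAbsoluteCapacity = Math.max(maxAbsoluteCapacity, capacity);
                  }
                  maxApplications = (int) (maxSystemApps * maxAbsoluteCapacity);  // 5000 instead of 0
                }
                System.out.println(maxApplications);
              }
            }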


          People

          • Assignee:
            bibinchundatt Bibin A Chundatt
            Reporter:
            bibinchundatt Bibin A Chundatt
          • Votes:
            0
            Watchers:
            9
