Hadoop YARN / YARN-4882

Change the log level to DEBUG for recovering completed applications

    Details

    • Hadoop Flags:
      Reviewed

      Description

      I think recovering completed applications does not need to be logged at INFO; it can be logged at DEBUG instead. The problem seen on large clusters is that if any issue occurs during RM start-up and the RM keeps switching, the RM logs fill up almost entirely with application-recovery messages.
      Six lines are logged per application, as shown in the logs below, and the RM default for max-completed applications is 10K. So each switch adds 10K * 6 = 60K lines, which I feel is not useful.

      2016-03-01 10:20:59,077 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority level is set to application:application_1456298208485_21507
      2016-03-01 10:20:59,094 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering app: application_1456298208485_21507 with 1 attempts and final state = FINISHED
      2016-03-01 10:20:59,100 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Recovering attempt: appattempt_1456298208485_21507_000001 with final state: FINISHED
      2016-03-01 10:20:59,107 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1456298208485_21507_000001 State change from NEW to FINISHED
      2016-03-01 10:20:59,111 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1456298208485_21507 State change from NEW to FINISHED
      2016-03-01 10:20:59,112 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   OPERATION=Application Finished - Succeeded      TARGET=RMAppManager     RESULT=SUCCESS  APPID=application_1456298208485_21507
      

      The main problem is that important information from before the RM became unstable is lost from the logs. Even if the log rollover count is 50 or 100, all of those files are rolled over in a short period, and what remains contains only RM switching information, mostly about recovering applications.

      I suggest that, at a minimum, recovery of completed applications should be logged at DEBUG.
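
      For illustration, a minimal sketch of the kind of guard being proposed (this is not any of the attached patches; the helper name is hypothetical, and the commons-logging Log and usual imports mirror what the RM classes already use):

      // Sketch only, not the attached patch. logRecoveredApp is a
      // hypothetical helper; Log/LogFactory mirror the existing RM classes.
      private static final Log LOG = LogFactory.getLog(RMAppImpl.class);

      private void logRecoveredApp(ApplicationId appId, int attemptCount,
          RMAppState recoveredFinalState) {
        // Build and emit the per-app line only when DEBUG is enabled, so 10K
        // completed applications don't flood the INFO log on every switch.
        if (LOG.isDebugEnabled()) {
          LOG.debug("Recovering app: " + appId + " with " + attemptCount
              + " attempts and final state = " + recoveredFinalState);
        }
      }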

      1. YARN-4882.005.patch
        7 kB
        Daniel Templeton
      2. YARN-4882.004.patch
        7 kB
        Daniel Templeton
      3. YARN-4882.003.patch
        7 kB
        Daniel Templeton
      4. YARN-4882.002.patch
        7 kB
        Daniel Templeton
      5. YARN-4882.001.patch
        6 kB
        Daniel Templeton

        Activity

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11051 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11051/)
        YARN-4882. Change the log level to DEBUG for recovering completed (rkanter: rev f216276d2164c6564632c571fd3adbb03bc8b3e4)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
        templedf Daniel Templeton added a comment -

        Thanks, Robert Kanter!

        rkanter Robert Kanter added a comment -

        Thanks Daniel Templeton. Committed to trunk and branch-2!

        rkanter Robert Kanter added a comment -

        +1

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 12m 31s trunk passed
        +1 compile 0m 32s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 33s trunk passed
        +1 mvneclipse 0m 15s trunk passed
        +1 findbugs 0m 58s trunk passed
        +1 javadoc 0m 21s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 29s the patch passed
        +1 javac 0m 29s the patch passed
        +1 checkstyle 0m 21s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 252 unchanged - 2 fixed = 252 total (was 254)
        +1 mvnsite 0m 31s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 3s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 38m 59s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        59m 51s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-4882
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12844988/YARN-4882.005.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 271abc327851 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 9ca54f4
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14486/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14486/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14486/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        templedf Daniel Templeton added a comment -

        Changed the colon in the attempt message to an equals.

        templedf Daniel Templeton added a comment -

        Test failures are unrelated, and the lack of tests is because it's only changing log messages.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 13s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 13m 46s trunk passed
        +1 compile 0m 35s trunk passed
        +1 checkstyle 0m 25s trunk passed
        +1 mvnsite 0m 36s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 1m 6s trunk passed
        +1 javadoc 0m 26s trunk passed
        +1 mvninstall 0m 31s the patch passed
        +1 compile 0m 30s the patch passed
        +1 javac 0m 30s the patch passed
        +1 checkstyle 0m 20s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 252 unchanged - 2 fixed = 252 total (was 254)
        +1 mvnsite 0m 31s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 1s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 39m 35s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        61m 58s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart
          hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-4882
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12844824/YARN-4882.004.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 2047ba30c913 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 1bbd023
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14480/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14480/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14480/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        templedf Daniel Templeton added a comment -

        Fixed a typo.

        templedf Daniel Templeton added a comment -

        I rebased and made the log messages line up. I intentionally didn't modify them because I didn't want to mess with any scripts that were doing automatic parsing. But, hey, it's Hadoop 3!

        rkanter Robert Kanter added a comment -

        I think we should make the messages consistent between RMAppImpl and RMAppAttemptImpl:

        STATE_CHANGE_MESSAGE = "%s State change from %s to %s on event=%s";
        STATE_CHANGE_MESSAGE = "%s State change from %s to %s";
        
        RECOVERY_MESSAGE = "Recovering app: %s with %d attempts and final state = %s";
        RECOVERY_MESSAGE = "Recovering attempt: %s with final state: %s";
        
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 39s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 11m 55s trunk passed
        +1 compile 0m 31s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 37s trunk passed
        +1 mvneclipse 0m 16s trunk passed
        +1 findbugs 0m 57s trunk passed
        +1 javadoc 0m 19s trunk passed
        +1 mvninstall 0m 30s the patch passed
        +1 compile 0m 28s the patch passed
        +1 javac 0m 28s the patch passed
        +1 checkstyle 0m 21s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 265 unchanged - 2 fixed = 265 total (was 267)
        +1 mvnsite 0m 34s the patch passed
        +1 mvneclipse 0m 13s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 2s the patch passed
        +1 javadoc 0m 17s the patch passed
        +1 unit 34m 39s hadoop-yarn-server-resourcemanager in the patch passed.
        +1 asflicense 0m 14s The patch does not generate ASF License warnings.
        54m 34s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12834721/YARN-4882.002.patch
        JIRA Issue YARN-4882
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux b2de176f60ac 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / f63cd78
        Default Java 1.8.0_101
        findbugs v3.0.0
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13469/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13469/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        templedf Daniel Templeton added a comment -

        New patch to address comments.

        templedf Daniel Templeton added a comment - - edited

        That's where I started, but I didn't want to create the message unless we're actually using it. Hmmm... Seems to me there's already a helper method for that: String.format()...

        kasha Karthik Kambatla added a comment -

        The following snippet seems more complicated:

              if ((oldState != getState()) &&
                  ((event.getType() != RMAppEventType.RECOVER) ||
                  (recoveredFinalState != null) || LOG.isDebugEnabled())) {
                String message = appID + " State change from " + oldState + " to "
                    + getState() + " on event=" + event.getType();
        
                // Log at INFO if we're not recovering or not in a terminal state.
                // Log at DEBUG otherwise.
                if ((event.getType() != RMAppEventType.RECOVER) ||
                    (recoveredFinalState != null)) {
                  LOG.info(message);
                } else {
                  LOG.debug(message);
                }
        

        How about the following instead? If we want to, we could move creating of the message to a helper method called from LOG.info() or LOG.debug().

              if (oldState != getState()) {
                String message = appID + " State change from " + oldState + " to "
                    + getState() + " on event=" + event.getType();
        
                // Log at INFO if we're not recovering or not in a terminal state.
                // Log at DEBUG otherwise.
                if (event.getType() != RMAppEventType.RECOVER ||
                    recoveredFinalState != null) {
                  LOG.info(message);
                } else if (LOG.isDebugEnabled()){
                  LOG.debug(message);
                }
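
        For what it's worth, a sketch of the helper-method variant mentioned above (the helper name is invented; the fields and event types come from the snippet itself):

        // Hypothetical helper: the message is built only by the branch that
        // actually logs it.
        private String stateChangeMessage(RMAppState oldState, RMAppEvent event) {
          return appID + " State change from " + oldState + " to " + getState()
              + " on event=" + event.getType();
        }

        if (oldState != getState()) {
          if (event.getType() != RMAppEventType.RECOVER
              || recoveredFinalState != null) {
            LOG.info(stateChangeMessage(oldState, event));
          } else if (LOG.isDebugEnabled()) {
            LOG.debug(stateChangeMessage(oldState, event));
          }
        }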
        
        templedf Daniel Templeton added a comment -

        Please have a look to see if this fits the bill.

        jlowe Jason Lowe added a comment -

        let us start with debug logging success cases. If we ever run into a case of needing logs for success cases, let us add a separate logger.

        Yep, sounds good.

        kasha Karthik Kambatla added a comment -

        Thanks for your inputs, Jason Lowe. I would say, let us start with debug logging success cases. If we ever run into a case of needing logs for success cases, let us add a separate logger.

        templedf Daniel Templeton added a comment -

        Thanks for clarifying, Jason Lowe. It sounds like the general opinion here is that as long as we make recovery failures loud and recovery successes quiet, we should be fine. If we can additionally add some unobtrusive summary stats (including timing), that would be swell.

        If it's really a make or break, I'll create a new logger, but I'm fairly certain it'll just be added to the long list of things that are theoretically configurable in YARN that no one actually ever configures.

        BTW, it doesn't look to me like the recoveries are written to the audit log.

        rohithsharma Rohith Sharma K S added a comment -

        One of the thoughts I had in favor of a separate log file is that, after changing the log level to DEBUG, there is no way to identify which application's recovery failed. For example, if one application's state is corrupted, we need at least the application id of the failed recovery so that an admin can remove that application from the state store and bring the RM cluster back up.
        The difficult part is that, since the recovery flow is the same for both completed and running applications, changing the log level to DEBUG impacts both. So we should be very cautious about which logs are changed to DEBUG.

        jlowe Jason Lowe added a comment -

        The main motivation for proposing a separate logger is to allow finer control in cases where someone wants to see this recovery output but not enable DEBUG/TRACE for the entire module. We did something similar in the MapReduce ShuffleHandler so we could audit shuffle transfers without enabling debug for all ShuffleHandler operations (like verifying HTTP headers, etc.). Admins can then configure it to work like it does today (i.e.: enable DEBUG/TRACE on the new logger and leave it going to the same log file as everything else), use a separate log file for that logger, or leave it disabled.

        IIRC recovered completed applications are re-logged to the RM audit logger, so we may not really need another log at least for the completed case. I'm OK if we want to lower successful recoveries to the DEBUG/TRACE level even without a separate logger for them.
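
        As a concrete sketch of the separate-logger idea (the logger name and the surrounding code are assumptions, not from any patch), recovery messages could go to their own named logger, whose level and appender admins can then configure independently of the rest of the RM:

        // Hypothetical dedicated recovery logger, similar in spirit to the
        // ShuffleHandler audit logger mentioned above. Its level and appender
        // can be set in log4j.properties without enabling DEBUG module-wide.
        private static final Log RECOVERY_LOG = LogFactory.getLog(
            "org.apache.hadoop.yarn.server.resourcemanager.recovery.app");

        if (RECOVERY_LOG.isDebugEnabled()) {
          RECOVERY_LOG.debug("Recovered " + appId + " with final state "
              + recoveredFinalState);
        }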

        kasha Karthik Kambatla added a comment -

        Was debugging something else in the RM and saw this. We should definitely drive towards reducing the log spew.

        In terms of the discussion here, in my experience:

        1. We need logs for applications that fail to recover so we know why they failed to recover.
        2. We don't really need logs for applications that recovered fine. At least so far, I haven't run into any issues with the RM recovering a job that should not have been. Does anyone have experience with these? I am fine with logging at DEBUG or even TRACE level.
        3. It might be nice to log something about how many applications have been recovered successfully and how many failed. And, since I am expressing wishes here, maybe also the time we spent recovering applications.

        App recovery failures should be less frequent, and I wonder if we need a separate log file just for those. Also, in my experience, the RM can itself fail if the app recovery fails. It is infinitely easier to have all the info in one log file at debug time, as opposed to looking at two log files and tracking timestamps. Are there any major advantages to a separate log file that I am missing?

        Jason Lowe, Rohith Sharma K S, Varun Vasudev, Daniel Templeton - can we reconsider this and come to a consensus?

        rohithsharma Rohith Sharma K S added a comment -

        One of the reasons for redirecting recovered-application messages to a separate log file is that, by the time you realize there is a problem in the cluster, the log files are already flooded, especially when RM HA is configured. I agree that changing the log level would help, but it won't solve our log-flooding problem.

        rohithsharma Rohith Sharma K S added a comment -

        Apologies for missing your comment. Yes, your understanding is right. Initially, when I raised this JIRA, I had in mind that we don't want to flood the logs. But considering that issues can happen during recovery (possibly due to a bug in the RM as well), these logs help identify which application's recovery caused the failure. That is why I voted for keeping the logs in a separate file.

        Are the recovery logs used for anything other than diagnosing a failed recovery?

        AFAIK, the recovery logs for completed applications are used only for diagnosis when recovery fails. They also help in debugging running applications in cases where an application hangs or is not being allocated any containers.

        templedf Daniel Templeton added a comment -

        I'm going to assume the silence means I captured it pretty well.

        We don't want to flood the logs with an intractable number of log messages during recovery

        This one is clearly solved by both having an extra log file and just dialing down the log level.

        We need to be able to identify bad applications in the case that recovery fails

        As long as we don't dial down the log level for recovery failures, both solutions seem to address this objective as well. On the point that sometimes knowing what didn't fail is useful in a failed recovery, let me ask a question. If the recovery fails, the RM fails to start, right? If the RM fails to start, it's possible to change the log level before starting it again, if getting a list of the successful recoveries is helpful, right? And since that recovery will also fail, it's possible to reset the log level before the final restart after resolving the issue. Is there a scenario where the RM starts successfully but the list of recovered apps is still useful?

        I dislike the idea of adding an extra log file and a property to enable it to the admin's plate for the sole purpose of logging successful recoveries, when that information is not commonly useful, and when the same information could be retrieved through a well known existing mechanism (changing the log level).

        I propose that we streamline the log messages to be useful and succinct. The RM should detail the recovery statistics at the info level, recovery failures at the warn or error level, and recovery successes at the debug level. The log messages should also be reworked to include as much information as possible to assist in debugging failures while being less chatty.
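
        To make that concrete, here is a rough sketch (the counters and helper names are illustrative, not from any attached patch) of the proposed tiering around the recovery loop:

        // Illustrative only: summary at INFO, per-app success at DEBUG,
        // failures at ERROR with the application id.
        int recovered = 0;
        int failed = 0;
        long start = System.currentTimeMillis();
        for (ApplicationId appId : appIdsInStateStore) {   // hypothetical source
          try {
            recoverOneApplication(appId);                  // hypothetical helper
            recovered++;
            if (LOG.isDebugEnabled()) {
              LOG.debug("Recovered application " + appId);
            }
          } catch (Exception e) {
            failed++;
            LOG.error("Failed to recover application " + appId, e);
          }
        }
        LOG.info("Application recovery finished: " + recovered + " recovered, "
            + failed + " failed, in "
            + (System.currentTimeMillis() - start) + " ms");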

        Any other thoughts?

        templedf Daniel Templeton added a comment -

        Humor me for a second. I'd like to turn this conversation around and start at the beginning. What are our design objectives? Here's what I've heard so far:

        • We don't want to flood the logs with an intractable number of log messages during recovery
        • We need to be able to identify bad applications in the case that recovery fails

        Any other objectives I'm missing? Are the recovery logs used for anything other than diagnosing a failed recovery?

        rohithsharma Rohith Sharma K S added a comment -

        I vote for the 2nd option and/or the hybrid solution; a separate log file would be better. Many times we have seen recovery fail because of state store corruption. In that case the RM never comes up. We need the application id of the corrupted application so it can be deleted from the state store and at least the RM can be brought back up.

        vvasudev Varun Vasudev added a comment -

        My view is that we should only log progress in the RM log. I would prefer it if we did log a single line for every completed app, but for now I don't mind skipping completed apps. Logs for completed apps have been helpful in debugging when we don't know whether the RM knows about an app, but those cases have been rare. I think the benefit of using a different log file for recovery is that if tomorrow we do decide to log a line for every app, we won't need to worry about flooding the RM log like we currently do.

        Like Jason mentioned earlier, a hybrid solution that adds an optional logger which sends recovery logs to another file but is turned off by default, and logs only non-completed apps to the RM log, might be the best option?

        templedf Daniel Templeton added a comment -

        Varun Vasudev, help me understand what information from the successful recovery of applications is useful in debugging a failure. What in particular about a successfully recovered application would you want to know when debugging a recovery failure? I have never had the pleasure of debugging a failed recovery before.

        vvasudev Varun Vasudev added a comment -

        I vote for re-routing all recovery logs to a new file and logging only a progress message in the RM log (something like "processed 1000 of 10,000 apps" or similar). Even a summary log line per app will result in a huge number of entries in the RM log.

        templedf Daniel Templeton added a comment -

        Jason Lowe, I think your first point is dead on. The log output should have a summary at the info level, full recovery logging at the debug level, and anything less than a successful recovery at the warn or error level, depending. (I just assumed the last bit was already happening.) I don't see where having full logs about the things that worked as expected will help in debugging, unless we screw something up and let an application slip through the cracks. (Even then, figuring out which one's missing from the 10,000 x 6 lines of log output is non-trivial.) So let's make sure we don't let anything slip through the cracks.

        jlowe Jason Lowe added a comment -

        The problem with saying we can up the log level to DEBUG is that debugging recovery will often be a postmortem process, trying to figure out why a particular application wasn't recovered or didn't end up in the right state. That also means logging a single rollup value for all apps is not going to be very helpful when the problem is figuring out what happened to a particular app. By the time one wants to re-run the scenario with DEBUG turned on, the state store is going to be significantly different.

        I agree the amount of logging is a bit excessive today and can be reduced, but I also think it's important to track how active applications were recovered. I can think of a couple of potential options:

        1) Reduce the logging for completed applications, potentially just one line per app or just a rollup number of completed apps. Completed applications dominate the current recovery logging and are less interesting. For debugging, one can presume that an app that wasn't logged during recovery either was missing from the state store or was recovered as completed.

        2) Use a separate logger for recovery activity which allows users to control the amount of logging pertaining to recovery (i.e.: via log level) and potentially use a separate log file just for recovery

        And of course we could do some combination of the above options, both reducing the recovery logging we have today and allowing the user to control it separately from normal RM logging.

        templedf Daniel Templeton added a comment -

        When debugging recovery, couldn't one up the log level to DEBUG?

        vvasudev Varun Vasudev added a comment -

        I think instead of changing the log level, we should redirect those statements to another file. They are useful when debugging recovery.

        cc Jason Lowe

        rohithsharma Rohith Sharma K S added a comment -

        Just thinking: in case the recovered-application logs are important to some users, how about logging them to a different file rather than the RM log?

        rohithsharma Rohith Sharma K S added a comment -

        Sure, go ahead.

        templedf Daniel Templeton added a comment -

        I'm happy to take this one.

        templedf Daniel Templeton added a comment -

        I agree. The current logging should be at the debug level. There should also be a summary statement logged at the info level, like "9999 applications successfully recovered."


          People

          • Assignee:
            templedf Daniel Templeton
            Reporter:
            rohithsharma Rohith Sharma K S
          • Votes:
            0
            Watchers:
            11

            Dates

            • Created:
              Updated:
              Resolved:
