  Hadoop YARN / YARN-6031

Application recovery has failed when node label feature is turned off during RM recovery

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.8.0
    • Fix Version/s: 2.9.0, 3.0.0-alpha4, 2.8.2
    • Component/s: scheduler
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Here are the repro steps:
      Enable node label, restart RM, configure CS properly, and run some jobs;
      Disable node label, restart RM, and the following exception is thrown:

      Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
              at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              ... 10 more
      

      During RM restart, application recovery failed because the application had a node label expression specified while node labels had been disabled.
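
      For illustration, here is a minimal, self-contained sketch of the recovery-time handling discussed in this issue (illustrative only, not the actual RMAppManager code; RecoveryLabelGuard, LabelDisabledException, validate() and recoverApp() are placeholder names): during recovery the label-validation failure is suppressed and the application is rejected with a clear diagnostic, instead of the whole RM recovery failing.

          // Illustrative sketch only; the class, exception and method names are
          // placeholders, not Hadoop APIs.
          final class RecoveryLabelGuard {
            static final class LabelDisabledException extends RuntimeException {
            }

            // Mirrors the validation step: reject a label expression when node
            // labels are turned off in the configuration.
            static void validate(String labelExpression, boolean nodeLabelsEnabled) {
              if (!nodeLabelsEnabled && labelExpression != null
                  && !labelExpression.isEmpty()) {
                throw new LabelDisabledException();
              }
            }

            // During recovery the exception is swallowed so RM startup continues;
            // the application itself is later rejected with a clear message.
            static void recoverApp(String appId, String labelExpression,
                boolean nodeLabelsEnabled) {
              try {
                validate(labelExpression, nodeLabelsEnabled);
              } catch (LabelDisabledException e) {
                System.out.println("Failed to recover application " + appId
                    + ". NodeLabel is not enabled in cluster, but AM resource"
                    + " request contains a label expression.");
              }
            }

            public static void main(String[] args) {
              // Node labels were enabled at submission time but are disabled now.
              recoverApp("application_1484000000000_0001", "gpu", false);
            }
          }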

      1. YARN-6031.001.patch
        3 kB
        Ying Zhang
      2. YARN-6031.002.patch
        4 kB
        Ying Zhang
      3. YARN-6031.003.patch
        9 kB
        Ying Zhang
      4. YARN-6031.004.patch
        9 kB
        Ying Zhang
      5. YARN-6031.005.patch
        9 kB
        Ying Zhang
      6. YARN-6031.006.patch
        9 kB
        Ying Zhang
      7. YARN-6031.007.patch
        9 kB
        Ying Zhang
      8. YARN-6031-branch-2.8.001.patch
        10 kB
        Ying Zhang

        Issue Links

          Activity

          sunilg Sunil G added a comment -

          I was thinking whether we can let the app continue to run

          Thanks Jian He. We can let the app run and account resources under NO_LABEL. I will open a new JIRA to track the same.

          jianhe Jian He added a comment -

          In RMAppManager#createAndPopulateNewRMApp, the app is just created whether it's in submission/recovery mode. The attempt is not yet created. Hence I think this won't be a problem.

          The scenario is this: the RMApp is now transitioned to failed and the state is persisted in the store, but the attempt state is still null. If the admin later re-enables node labels, the RMApp will be recovered as FAILED, but the attempt state will be NULL.

          Hence recovery for other apps will also continue and we will have context of this app as well.

          Killing an app for a mistake of the admin may be harsh from the perspective of a service app, as all service containers will be killed. I was thinking whether we can let the app continue to run - existing containers will keep running fine, and new requests with a label will be rejected. I guess we can surface this as a diagnostic to the user?

          sunilg Sunil G added a comment -

          Hi Jian He,
          A few doubts here:

          1. Below code catches InvalidLabelResourceRequestException and assumes that the error is because node-label becomes disabled

          This code snippet catches InvalidLabelResourceRequestException and suppresses it only in the case of recovery. If the AMResourceRequest was stored in the state store, it means that validateAndCreateResourceRequest was successful when the app was submitted. Now during recovery, the same will throw an error only when node labels are disabled by conf. If it's in the store, we can assume that the AM request is sane enough. Could you please give more context on what other scenario could also throw the same exception during recovery?
          On another note, if it is not recovery (throw e;), we throw the same exception back.

          2. Below code directly transitions app to failed by using a Rejected event. The attempt state is not moved to failed

          In RMAppManager#createAndPopulateNewRMApp, the app is just created whether it's in submission/recovery mode. The attempt is not yet created. Hence I think this won't be a problem.

          3. Is it ok to let the app continue in this scenario, it's less disruptive to the apps.

          Currently an exception was thrown and the RM was losing the context of such an app. To record and track such an app, we create the app and move it to the failed state. Hence recovery for other apps will also continue, and we will have the context of this app as well.

          jianhe Jian He added a comment - edited

          Ran into this patch when debugging the same issue; got a few questions:
          cc Sunil G, Ying Zhang
          1. Below code catches InvalidLabelResourceRequestException and assumes that the error is because node-label became disabled, but the same InvalidLabelResourceRequestException can be thrown for other reasons too, right? In that case, the following logic becomes invalid.

                amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
              } catch (InvalidLabelResourceRequestException e) {
                // This can happen if the application had been submitted and run
                // with Node Label enabled but recover with Node Label disabled.
                // Thus there might be node label expression in the application's
                // resource requests. If this is the case, create RmAppImpl with
                // null amReq and reject the application later with clear error
                // message. So that the application can still be tracked by RM
                // after recovery and user can see what's going on and react accordingly.
                if (isRecovery &&
                    !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
                  if (LOG.isDebugEnabled()) {
                    LOG.debug("AMResourceRequest is not created for " + applicationId
                        + ". NodeLabel is not enabled in cluster, but AM resource "
                        + "request contains a label expression.");
                  }
                } else {
                  throw e;
                }
          

          2. Below code directly transitions the app to failed by using a Rejected event. The attempt state is not moved to failed; it'll be stuck there? I think we need to send a KILL event instead of a REJECT event.

                if (labelExp != null &&
                    !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
                  String message = "Failed to recover application " + appId
                      + ". NodeLabel is not enabled in cluster, but AM resource request "
                      + "contains a label expression.";
                  LOG.warn(message);
                  application.handle(
                      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
                  return;
                }
          

          3. Is it OK to let the app continue in this scenario? It's less disruptive to the apps. What's the disadvantage if we let the app continue?

          vinodkv Vinod Kumar Vavilapalli added a comment -

          2.8.1 became a security release. Moving fix-version to 2.8.2 after the fact.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 19s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 57s branch-2.8 passed
          +1 compile 0m 31s branch-2.8 passed with JDK v1.8.0_121
          +1 compile 0m 34s branch-2.8 passed with JDK v1.7.0_121
          +1 checkstyle 0m 20s branch-2.8 passed
          +1 mvnsite 0m 39s branch-2.8 passed
          +1 mvneclipse 0m 18s branch-2.8 passed
          +1 findbugs 1m 13s branch-2.8 passed
          +1 javadoc 0m 21s branch-2.8 passed with JDK v1.8.0_121
          +1 javadoc 0m 25s branch-2.8 passed with JDK v1.7.0_121
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 27s the patch passed with JDK v1.8.0_121
          +1 javac 0m 27s the patch passed
          +1 compile 0m 32s the patch passed with JDK v1.7.0_121
          +1 javac 0m 32s the patch passed
          +1 checkstyle 0m 17s the patch passed
          +1 mvnsite 0m 37s the patch passed
          +1 mvneclipse 0m 15s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 25s the patch passed
          +1 javadoc 0m 19s the patch passed with JDK v1.8.0_121
          +1 javadoc 0m 22s the patch passed with JDK v1.7.0_121
          -1 unit 76m 51s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_121.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          171m 2s



          Reason Tests
          JDK v1.8.0_121 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens
          JDK v1.7.0_121 Failed junit tests hadoop.yarn.server.resourcemanager.TestAMAuthorization
            hadoop.yarn.server.resourcemanager.TestClientRMTokens



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5af2af1
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12851578/YARN-6031-branch-2.8.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 1d714a427b8a 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / 2bbcaa8
          Default Java 1.7.0_121
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_121 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14860/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_121.txt
          JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14860/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14860/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment -

          Thanks Ying Zhang for the contribution, and thanks Daniel Templeton, Wangda Tan and Bibin A Chundatt for the additional review.

          sunilg Sunil G added a comment -

          Test case failures are known and not related to the patch for branch-2.8 (JDK 7).

          Committing to branch-2.8

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 27s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 53s branch-2.8 passed
          +1 compile 0m 34s branch-2.8 passed with JDK v1.8.0_121
          +1 compile 0m 32s branch-2.8 passed with JDK v1.7.0_121
          +1 checkstyle 0m 19s branch-2.8 passed
          +1 mvnsite 0m 37s branch-2.8 passed
          +1 mvneclipse 0m 16s branch-2.8 passed
          +1 findbugs 1m 10s branch-2.8 passed
          +1 javadoc 0m 20s branch-2.8 passed with JDK v1.8.0_121
          +1 javadoc 0m 24s branch-2.8 passed with JDK v1.7.0_121
          +1 mvninstall 0m 30s the patch passed
          +1 compile 0m 26s the patch passed with JDK v1.8.0_121
          +1 javac 0m 26s the patch passed
          +1 compile 0m 29s the patch passed with JDK v1.7.0_121
          +1 javac 0m 29s the patch passed
          +1 checkstyle 0m 16s the patch passed
          +1 mvnsite 0m 36s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 23s the patch passed
          +1 javadoc 0m 21s the patch passed with JDK v1.8.0_121
          +1 javadoc 0m 21s the patch passed with JDK v1.7.0_121
          -1 unit 75m 29s hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_121.
          +1 asflicense 0m 20s The patch does not generate ASF License warnings.
          168m 19s



          Reason Tests
          JDK v1.8.0_121 Failed junit tests hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption
            hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization
          JDK v1.7.0_121 Failed junit tests hadoop.yarn.server.resourcemanager.TestClientRMTokens
            hadoop.yarn.server.resourcemanager.TestAMAuthorization



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:5af2af1
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12851556/YARN-6031-branch-2.8.001.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 034939e402a1 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision branch-2.8 / 2bbcaa8
          Default Java 1.7.0_121
          Multi-JDK versions /usr/lib/jvm/java-8-oracle:1.8.0_121 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_121
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14859/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdk1.7.0_121.txt
          JDK v1.7.0_121 Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14859/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14859/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment -

          Thanks Ying Zhang for the clarification. Makes sense to me.

          A new ticket could be raised to track the test case improvement. I will wait for Jenkins on the branch-2.8 patch.

          Ying Zhang Ying Zhang added a comment - edited

          I think it is a separate question. Whether or not we backport YARN-4805, the test case itself can be improved to avoid running with FairScheduler.

          sunilg Sunil G added a comment -

          Or do we need to backport YARN-4805 to branch-2.8?

          Ying Zhang Ying Zhang added a comment -

          Hi Sunil G, sorry for the late reply (was out for the Spring Festival holiday). Here is the patch for branch-2.8, please have a look.
          I've found a problem with the test case when making the patch for branch-2.8. TestRMRestart runs all test cases for CapacityScheduler and FairScheduler respectively, and this test case can only run successfully for CapacityScheduler since it involves running an application with a node label specified. On trunk, we don't see this problem because, due to YARN-4805, TestRMRestart now only runs for CapacityScheduler. I've modified the test case a little bit to run only when it is CapacityScheduler.

            public void testRMRestartAfterNodeLabelDisabled() throws Exception {
              // Skip this test case if it is not CapacityScheduler since NodeLabel is
              // not fully supported yet for FairScheduler and others.
              if (!getSchedulerType().equals(SchedulerType.CAPACITY)) {
                return;
              }
          ...
          

          We should probably make this change on trunk too. Let me know whether you want to make the change through this JIRA, or whether I need to open another JIRA to address it.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11159 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11159/)
          YARN-6031. Application recovery has failed when node label feature is (sunilg: rev 3fa0d540dfca579f3c2840a959b748a7528b02ed)

          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
          • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
          sunilg Sunil G added a comment -

          Committed to trunk/branch-2. However, the patch does not apply cleanly to branch-2.8. Ying Zhang, please help to share a branch-2.8 patch.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 14m 49s trunk passed
          +1 compile 0m 37s trunk passed
          +1 checkstyle 0m 24s trunk passed
          +1 mvnsite 0m 52s trunk passed
          +1 mvneclipse 0m 20s trunk passed
          +1 findbugs 1m 39s trunk passed
          +1 javadoc 0m 30s trunk passed
          +1 mvninstall 0m 50s the patch passed
          +1 compile 0m 49s the patch passed
          +1 javac 0m 49s the patch passed
          +1 checkstyle 0m 27s the patch passed
          +1 mvnsite 0m 50s the patch passed
          +1 mvneclipse 0m 17s the patch passed
          +1 whitespace 0m 1s The patch has no whitespace issues.
          +1 findbugs 1m 46s the patch passed
          +1 javadoc 0m 27s the patch passed
          +1 unit 41m 59s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 21s The patch does not generate ASF License warnings.
          68m 47s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12848469/YARN-6031.007.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 15390e066e0a 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / b01514f
          Default Java 1.8.0_111
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14720/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14720/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment -

          +1 for latest patch. Pending jenkins.

          sunilg Sunil G added a comment -

          Yes. It fell off my radar.

          However, the patch looks stale. Could you please rebase it to trunk?

          Ying Zhang Ying Zhang added a comment -

          Hi Sunil G, would you please help to push this forward?

          Ying Zhang Ying Zhang added a comment - edited

          Failed test case (TestRMRestart.testFinishedAppRemovalAfterRMRestart) is known and tracked by YARN-5548.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 1s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 13m 4s trunk passed
          +1 compile 0m 33s trunk passed
          +1 checkstyle 0m 22s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 1m 0s trunk passed
          +1 javadoc 0m 21s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 33s the patch passed
          +1 javac 0m 33s the patch passed
          +1 checkstyle 0m 20s the patch passed
          +1 mvnsite 0m 32s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 5s the patch passed
          +1 javadoc 0m 19s the patch passed
          -1 unit 39m 41s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          61m 19s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846782/YARN-6031.006.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 808f0406b80d 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / be529da
          Default Java 1.8.0_111
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14639/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14639/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14639/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment -

          Patch generally looks fine to me. Will wait for Jenkins to kick off.
          Also, I will wait for a day in case others have comments as well.

          Ying Zhang Ying Zhang added a comment -

          Thanks Sunil G. Done. I was thinking that LOG.debug can do this check on its own, but we can always do it beforehand and follow the current code style in the RM.

          hadoopqa Hadoop QA added a comment -
          +1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 23s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 14m 24s trunk passed
          +1 compile 0m 35s trunk passed
          +1 checkstyle 0m 23s trunk passed
          +1 mvnsite 0m 37s trunk passed
          +1 mvneclipse 0m 17s trunk passed
          +1 findbugs 1m 5s trunk passed
          +1 javadoc 0m 23s trunk passed
          +1 mvninstall 0m 33s the patch passed
          +1 compile 0m 34s the patch passed
          +1 javac 0m 34s the patch passed
          +1 checkstyle 0m 21s the patch passed
          +1 mvnsite 0m 35s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 10s the patch passed
          +1 javadoc 0m 19s the patch passed
          +1 unit 41m 14s hadoop-yarn-server-resourcemanager in the patch passed.
          +1 asflicense 0m 18s The patch does not generate ASF License warnings.
          64m 50s



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846768/YARN-6031.005.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux ee8bf4a7ffe6 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 467f5f1
          Default Java 1.8.0_111
          findbugs v3.0.0
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14638/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14638/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          sunilg Sunil G added a comment - edited

          Quick correction: could you also please add a LOG.isDebugEnabled() check before logging?

          Ying Zhang Ying Zhang added a comment - edited

          Thanks Sunil G. Done and uploaded a new patch.

          sunilg Sunil G added a comment -

          Hi Ying Zhang
          When InvalidResourceRequestException is thrown from validateAndCreateResourceRequest, it is certain that amReq is not updated, so amReq will be null. I think you can write a debug log and come out.

          Ying Zhang Ying Zhang added a comment -

          For the findbugs error, it might be good to keep the null check in case a later code change breaks the assumption.
          Test failure is known and tracked by YARN-5548.
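
          For context, a tiny standalone example of the pattern findbugs is flagging (hypothetical code, not the actual RMAppManager source; InvalidLabelException and createAmReq() are placeholders): inside the catch block amReq can only be null, so a further null check there is reported as redundant even though it documents intent.

              // Hypothetical illustration of the findbugs "redundant null check"
              // warning; the names below are placeholders, not Hadoop APIs.
              final class RedundantNullCheckDemo {
                static final class InvalidLabelException extends RuntimeException {
                }

                static String createAmReq(boolean labelsEnabled) {
                  if (!labelsEnabled) {
                    throw new InvalidLabelException();
                  }
                  return "amReq";
                }

                public static void main(String[] args) {
                  String amReq = null;
                  try {
                    amReq = createAmReq(false);
                  } catch (InvalidLabelException e) {
                    // amReq is provably null here, so findbugs reports this check
                    // as redundant even though it guards against future changes.
                    if (amReq == null) {
                      System.out.println("AM resource request was not created");
                    }
                  }
                }
              }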

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 17s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 13m 8s trunk passed
          +1 compile 0m 34s trunk passed
          +1 checkstyle 0m 22s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 0m 59s trunk passed
          +1 javadoc 0m 20s trunk passed
          +1 mvninstall 0m 31s the patch passed
          +1 compile 0m 30s the patch passed
          +1 javac 0m 30s the patch passed
          +1 checkstyle 0m 20s the patch passed
          +1 mvnsite 0m 32s the patch passed
          +1 mvneclipse 0m 13s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          -1 findbugs 1m 5s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
          +1 javadoc 0m 18s the patch passed
          -1 unit 38m 55s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 16s The patch does not generate ASF License warnings.
          60m 30s



          Reason Tests
          FindBugs module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
            Redundant nullcheck of amReq which is known to be null in org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(ApplicationSubmissionContext, long, String, boolean, long) Redundant null check at RMAppManager.java:is known to be null in org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(ApplicationSubmissionContext, long, String, boolean, long) Redundant null check at RMAppManager.java:[line 404]
          Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846735/YARN-6031.004.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 4060267ed344 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 4db119b
          Default Java 1.8.0_111
          findbugs v3.0.0
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/14635/artifact/patchprocess/new-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14635/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14635/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14635/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Ying Zhang Ying Zhang added a comment -

          Thanks very much Sunil G for the quick review. Comments addressed in the new patch.

          sunilg Sunil G added a comment -

          Patch looks generally fine.

          A few minor nits:

          1.

          String message = "Failed to recover application " + appId
            + ". Node label not enabled but request contains label expression "
            + labelExp + ".";
          

          I think it could be written as "Failed to recover application <appId>. NodeLabel is not enabled in cluster, but AM resource request contains a label expression."
          2. amReqInAppContext -> amReqFromAppContext
          3. In line with comment 1, please update the exception message in createAndPopulateNewRMApp.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 14s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 12m 58s trunk passed
          +1 compile 0m 32s trunk passed
          +1 checkstyle 0m 21s trunk passed
          +1 mvnsite 0m 36s trunk passed
          +1 mvneclipse 0m 16s trunk passed
          +1 findbugs 1m 3s trunk passed
          +1 javadoc 0m 20s trunk passed
          +1 mvninstall 0m 32s the patch passed
          +1 compile 0m 32s the patch passed
          +1 javac 0m 32s the patch passed
          -0 checkstyle 0m 20s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 115 unchanged - 0 fixed = 118 total (was 115)
          +1 mvnsite 0m 34s the patch passed
          +1 mvneclipse 0m 14s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 8s the patch passed
          +1 javadoc 0m 19s the patch passed
          -1 unit 39m 13s hadoop-yarn-server-resourcemanager in the patch failed.
          +1 asflicense 0m 17s The patch does not generate ASF License warnings.
          60m 46s



          Reason Tests
          Failed junit tests hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:a9ad5d6
          JIRA Issue YARN-6031
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846536/YARN-6031.003.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux c0da29707a81 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 9594c35
          Default Java 1.8.0_111
          findbugs v3.0.0
          checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/14621/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          unit https://builds.apache.org/job/PreCommit-YARN-Build/14621/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14621/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/14621/console
          Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          Ying Zhang Ying Zhang added a comment -

          Thanks Sunil G.
          Uploaded a new patch with a test case added.

          sunilg Sunil G added a comment -

          Since RMAppImpl handles the APP_REJECTED event and we can move the app from NEW to FAILED (via FINAL_SAVING), I think it's fine. Currently you are validating the resource request and doing an assert; I suggest raising a YarnException with a more meaningful message.
          For test cases, I suggest looking at TestWorkPreservingRMRestartForNodeLabel.
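
          As a rough illustration of the exception suggestion above, the assert could be replaced with something like the sketch below. This is only a sketch, not the actual patch: the helper name, parameters and message text are assumptions.

          // Sketch only: surface a descriptive error instead of asserting.
          // Assumed to be called during recovery when node labels are disabled;
          // the method name and parameters are hypothetical.
          private void checkRecoveredAMLabelExpression(ApplicationId appId,
              String labelExp) throws YarnException {
            if (labelExp != null && !labelExp.isEmpty()) {
              throw new YarnException("Failed to recover application " + appId
                  + ". NodeLabel is not enabled in cluster, but AM resource request"
                  + " contains a label expression " + labelExp + ".");
            }
          }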

          Ying Zhang Ying Zhang added a comment - - edited

          I've uploaded a new patch, YARN-6031.002.patch, with the suggested approach. Please kindly review and see if this is what we wanted. I'm working on adding a test case now. Thank you.
          Sunil G, for your comment below:

          I think if we can create the RMAppImpl object and push the transition into FAILED (with clear diagnostics), then the user/admin can easily know what happened (and can then take action to remove it from the state store or not). But not all apps may fail. For example:
          An app which was running and has no more outstanding requests (the AM container will be going to the default label). Old containers may belong to some specific label.
          An app which was running and has more outstanding requests to some labels (the AM container will be going to the default label)

          I'm not sure how to achieve this. With the current patch, all applications with a node label expression specified will be rejected and fail during recovery (including those which had already finished successfully before we disabled node labels and restarted the RM). Please share your thoughts if you think this should be improved.

          Ying Zhang Ying Zhang added a comment -

          Oh I see, thanks. Will update the patch soon.

          sunilg Sunil G added a comment -

          If we do not create the RMAppImpl object and start the app transitions, we will not have the app in the RM to track as failed. RMAppManager.createAndPopulateNewRMApp ensures that the app object is added to rmContext.getRMApps().

          Ying Zhang Ying Zhang added a comment - - edited

          Hi Wangda Tan, I have a quick question here: if we just want to send the APP_REJECTED message, why do we need to create the RMAppImpl anyway? It looks like we just need to catch the InvalidResourceRequest exception in recoverApplication() and then send the reject message. Or is it for printing a proper error message? That can also be done in recoverApplication().

          leftnoteasy Wangda Tan added a comment -

          Thanks Ying Zhang for updating the patch, and thanks for the comments from Daniel Templeton / Sunil G.

          I agree with the approach suggested by Sunil G. We can create the RMAppImpl when InvalidResourceRequest is thrown (for example, create it with a null ResourceRequest), and print a proper error message.

          After createAndPopulateNewRMApp, we can check if (amRequest == null) and (not unmanaged-am); if yes, send the APP_REJECTED message, just like the error handling when a credential parse error is found (see submitApplication).

          Thoughts?
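
          For illustration, a minimal sketch of that flow is below. It is not the actual patch: the accessor for the AM request, the diagnostics text, and the surrounding RMAppManager fields are assumptions.

          // Illustrative sketch of the suggested recovery-time handling.
          RMAppImpl application = createAndPopulateNewRMApp(
              submissionContext, submitTime, user, true /* isRecovery */, startTime);
          ResourceRequest amReq = application.getAMResourceRequest(); // assumed accessor
          if (amReq == null && !submissionContext.getUnmanagedAM()) {
            // Validation of the AM request failed (e.g. node labels disabled but the
            // stored request carries a label expression); reject the app instead of
            // failing the whole recovery, mirroring the credential-error handling.
            String diags = "Failed to recover application "
                + submissionContext.getApplicationId()
                + ": node labels are disabled but the AM resource request"
                + " contains a label expression.";
            this.rmContext.getDispatcher().getEventHandler().handle(new RMAppEvent(
                submissionContext.getApplicationId(), RMAppEventType.APP_REJECTED, diags));
          }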

          Ying Zhang Ying Zhang added a comment -

          It makes sense to me.

          If we skip and continue, then we will be losing the record of that app from the RM end (it will still be there in the store). We may be pushing detailed information to the logs, but the user/admin would have no reason to look at the logs since recovery went smoothly (discarding a few apps).

          Yes, that's true: those apps don't show up in the application list in the RM UI after recovery, with only information in the log saying what's going on.

          sunilg Sunil G added a comment -

          I gave it a second thought.

          If we skip and continue, then we will be losing the record of that app from the RM end (it will still be there in the store). We may be pushing detailed information to the logs, but the user/admin would have no reason to look at the logs since recovery went smoothly (discarding a few apps). If those apps are considered failed, then the user/admin can take some actions as mentioned by Daniel.

          I think if we can create the RMAppImpl object and push the transition into FAILED (with clear diagnostics), then the user/admin can easily know what happened (and can then take action to remove it from the state store or not). But not all apps may fail. For example:

          • An app which was running and has no more outstanding requests (the AM container will be going to the default label). Old containers may belong to some specific label.
          • An app which was running and has more outstanding requests to some labels (the AM container will be going to the default label)

          In the first case, the app will still continue running. In the second case, it's up to the app to decide when it gets an InvalidResourceRequestException in the allocate call.

          I feel we can consider accepting the app and then marking it as failed with a validation once the RMAppImpl transitions are started (this need not be done from the scheduler; we can do it in the RECOVER event if possible). Thoughts?

          bibinchundatt Bibin A Chundatt added a comment -

          Daniel Templeton

          so that when using -remove-application-from-state-store you know what you're purging.

          Another concern about removing the application from the store is that an already running AM will be left in a dormant state. But if we allow the application to recover and make sure it is killed from the scheduler, then the application will be killed with a reason.

          Currently, on the MR side, if the AM is running and requests a non-available label, the MR AM kills the application; the same could be implemented in all applications.

          Ying Zhang Ying Zhang added a comment - - edited

          Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels?

          Hi Daniel Templeton, the following error messages will be printed in the RM log:

          2016-12-28 01:00:22,694 WARN  resourcemanager.RMAppManager (RMAppManager.java:validateAndCreateResourceRequest(400)) - RM app submission failed in validating AM resource request for application application_xxxxxx
          org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
                  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
                  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
                  at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:396)
                  at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:341)
                  at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:321)
                  at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:439)
                  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
                  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
          ... ...
          2016-12-28 01:00:22,694 ERROR resourcemanager.RMAppManager (RMAppManager.java:recover(455)) - Failed to recover application application_xxxxxx
          

          The first error message is printed by the check that we fail in the first place; the second is printed by the code in the patch. I'm thinking this would be enough of a hint for the root cause.

          Ying Zhang Ying Zhang added a comment -

          So what's the next move? I'm a little confused. Are we going to address this issue with the general approach proposed by Daniel Templeton and Sunil G?

          Ying Zhang Ying Zhang added a comment -

          We could ignore/reset labels to the default in the ResourceRequest when node labels are disabled.

          Agree with Daniel: resetting might not be a good idea. The admin should be aware of the failure and take proper action.

          IIUC, ignoring validation on recovery should also work.

          I was thinking the same in the first place (see my comment at YARN-4465). Then I agreed with what Sunil G said, for the same reason as above: we should not hide the failure/wrong configuration.

          IMHO this should be acceptable, since any application submitted with labels when the feature is disabled gets rejected.

          Yes, you're right. I agree with the current way; I just wanted to clarify so that everyone is on the same page.

          templedf Daniel Templeton added a comment - - edited

          IIUC, ignoring validation on recovery should also work.

          Then you end up with unschedulable apps in the system, which can't be good.

          We could ignore/reset labels to the default in the ResourceRequest when node labels are disabled.

          The issue there is that the labels may have some important meaning to the job, so defaulting the labels may be bad. As applications can have side-effects, I think it's better to have the failure up front than let the application potentially fail somewhere down the line. The admin is then immediately made aware that he screwed up by disabling a feature that was still in use.

          bibinchundatt Bibin A Chundatt added a comment - - edited

          As Sunil G mentioned earlier, ignoring the application could leave a stale application in the state store.

          Ying Zhang, IIUC, ignoring validation on recovery should also work.

            private static void validateResourceRequest(ResourceRequest resReq,
                Resource maximumResource, QueueInfo queueInfo, RMContext rmContext)
                throws InvalidResourceRequestException {
              Configuration conf = rmContext.getYarnConfiguration();
              // If Node label is not enabled throw exception
              if (null != conf && !YarnConfiguration.areNodeLabelsEnabled(conf)) {
                String labelExp = resReq.getNodeLabelExpression();
                if (!(RMNodeLabelsManager.NO_LABEL.equals(labelExp)
                    || null == labelExp)) {
                  throw new InvalidLabelResourceRequestException(
                      "Invalid resource request, node label not enabled "
                          + "but request contains label expression");
                }
              }
          

          Thoughts?

          The current fact is (with or without this fix): an application submitted with a node label expression explicitly specified will fail during recovery

          IMHO this should be acceptable, since any application submitted with labels when the feature is disabled gets rejected.

          Solution 2:
          We could ignore/reset labels to the default in the ResourceRequest when node labels are disabled. I haven't looked at the impact of this.
          Elaborate testing would be needed to see how metrics are impacted. The disadvantage is that the client will never get to know that the reset happened on the RM side.

          YARN-4562 will try to handle ignoring the loading of label configuration when the feature is disabled.

          Daniel Templeton, I do agree that the admin would require some way to get application info when recovery fails, so that a bulk update of the state store is possible.

          templedf Daniel Templeton added a comment -

          I agree that -force-recovery could cause a significant information loss, but it's something that the admin has to do explicitly, and it's only application information, so it's not the end of the world. With a -dump-application-information option, the admin has the choice to either 1) look at each app that fails the recovery and decide whether to purge it or do something else (like turn node labels back on), or 2) do a bulk purge with -force-recovery.

          It might also be good to have another option, something like -dry-run-recovery, that would tell the admin the IDs of all the applications that will fail during recovery so that she doesn't have to keep doing them one at a time. In fact, I could even see making that the default behavior before failing the resource manager.

          In any case, I don't think the approach proposed in this JIRA, to just ignore the failed app, is going to work out.

          sunilg Sunil G added a comment -

          Yes, makes sense. This is more or less work for the admin then. I am not so sure whether the RM can make that call and remove it internally; it may be costly, as we are deleting a user record. But if the necessary information is pushed to the logs, then we may be good to remove the data internally. Thoughts?

          templedf Daniel Templeton added a comment -

          The max_applications limit may be hit and valid apps may get evicted at some point.

          Exactly my concern. So far, the answer has been that the recovery should just fail, and the admin should clear the app. See your comments in YARN-4401.

          Personally, I think that's a bad experience for the admin, but resolving it will require a bit of infrastructure work to make something useful happen. I think a -force-recovery option that purges the bad apps after dumping full info to the logs would be a good start. It would also help to have an option to dump the info about an app from the state store without having the RM running, so that when using -remove-application-from-state-store you know what you're purging.

          sunilg Sunil G added a comment -

          Yes, Daniel Templeton, you are correct. We will end up having many flaky apps in the state store. I also had an offline chat with Bibin A Chundatt, and there may be a potential problem with that too: the max_applications limit may be hit and valid apps may get evicted at some point.
          We can forcefully remove it from the state store; however, we may lose information about it since it is a failed app. With clear logging, we can evict such apps from the state store. I am not so sure whether we need to delete immediately or can put it on an async monitor to delete later. Thoughts?

          templedf Daniel Templeton added a comment -

          Yep, tests are needed. Love the long explanatory comment. Do you think we can make the log message a bit more explicit, i.e. say that the failure was because node labels have been disabled and point out the property that the admin should use to disable/enable node labels?

          Also, what happens to the app in the state store? If we fail to recover it and just ignore it, it will sit there forever, I suspect. It's probably a bad thing if the RM and state store don't agree on what apps are active.

          sunilg Sunil G added a comment - - edited

          Thanks Ying Zhang,

          The overall approach makes sense to me. You are basically acting on the label-disabled check etc. only when an exception is thrown. It's more or less the same as the earlier suggestion, so it's OK.

          However, a few more points. With this patch, the app recovery will now continue.

          • The app's AM resource request may not have any specific node label, but other containers may. Since we send InvalidResourceRequestException in those cases, it might be fine for now.
          • In the above case, what will happen to ongoing containers (w.r.t. the RM's data structures)? If it's a running app without any outstanding requests, we might need to account for the running containers under NO_LABEL. I am not sure whether this happens as of today. I'll also check.

          These could be out of scope for this ticket. However, let's check the opinion of other folks as well.

          Note: Please add more test cases to cover this patch.

          Ying Zhang Ying Zhang added a comment - - edited

          Uploaded a patch, which is based on Wangda Tan's comment on YARN-4465: swallow the InvalidResourceRequest exception when recovering, fail the recovery only for this application and print an error message, then let the rest of the recovery continue.
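
          Roughly, the shape of that change in RMAppManager#recover could look like the sketch below. This is illustrative only, not the actual YARN-6031 patch; the loop variables and log wording are assumptions.

          // Sketch: catch the validation failure per application so that the rest
          // of the recovery loop continues instead of aborting RM startup.
          for (ApplicationStateData appState : rmState.getApplicationState().values()) {
            ApplicationId appId =
                appState.getApplicationSubmissionContext().getApplicationId();
            try {
              recoverApplication(appState, rmState);
            } catch (InvalidLabelResourceRequestException e) {
              // Node labels were disabled after this app was submitted with a
              // label expression; fail only this app's recovery and move on.
              LOG.error("Failed to recover application " + appId, e);
            }
          }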

          Sunil G, your suggestion also makes sense to me. Actually, the code change using your approach would be made at the same place as in this patch, with a small modification: in recover(), inside the for loop, if the conditions are met, skip calling "recoverApplication" and log a message like "skip recover application ..." instead. The difference is that with this approach we'll always check for these conditions even though it might not be the normal case, while with the approach in the patch we just need to react when the exception happens. I'm OK with either approach since the overhead is not that big.

          Let's see what others think, Wangda Tan, Bibin A Chundatt.

          Just want to clarify. The current fact is (with or without this fix): an application submitted with a node label expression explicitly specified will fail during recovery, while an application submitted without a node label expression specified will succeed, no matter whether or not there is a default node label expression for the target queue. This is due to the following code snippet: the call to "checkQueueLabelInLabelManager", which checks whether the node label exists in the node label manager (the node label manager has no labels at all when node labels are disabled), is skipped during recovery:

          SchedulerUtils.java
            public static void normalizeAndValidateRequest(ResourceRequest resReq,
                Resource maximumResource, String queueName, YarnScheduler scheduler,
                boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
                throws InvalidResourceRequestException {
              ... ...
          
              SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
              if (!isRecovery) {
                validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);  // calling checkQueueLabelInLabelManager
              }
          

          This is not exactly the same as what happens when submitting a job in the normal case (i.e., not during recovery). In the normal case, when there is a default node label expression defined for the queue and node labels are disabled, the application will also get rejected due to an invalid resource request even if it doesn't specify a node label expression. I believe this will get fixed once YARN-4652 is addressed.

          sunilg Sunil G added a comment -

          Thanks Ying Zhang for raising this issue.

          With the help of the isRecovery flag, we can try to skip the recovery of the application if the conditions below are met:

          • Node Labels are disabled
          • The AM resource request has a node label expression associated with it

          In this scenario, we can skip the app's recovery by simply skipping it in the recovery loop (see the sketch below).

          Our target could be to continue recovery for other apps so that the RM comes up active. Does that make sense?
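
          For illustration, the skip could look roughly like the sketch below inside the recovery loop; the accessor names and log wording are assumptions rather than the eventual patch.

          // Illustrative only: inside the recovery loop (so isRecovery is implicitly
          // true here), skip apps whose AM request carries a label expression while
          // node labels are disabled. Accessor names are assumptions.
          ApplicationSubmissionContext ctx = appState.getApplicationSubmissionContext();
          ResourceRequest amReq = ctx.getAMContainerResourceRequest(); // assumed accessor
          String amLabelExp = (amReq == null) ? null : amReq.getNodeLabelExpression();
          if (!YarnConfiguration.areNodeLabelsEnabled(rmContext.getYarnConfiguration())
              && amLabelExp != null && !amLabelExp.isEmpty()) {
            LOG.warn("Skipping recovery of application " + ctx.getApplicationId()
                + " because node labels are disabled but its AM resource request"
                + " has label expression " + amLabelExp);
            continue; // move on to the next application to recover
          }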


            People

            • Assignee:
              Ying Zhang Ying Zhang
              Reporter:
              Ying Zhang Ying Zhang
            • Votes:
              0
              Watchers:
              13

              Dates

              • Created:
                Updated:
                Resolved:

                Development