Hadoop YARN › YARN-3813 Support Application timeout feature in YARN › YARN-6009

RM fails to start during an upgrade - Failed to load/recover state (YarnException: Invalid application timeout, value=0 for type=LIFETIME)

    Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: resourcemanager
    • Labels:
      None

      Description

      ResourceManager fails to start during an upgrade with the following exceptions -

      Exception 1:

      2016-12-09 14:57:23,508 INFO  capacity.CapacityScheduler (CapacityScheduler.java:initScheduler(328)) - Initialized CapacityScheduler with calculator=class org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, minimumAllocation=<<memory:256, vCores:1>>, maximumAllocation=<<memory:101376, vCores:64>>, asynchronousScheduling=false, asyncScheduleInterval=5ms
      2016-12-09 14:57:23,509 WARN  ha.ActiveStandbyElector (ActiveStandbyElector.java:becomeActive(863)) - Exception handling the winning of election
      org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
              at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:129)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
      Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:318)
              at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
              ... 4 more
      Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
              at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
              ... 5 more
      Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
              at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              ... 13 more
      

      Exception 2:

      2016-12-09 14:57:26,162 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(790)) - application_1477927786494_0008 State change from NEW to FINISHED
      2016-12-09 14:57:26,162 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(599)) - Failed to load/recover state
      org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
              at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
              at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
      2016-12-09 14:57:26,162 INFO  service.AbstractService (AbstractService.java:noteFailure(272)) - Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
      org.apache.hadoop.yarn.exceptions.YarnException: Invalid application timeout, value=0 for type=LIFETIME
              at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateApplicationTimeouts(RMServerUtils.java:305)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:365)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:330)
              at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:463)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1184)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:594)
              at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:991)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1032)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1028)
              at java.security.AccessController.doPrivileged(Native Method)
              at javax.security.auth.Subject.doAs(Subject.java:422)
              at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
              at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1028)
              at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:313)
              at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:127)
              at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:859)
              at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:463)
              at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:611)
              at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
      

      1. YARN-6009.01.patch
        2 kB
        Rohith Sharma K S

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11083 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11083/)
        YARN-6009. Skip validating app timeout value on recovery. Contributed by (jianhe: rev 020316458dfe6059b700f8d93a9791e4cb817b3f)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java
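The pattern behind the committed fix — validate application timeouts only on fresh submission, and skip the check on the recovery path — can be sketched roughly as below. This is a minimal illustration, not the actual RMAppManager code: the class, the `isRecovery` parameter, and the exception type are all invented for the sketch; only the error message mirrors the one in the logs.

```java
import java.util.EnumMap;
import java.util.Map;

// Minimal sketch: apps persisted before timeout validation existed may
// carry timeout=0, so rejecting them during recovery would keep the RM
// from ever starting. Validation therefore runs only for new submissions.
public class RecoverySkipSketch {

    enum ApplicationTimeoutType { LIFETIME }

    static void validateApplicationTimeouts(Map<ApplicationTimeoutType, Long> timeouts) {
        for (Map.Entry<ApplicationTimeoutType, Long> e : timeouts.entrySet()) {
            if (e.getValue() <= 0) {
                throw new IllegalArgumentException(
                    "Invalid application timeout, value=" + e.getValue()
                    + " for type=" + e.getKey());
            }
        }
    }

    static void createAndPopulateNewRMApp(Map<ApplicationTimeoutType, Long> timeouts,
                                          boolean isRecovery) {
        if (!isRecovery) {
            // Only newly submitted apps are validated.
            validateApplicationTimeouts(timeouts);
        }
        // ... continue populating the RMApp ...
    }

    public static void main(String[] args) {
        Map<ApplicationTimeoutType, Long> stored =
            new EnumMap<>(ApplicationTimeoutType.class);
        stored.put(ApplicationTimeoutType.LIFETIME, 0L);

        createAndPopulateNewRMApp(stored, true);      // recovery path: accepted
        System.out.println("recovered app with timeout=0 accepted");

        try {
            createAndPopulateNewRMApp(stored, false); // new submission: rejected
        } catch (IllegalArgumentException ex) {
            System.out.println("new submission rejected: " + ex.getMessage());
        }
    }
}
```

This matches the commit summary ("Skip validating app timeout value on recovery"): the recovered app is simply left unmonitored, as discussed in the comments below.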
        jianhe Jian He added a comment -

        Committed to trunk and branch-2, thanks Rohith,

        Thanks Daniel Templeton for the review !

templedf Daniel Templeton added a comment -

Thanks, Rohith Sharma K S.
        rohithsharma Rohith Sharma K S added a comment -

Clients were allowed to configure 0, but the server was ignoring it. Now, when a user configures any timeout, it has to be honored, or an error thrown if it is invalid. So validation fails, letting the user know that the configured value is invalid.

        templedf Daniel Templeton added a comment -

        Any idea why YARN-5611 made a 0 timeout illegal if the meaning of 0 timeout has not changed?

        jianhe Jian He added a comment -

Makes sense to me. I'll commit later today if there are no more comments.

        rohithsharma Rohith Sharma K S added a comment -

What is the impact of allowing an app to be recovered with a 0 timeout value?

No impact; the check is just bypassed so the service can come up.

Is that going to cause the recovered app to immediately expire?

Nothing happens to the app. The app will NOT be monitored for timeout. Moreover, an app with a 0 timeout was not monitored before the RM restart either.

Also, you may as well do the validation before the queue mapping.

Both queue mapping and the timeout check are pre-validation steps. The order should not matter.

        templedf Daniel Templeton added a comment -

        What is the impact of allowing an app to be recovered with a 0 timeout value? Looks like YARN-5611 changes the timeout to be absolute rather than relative. Is that going to cause the recovered app to immediately expire? Also, you may as well do the validation before the queue mapping.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 12s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 14m 37s trunk passed
        +1 compile 0m 36s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 40s trunk passed
        +1 mvneclipse 0m 18s trunk passed
        +1 findbugs 1m 8s trunk passed
        +1 javadoc 0m 25s trunk passed
        +1 mvninstall 0m 37s the patch passed
        +1 compile 0m 34s the patch passed
        +1 javac 0m 34s the patch passed
        +1 checkstyle 0m 20s the patch passed
        +1 mvnsite 0m 35s the patch passed
        +1 mvneclipse 0m 14s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 16s the patch passed
        +1 javadoc 0m 22s the patch passed
        -1 unit 42m 4s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        66m 4s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6009
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12845480/YARN-6009.01.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 0083599930d9 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / a0a2761
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14551/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14551/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14551/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        rohithsharma Rohith Sharma K S added a comment -

There is an open JIRA for the test failure, i.e. YARN-5548. The test failure is unrelated to this patch.

        jianhe Jian He added a comment -

Patch looks good to me. The UT failure passed locally for me.
Retrying Jenkins.

        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 14s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        -1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
        +1 mvninstall 14m 31s trunk passed
        +1 compile 0m 36s trunk passed
        +1 checkstyle 0m 23s trunk passed
        +1 mvnsite 0m 39s trunk passed
        +1 mvneclipse 0m 17s trunk passed
        +1 findbugs 1m 11s trunk passed
        +1 javadoc 0m 27s trunk passed
        +1 mvninstall 0m 45s the patch passed
        +1 compile 0m 36s the patch passed
        +1 javac 0m 36s the patch passed
        +1 checkstyle 0m 20s the patch passed
        +1 mvnsite 0m 35s the patch passed
        +1 mvneclipse 0m 15s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 16s the patch passed
        +1 javadoc 0m 19s the patch passed
        -1 unit 41m 56s hadoop-yarn-server-resourcemanager in the patch failed.
        +1 asflicense 0m 18s The patch does not generate ASF License warnings.
        66m 3s



        Reason Tests
        Failed junit tests hadoop.yarn.server.resourcemanager.TestRMRestart



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6009
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12845480/YARN-6009.01.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux de4c16810a9b 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / e49e0a6
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14546/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14546/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14546/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        rohithsharma Rohith Sharma K S added a comment -

Updated the patch to not validate timeout values during recovery.

templedf Daniel Templeton added a comment - edited

Gour Saha, ignoring an application that failed to recover should not be something that happens quietly by default. There are lots of scenarios where that behavior could cause problems. I agree, though, that it should be possible to start up nonetheless. YARN-6035 would give admins an explicit option to force a startup. Also see the discussion on YARN-6031.

        gsaha Gour Saha added a comment -

Rohith Sharma K S I understand that, but I am a little worried here. No matter what the issue with the state store of a particular app may be, it should not block the RM from starting. Note, this is not just limited to the lifetime property. We can log appropriate messages for the problematic apps (and maybe even update the app diagnostics) and move on with a graceful start of the RM. The app owners can later work on the individual problematic apps, but at least the cluster will be up and running, ready to serve new apps.

        rohithsharma Rohith Sharma K S added a comment -

More precisely,

1. Applications were allowed to be submitted with a zero timeout configured for any ApplicationTimeoutType. But now it is mandatory that the timeout value be greater than zero. This validation happens on the server side in RMServerUtils.validateApplicationTimeouts.
2. Earlier, the user-supplied timeout value was stored directly in the RMStateStore, say timeout=10 seconds. Now this value is changed to an absolute time (currentTimeInMillis + 10 seconds).

Point one is causing the failure during the upgrade from YARN-4205.
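The two behavior changes described above can be sketched as follows. This is an illustrative sketch, not the actual Hadoop code: the class and method names here are invented, and only the error message mirrors the one thrown by RMServerUtils.validateApplicationTimeouts.

```java
// Sketch of the two changes: (1) a submitted timeout must now be > 0,
// and (2) the value persisted in the state store is an absolute expiry
// time rather than the raw relative duration.
public class TimeoutChangeSketch {

    // (1) Server-side check: zero (or negative) is now invalid on submission.
    static void validateTimeout(long timeoutSecs) {
        if (timeoutSecs <= 0) {
            throw new IllegalArgumentException(
                "Invalid application timeout, value=" + timeoutSecs
                + " for type=LIFETIME");
        }
    }

    // (2) What gets persisted: currentTimeInMillis + timeout, not the duration.
    static long toStoredExpiry(long nowMillis, long timeoutSecs) {
        return nowMillis + timeoutSecs * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        validateTimeout(10);                                   // accepted
        // Stored value is absolute; the delta back to "now" is 10s in millis.
        System.out.println("expiry delta = " + (toStoredExpiry(now, 10) - now));
        try {
            validateTimeout(0);                                // rejected post-YARN-5611
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

An app persisted by a YARN-4205 cluster with timeout=0 trips check (1) when it is replayed through the submission path on recovery, which is exactly the startup failure in the description.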

        jianhe Jian He added a comment -

        Rohith Sharma K S, could you clarify which code logic changed?

        rohithsharma Rohith Sharma K S added a comment -

Gour Saha
This is mainly because the cluster was initially running the YARN-4205 patch, and is now being upgraded to a cluster with the other timeout patches.

But in the later timeout patch, I think YARN-5611, the design of the timeout value stored in the RMStateStore and the validation during application submission both changed.
This is causing the recovery failure for the upgrade from the YARN-4205 patch. So any upgrade from the YARN-4205 patch to the latest does not work properly.

With a YARN-4205+YARN-5611 cluster, this issue does not appear.

To move forward and bring your cluster up, this application needs to be deleted from the state store using the CLI ./yarn resourcemanager -remove-application-from-state-store $appId so that the RM service can come up.

gsaha Gour Saha added a comment -

/cc Rohith Sharma K S

          People

          • Assignee:
            rohithsharma Rohith Sharma K S
            Reporter:
            gsaha Gour Saha
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development