Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9.0, 3.0.0-alpha2
    • Component/s: None
    • Labels:
      None

      Description

      YARN-5620 added support for re-initialization of Containers using a new launch Context.
      This JIRA proposes to use the above feature to support upgrade and subsequent rollback or commit of the upgrade.

      1. YARN-5637.007.patch
        36 kB
        Arun Suresh
      2. YARN-5637.006.patch
        39 kB
        Arun Suresh
      3. YARN-5637.005.patch
        36 kB
        Arun Suresh
      4. YARN-5637.004.patch
        34 kB
        Arun Suresh
      5. YARN-5637.003.patch
        34 kB
        Arun Suresh
      6. YARN-5637.002.patch
        34 kB
        Arun Suresh
      7. YARN-5637.001.patch
        36 kB
        Arun Suresh

        Activity

        Hide
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10458 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10458/)
        YARN-5637. Changes in NodeManager to support Container rollback and (arun suresh: rev 3552c2b99dff4f21489ff284f9dcba40e897a1e5)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerReInitEvent.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerEventType.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerManagerWithLCE.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/MockContainer.java
        Show
        hudson Hudson added a comment - SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10458 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10458/ ) YARN-5637 . Changes in NodeManager to support Container rollback and (arun suresh: rev 3552c2b99dff4f21489ff284f9dcba40e897a1e5) (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerReInitEvent.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerEventType.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestContainerManager.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerManagerWithLCE.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/MockContainer.java
        Hide
        asuresh Arun Suresh added a comment -

        Committed this to trunk and branch-2. Thanks Jian He and Varun Vasudev for the reviews...

        Show
        asuresh Arun Suresh added a comment - Committed this to trunk and branch-2. Thanks Jian He and Varun Vasudev for the reviews...
        Hide
        vvasudev Varun Vasudev added a comment -

        Makes sense. +1.

        Show
        vvasudev Varun Vasudev added a comment - Makes sense. +1.
        Hide
        asuresh Arun Suresh added a comment - - edited

        Thanks for the review Varun Vasudev.
        We don't really delete the new ResourceSet when we rollback. To clarify the scenario you mentioned, As part of the upgrade a rogue jar is symlinked into the Container's working directory and added to its CLASSPATH env variable.
        I see 2 cases:

        1. The rogue jar's symlink has over-written the earlier version's symlink. In this case, on rollback, the symlink will be re over-written with the old jar and the relaunched old process will see only the old jar.
        2. The rogue jar's is a new symlink in the container's working dir. In this case, on rollback, the symlink will still remain in the working dir and still point to the old jar. But since the CLASSPATH will also be reverted to the old CLASSPATH (since it's part of the launch context) which does not contain this new jar... I am guessing it should be fine.

        But yeah, we can maybe add more sophistication to actually delete / unlink the new new resources from the working directory on rollback. Let's handle that, as you suggested as part of another JIRA, and lets continue with the current approach as a first cut.

        Hope this made sense. If so, I will go ahead and commit this.

        Show
        asuresh Arun Suresh added a comment - - edited Thanks for the review Varun Vasudev . We don't really delete the new ResourceSet when we rollback. To clarify the scenario you mentioned, As part of the upgrade a rogue jar is symlinked into the Container's working directory and added to its CLASSPATH env variable. I see 2 cases: The rogue jar's symlink has over-written the earlier version's symlink. In this case, on rollback, the symlink will be re over-written with the old jar and the relaunched old process will see only the old jar. The rogue jar's is a new symlink in the container's working dir. In this case, on rollback, the symlink will still remain in the working dir and still point to the old jar. But since the CLASSPATH will also be reverted to the old CLASSPATH (since it's part of the launch context) which does not contain this new jar... I am guessing it should be fine. But yeah, we can maybe add more sophistication to actually delete / unlink the new new resources from the working directory on rollback. Let's handle that, as you suggested as part of another JIRA, and lets continue with the current approach as a first cut. Hope this made sense. If so, I will go ahead and commit this.
        Hide
        vvasudev Varun Vasudev added a comment -

        Thanks for the patch Arun Suresh! My apologies for the delay in responding. The patch looks mostly good to me. The only question I have to around deleting resources as part of the rollback. Let's say you download a new jar as part of the upgrade. When you do a rollback, this jar will be part of the classpath and this could result in the rollback failing(due to class/function not found issues). Is my understanding wrong?

        I wouldn't hold up the patch for fixing this - we should probably do it in a different JIRA.

        Show
        vvasudev Varun Vasudev added a comment - Thanks for the patch Arun Suresh ! My apologies for the delay in responding. The patch looks mostly good to me. The only question I have to around deleting resources as part of the rollback. Let's say you download a new jar as part of the upgrade. When you do a rollback, this jar will be part of the classpath and this could result in the rollback failing(due to class/function not found issues). Is my understanding wrong? I wouldn't hold up the patch for fixing this - we should probably do it in a different JIRA.
        Hide
        asuresh Arun Suresh added a comment -

        The lone checkstyle warning wont be fixed since it conflicts with existing indentation.
        Based on Jian He's lgtm, I plan to commit this in a day or so.. unless folks have some objection (cc: Varun Vasudev..)

        Show
        asuresh Arun Suresh added a comment - The lone checkstyle warning wont be fixed since it conflicts with existing indentation. Based on Jian He 's lgtm, I plan to commit this in a day or so.. unless folks have some objection (cc: Varun Vasudev ..)
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 17s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 6m 41s trunk passed
        +1 compile 0m 26s trunk passed
        +1 checkstyle 0m 19s trunk passed
        +1 mvnsite 0m 27s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 40s trunk passed
        +1 javadoc 0m 17s trunk passed
        +1 mvninstall 0m 22s the patch passed
        +1 compile 0m 24s the patch passed
        +1 javac 0m 24s the patch passed
        -1 checkstyle 0m 16s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 1 new + 267 unchanged - 0 fixed = 268 total (was 267)
        +1 mvnsite 0m 24s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 46s the patch passed
        +1 javadoc 0m 14s the patch passed
        +1 unit 14m 31s hadoop-yarn-server-nodemanager in the patch passed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        27m 21s



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828988/YARN-5637.007.patch
        JIRA Issue YARN-5637
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux e6290b6f133d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / f67237c
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13137/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13137/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13137/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 17s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 3 new or modified test files. +1 mvninstall 6m 41s trunk passed +1 compile 0m 26s trunk passed +1 checkstyle 0m 19s trunk passed +1 mvnsite 0m 27s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 40s trunk passed +1 javadoc 0m 17s trunk passed +1 mvninstall 0m 22s the patch passed +1 compile 0m 24s the patch passed +1 javac 0m 24s the patch passed -1 checkstyle 0m 16s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 1 new + 267 unchanged - 0 fixed = 268 total (was 267) +1 mvnsite 0m 24s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 46s the patch passed +1 javadoc 0m 14s the patch passed +1 unit 14m 31s hadoop-yarn-server-nodemanager in the patch passed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 27m 21s Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828988/YARN-5637.007.patch JIRA Issue YARN-5637 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux e6290b6f133d 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / f67237c Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13137/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13137/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13137/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
        Hide
        asuresh Arun Suresh added a comment -

        Thanks Jian He,
        Uploading patch with ur suggestion and some fixes to deal with the failing TestNodeStatusUpdater and TestAuxService.
        The TestDefaultContainerExecutor test is handled by YARN-5657.

        Show
        asuresh Arun Suresh added a comment - Thanks Jian He , Uploading patch with ur suggestion and some fixes to deal with the failing TestNodeStatusUpdater and TestAuxService. The TestDefaultContainerExecutor test is handled by YARN-5657 .
        Hide
        jianhe Jian He added a comment -

        lgtm, only comment is
        This

              if (oldLaunchContext != null) {
                this.oldLaunchContext = oldLaunchContext;
                this.oldResourceSet = oldResourceSet;
              } else {
                this.oldLaunchContext = null;
                this.oldResourceSet = null;
              }
        

        seems equivalent to

                this.oldLaunchContext = oldLaunchContext;
                this.oldResourceSet = oldResourceSet;
        
        Show
        jianhe Jian He added a comment - lgtm, only comment is This if (oldLaunchContext != null ) { this .oldLaunchContext = oldLaunchContext; this .oldResourceSet = oldResourceSet; } else { this .oldLaunchContext = null ; this .oldResourceSet = null ; } seems equivalent to this .oldLaunchContext = oldLaunchContext; this .oldResourceSet = oldResourceSet;
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 13s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 7m 53s trunk passed
        +1 compile 0m 29s trunk passed
        +1 checkstyle 0m 20s trunk passed
        +1 mvnsite 0m 28s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 46s trunk passed
        +1 javadoc 0m 17s trunk passed
        +1 mvninstall 0m 24s the patch passed
        +1 compile 0m 24s the patch passed
        +1 javac 0m 24s the patch passed
        -1 checkstyle 0m 17s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267)
        +1 mvnsite 0m 27s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 49s the patch passed
        +1 javadoc 0m 15s the patch passed
        -1 unit 14m 44s hadoop-yarn-server-nodemanager in the patch failed.
        +1 asflicense 0m 17s The patch does not generate ASF License warnings.
        29m 7s



        Reason Tests
        Failed junit tests hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor
          hadoop.yarn.server.nodemanager.containermanager.TestAuxServices
          hadoop.yarn.server.nodemanager.TestNodeStatusUpdater



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828934/YARN-5637.006.patch
        JIRA Issue YARN-5637
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 483320417e5f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 8a40953
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13134/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13134/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 13s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 3 new or modified test files. +1 mvninstall 7m 53s trunk passed +1 compile 0m 29s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 28s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 46s trunk passed +1 javadoc 0m 17s trunk passed +1 mvninstall 0m 24s the patch passed +1 compile 0m 24s the patch passed +1 javac 0m 24s the patch passed -1 checkstyle 0m 17s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267) +1 mvnsite 0m 27s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 49s the patch passed +1 javadoc 0m 15s the patch passed -1 unit 14m 44s hadoop-yarn-server-nodemanager in the patch failed. +1 asflicense 0m 17s The patch does not generate ASF License warnings. 29m 7s Reason Tests Failed junit tests hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor   hadoop.yarn.server.nodemanager.containermanager.TestAuxServices   hadoop.yarn.server.nodemanager.TestNodeStatusUpdater Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828934/YARN-5637.006.patch JIRA Issue YARN-5637 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 483320417e5f 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 8a40953 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13134/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13134/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13134/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
        Hide
        asuresh Arun Suresh added a comment -

        Simplifying patch based on some offline discussions with Jian He

        Essentially, the behavior now is:
        If the new process dies after launch,

        • If autoCommit is true, it will simply follow the rules specified by the ContainerRetryContext found in the new launch context (terminate immediately, restart, restart infinitely etc.)
        • If autoCommit is false, It will do exactly as above but at the end of the retry sequence, instead of terminating, it will ReInitialize back to the old launch context.

        If the new process startsup fine:

        • If autoCommit is false, the AM has the option of either
          • Committing the container if it feels the re-initialization is successful. The rollback context is deleted.
          • Explicitly Rollback the container to previous launch context if it feels there is something wrong
          • Not do anything. The rollback context will hang around until another re-init is called.
        Show
        asuresh Arun Suresh added a comment - Simplifying patch based on some offline discussions with Jian He Essentially, the behavior now is: If the new process dies after launch, If autoCommit is true , it will simply follow the rules specified by the ContainerRetryContext found in the new launch context (terminate immediately, restart, restart infinitely etc.) If autoCommit is false , It will do exactly as above but at the end of the retry sequence, instead of terminating, it will ReInitialize back to the old launch context. If the new process startsup fine: If autoCommit is false , the AM has the option of either Committing the container if it feels the re-initialization is successful. The rollback context is deleted. Explicitly Rollback the container to previous launch context if it feels there is something wrong Not do anything. The rollback context will hang around until another re-init is called.
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 20s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 8m 5s trunk passed
        +1 compile 0m 29s trunk passed
        +1 checkstyle 0m 19s trunk passed
        +1 mvnsite 0m 31s trunk passed
        +1 mvneclipse 0m 14s trunk passed
        +1 findbugs 0m 47s trunk passed
        +1 javadoc 0m 17s trunk passed
        +1 mvninstall 0m 28s the patch passed
        +1 compile 0m 26s the patch passed
        +1 javac 0m 26s the patch passed
        -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 8 new + 267 unchanged - 0 fixed = 275 total (was 267)
        +1 mvnsite 0m 30s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 1m 1s the patch passed
        +1 javadoc 0m 18s the patch passed
        -1 unit 15m 36s hadoop-yarn-server-nodemanager in the patch failed.
        +1 asflicense 0m 16s The patch does not generate ASF License warnings.
        30m 49s



        Reason Tests
        Failed junit tests hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager
          hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828684/YARN-5637.005.patch
        JIRA Issue YARN-5637
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 3f446ba56641 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 7cad7b7
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13115/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13115/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 20s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 3 new or modified test files. +1 mvninstall 8m 5s trunk passed +1 compile 0m 29s trunk passed +1 checkstyle 0m 19s trunk passed +1 mvnsite 0m 31s trunk passed +1 mvneclipse 0m 14s trunk passed +1 findbugs 0m 47s trunk passed +1 javadoc 0m 17s trunk passed +1 mvninstall 0m 28s the patch passed +1 compile 0m 26s the patch passed +1 javac 0m 26s the patch passed -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 8 new + 267 unchanged - 0 fixed = 275 total (was 267) +1 mvnsite 0m 30s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 1m 1s the patch passed +1 javadoc 0m 18s the patch passed -1 unit 15m 36s hadoop-yarn-server-nodemanager in the patch failed. +1 asflicense 0m 16s The patch does not generate ASF License warnings. 30m 49s Reason Tests Failed junit tests hadoop.yarn.server.nodemanager.containermanager.queuing.TestQueuingContainerManager   hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828684/YARN-5637.005.patch JIRA Issue YARN-5637 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux 3f446ba56641 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 7cad7b7 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13115/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13115/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13115/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
        Hide
        asuresh Arun Suresh added a comment -

        Fixing checkstyle, javadocs and tests

        Show
        asuresh Arun Suresh added a comment - Fixing checkstyle, javadocs and tests
        Hide
        hadoopqa Hadoop QA added a comment -
        -1 overall



        Vote Subsystem Runtime Comment
        0 reexec 0m 18s Docker mode activated.
        +1 @author 0m 0s The patch does not contain any @author tags.
        +1 test4tests 0m 0s The patch appears to include 3 new or modified test files.
        +1 mvninstall 7m 14s trunk passed
        +1 compile 0m 26s trunk passed
        +1 checkstyle 0m 20s trunk passed
        +1 mvnsite 0m 27s trunk passed
        +1 mvneclipse 0m 13s trunk passed
        +1 findbugs 0m 44s trunk passed
        +1 javadoc 0m 18s trunk passed
        +1 mvninstall 0m 22s the patch passed
        +1 compile 0m 25s the patch passed
        +1 javac 0m 25s the patch passed
        -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267)
        +1 mvnsite 0m 26s the patch passed
        +1 mvneclipse 0m 10s the patch passed
        +1 whitespace 0m 0s The patch has no whitespace issues.
        +1 findbugs 0m 48s the patch passed
        -1 javadoc 0m 16s hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 5 new + 240 unchanged - 0 fixed = 245 total (was 240)
        -1 unit 14m 42s hadoop-yarn-server-nodemanager in the patch failed.
        +1 asflicense 0m 15s The patch does not generate ASF License warnings.
        28m 25s



        Reason Tests
        Failed junit tests hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor
          hadoop.yarn.server.nodemanager.TestContainerManagerWithLCE



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:9560f25
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828663/YARN-5637.004.patch
        JIRA Issue YARN-5637
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux f7d665720cf6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 7cad7b7
        Default Java 1.8.0_101
        findbugs v3.0.0
        checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        javadoc https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/diff-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13113/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/13113/console
        Powered by Apache Yetus 0.3.0 http://yetus.apache.org

        This message was automatically generated.

        Show
        hadoopqa Hadoop QA added a comment - -1 overall Vote Subsystem Runtime Comment 0 reexec 0m 18s Docker mode activated. +1 @author 0m 0s The patch does not contain any @author tags. +1 test4tests 0m 0s The patch appears to include 3 new or modified test files. +1 mvninstall 7m 14s trunk passed +1 compile 0m 26s trunk passed +1 checkstyle 0m 20s trunk passed +1 mvnsite 0m 27s trunk passed +1 mvneclipse 0m 13s trunk passed +1 findbugs 0m 44s trunk passed +1 javadoc 0m 18s trunk passed +1 mvninstall 0m 22s the patch passed +1 compile 0m 25s the patch passed +1 javac 0m 25s the patch passed -1 checkstyle 0m 18s hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267) +1 mvnsite 0m 26s the patch passed +1 mvneclipse 0m 10s the patch passed +1 whitespace 0m 0s The patch has no whitespace issues. +1 findbugs 0m 48s the patch passed -1 javadoc 0m 16s hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 5 new + 240 unchanged - 0 fixed = 245 total (was 240) -1 unit 14m 42s hadoop-yarn-server-nodemanager in the patch failed. +1 asflicense 0m 15s The patch does not generate ASF License warnings. 28m 25s Reason Tests Failed junit tests hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor   hadoop.yarn.server.nodemanager.TestContainerManagerWithLCE Subsystem Report/Notes Docker Image:yetus/hadoop:9560f25 JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12828663/YARN-5637.004.patch JIRA Issue YARN-5637 Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle uname Linux f7d665720cf6 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux Build tool maven Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh git revision trunk / 7cad7b7 Default Java 1.8.0_101 findbugs v3.0.0 checkstyle https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt javadoc https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/diff-javadoc-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt unit test logs https://builds.apache.org/job/PreCommit-YARN-Build/13113/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt Test Results https://builds.apache.org/job/PreCommit-YARN-Build/13113/testReport/ modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager Console output https://builds.apache.org/job/PreCommit-YARN-Build/13113/console Powered by Apache Yetus 0.3.0 http://yetus.apache.org This message was automatically generated.
        Hide
        asuresh Arun Suresh added a comment -

        Rebasing patch with latest YARN-5620

        Show
        asuresh Arun Suresh added a comment - Rebasing patch with latest YARN-5620
        Hide
        asuresh Arun Suresh added a comment -

        Updating patch.. thanks again for the review Jian He..

        Here, we could make reInitEvent.getResourceSet() be merged with existing resourceSet.localizedResource upfront, so that both oldResourceSet and newResourceSet contain full copy of resources, rather than delta.

        This was actually intentional. Consider the case where the original process has many resources to Localize but the upgrade launch script just needs a binary change in addition to the existing resources. If the resourceSets were merged upfront, then in the ReInitializeContainerTransition, the ContainerLocalizationRequestEvent that gets sent would include ALL the resources, instead of just the single resource. The Container will have to remain in the REINITIALIZING state till it receives RESOURCE_LOCALIZED events for all the resources in the combined resultset before being able to launch.

        the container.reInitContext!= null check is not needed..

        Think we do, else it might cause an NPE when the LaunchTransition happens as part of the initial container startup.

        I found the resourceSet is also not updated when rollback in RetryFailureTransition

        Good catch... I also like your refactoring.. i've incorporated it in the latest patch

        Show
        asuresh Arun Suresh added a comment - Updating patch.. thanks again for the review Jian He .. Here, we could make reInitEvent.getResourceSet() be merged with existing resourceSet.localizedResource upfront, so that both oldResourceSet and newResourceSet contain full copy of resources, rather than delta. This was actually intentional. Consider the case where the original process has many resources to Localize but the upgrade launch script just needs a binary change in addition to the existing resources. If the resourceSets were merged upfront, then in the ReInitializeContainerTransition , the ContainerLocalizationRequestEvent that gets sent would include ALL the resources, instead of just the single resource. The Container will have to remain in the REINITIALIZING state till it receives RESOURCE_LOCALIZED events for all the resources in the combined resultset before being able to launch. the container.reInitContext!= null check is not needed.. Think we do, else it might cause an NPE when the LaunchTransition happens as part of the initial container startup. I found the resourceSet is also not updated when rollback in RetryFailureTransition Good catch... I also like your refactoring.. i've incorporated it in the latest patch
        Hide
        jianhe Jian He added a comment -

        Thanks Arun, some more comments:

        • Here, we could make reInitEvent.getResourceSet() be merged with existing resourceSet.localizedResource upfront, so that both oldResourceSet and newResourceSet contain full copy of resources, rather than delta. Doing this, the logic of container.resourceSet = container.reInitContext.mergedResourceSet(); will not needed. We can simply set it with container.resourceSet = reInitContext.newResoureSet, similar to what’s being done for container.launchContext = reInitContext.newLaunchContext
          return new ReInitializationContext(reInitEvent.getReInitLaunchContext(),
              reInitEvent.getResourceSet(), container.getLaunchContext(),
              container.resourceSet, reInitEvent.getRetryFailureContext(), 
              reInitEvent.isAutoCommit());
          
          
        • nit: the container.reInitContext!= null check is not needed.
          if (container.reInitContext != null 
              && container.reInitContext.autoCommit) {
          
        • I found the resourceSet is also not updated when rollback in RetryFailureTransition, I also tried some refactoring, may be something like below:
                ContainerRetryContext retryContext = container.containerRetryContext;
                int remainingAttempts = container.remainingRetryAttempts;
                if (container.reInitContext != null) {
                  retryContext = container.reInitContext.retryOnFailueContext;
                  remainingAttempts = container.reInitContext.retryAttemptsRemaining;
                }
          
                if (shouldRetry(container.exitCode, retryContext,remainingAttempts)) {
                  // TODO state-store operation
                  doRelaunch(container, container.remainingRetryAttempts,
                      container.containerRetryContext.getRetryInterval());
                } else if (container.canRollback()) {
                  // rollback
                  container.reInitContext = new ReInitializationContext(
                      container.reInitContext.oldLaunchContext,
                      container.reInitContext.oldResourceSet, null, null,
                      container.containerRetryContext, true);
                  new KilledExternallyForReInitTransition().transition(container, event);
                } else {
                  // fail
                  new ExitedWithFailureTransition(true).transition(container, event);
                  return ContainerState.EXITED_WITH_FAILURE;
                }
              }
          
            public static boolean shouldRetry(int errorCode,
                ContainerRetryContext retryContext, int remainingRetryAttempts) {
              if (retryContext == null) {
                return false;
              }
            .....
          
        • testContainerUpgradeRollbackDueToFailure: comment does not match code
              // Wait for new processStartfile to be created
              while (!oldStartFile.exists() && timeoutSecs++ < 20) {
          
        Show
        jianhe Jian He added a comment - Thanks Arun, some more comments: Here, we could make reInitEvent.getResourceSet() be merged with existing resourceSet.localizedResource upfront, so that both oldResourceSet and newResourceSet contain full copy of resources, rather than delta. Doing this, the logic of container.resourceSet = container.reInitContext.mergedResourceSet(); will not needed. We can simply set it with container.resourceSet = reInitContext.newResoureSet , similar to what’s being done for container.launchContext = reInitContext.newLaunchContext return new ReInitializationContext(reInitEvent.getReInitLaunchContext(),     reInitEvent.getResourceSet(), container.getLaunchContext(),     container.resourceSet, reInitEvent.getRetryFailureContext(),     reInitEvent.isAutoCommit()); nit: the container.reInitContext!= null check is not needed. if (container.reInitContext != null     && container.reInitContext.autoCommit) { I found the resourceSet is also not updated when rollback in RetryFailureTransition, I also tried some refactoring, may be something like below: ContainerRetryContext retryContext = container.containerRetryContext; int remainingAttempts = container.remainingRetryAttempts; if (container.reInitContext != null ) { retryContext = container.reInitContext.retryOnFailueContext; remainingAttempts = container.reInitContext.retryAttemptsRemaining; } if (shouldRetry(container.exitCode, retryContext,remainingAttempts)) { // TODO state-store operation doRelaunch(container, container.remainingRetryAttempts, container.containerRetryContext.getRetryInterval()); } else if (container.canRollback()) { // rollback container.reInitContext = new ReInitializationContext( container.reInitContext.oldLaunchContext, container.reInitContext.oldResourceSet, null , null , container.containerRetryContext, true ); new KilledExternallyForReInitTransition().transition(container, event); } else { // fail new ExitedWithFailureTransition( true ).transition(container, event); return ContainerState.EXITED_WITH_FAILURE; } } public static boolean shouldRetry( int errorCode, ContainerRetryContext retryContext, int remainingRetryAttempts) { if (retryContext == null ) { return false ; } ..... testContainerUpgradeRollbackDueToFailure: comment does not match code // Wait for new processStartfile to be created while (!oldStartFile.exists() && timeoutSecs++ < 20) {
        Hide
        asuresh Arun Suresh added a comment - - edited

        Updating patch based on Jian He's suggesting and rebasing with latest YARN-5620 patch.

        Do we need to do something for this condition ? else, it can be removed,

        Yeah.. can be removed.. I had put that there to remind me of something.. forgot to remove it

        In RollbackContainerTransition: the container.getResourceSet() will return all resources including current and previous version. We should re-request only the previous version's resources, rather than the union of both?

        In the latest patch, the resourceSet is reverted to previous state as well.

        I still have question on the commit API, how does AM use this API in practice ?

        Commit is just a way for the AM to tell the NM that it is fine with the upgrade (after it performs some upgrade diagnostics check on the container perhaps) and the container is working as it should be.. After the AM does a commit, the container cannot be rolledback and any bookkeeping required to rollback (the reInitContext for eg.) can is deleted by the NM.

        Prior to a commit, if the upgraded Container fails, NM can choose to automatically rollback. After the AM issues a commit, NM will not be able to rollback.

        Of course the AM is still free to call 'upgrade' again, with an old launch context.

        By default, autoCommit is 'true' which means, as soon as the container is upgraded, it is also committed.

        ..one implication for this API is that we'll have to persiste the commit state for NM recovery later on.

        Yes.. we would.. I plan to open a JIRA to address NMStateStore changes for this as well as YARN-5620

        Also, should the rollback API be always be able to rollback ?

        Once Commit has been called, you cannot rollback. The AM would have to explicitly call the upgrade API again with the previous launchContext.

        ContainerLaunchContext already has the ContainerRetryContext ? can we reuse that retryContext?

        I wanted to distinguish between the retry policy used to retry a failed container and the policy used to decide failure retries during upgrades. It is possible both can be the same. I just put that argument there in the upgrade() API to make it explicit.

        The ContainerImpl#ContainerRetryContext is not updated to new value on upgrade.

        This is fixed in the latest YARN-5620 patch

        RetryFailureTranstion: it's a bit complicated.. is it possible to simplify it something like below:

        I refactored it a bit.. let me know if its ok.

        Show
        asuresh Arun Suresh added a comment - - edited Updating patch based on Jian He 's suggesting and rebasing with latest YARN-5620 patch. Do we need to do something for this condition ? else, it can be removed, Yeah.. can be removed.. I had put that there to remind me of something.. forgot to remove it In RollbackContainerTransition: the container.getResourceSet() will return all resources including current and previous version. We should re-request only the previous version's resources, rather than the union of both? In the latest patch, the resourceSet is reverted to previous state as well. I still have question on the commit API, how does AM use this API in practice ? Commit is just a way for the AM to tell the NM that it is fine with the upgrade (after it performs some upgrade diagnostics check on the container perhaps) and the container is working as it should be.. After the AM does a commit, the container cannot be rolledback and any bookkeeping required to rollback (the reInitContext for eg.) can is deleted by the NM. Prior to a commit, if the upgraded Container fails, NM can choose to automatically rollback. After the AM issues a commit, NM will not be able to rollback. Of course the AM is still free to call 'upgrade' again, with an old launch context. By default, autoCommit is 'true' which means, as soon as the container is upgraded, it is also committed. ..one implication for this API is that we'll have to persiste the commit state for NM recovery later on. Yes.. we would.. I plan to open a JIRA to address NMStateStore changes for this as well as YARN-5620 Also, should the rollback API be always be able to rollback ? Once Commit has been called, you cannot rollback. The AM would have to explicitly call the upgrade API again with the previous launchContext. ContainerLaunchContext already has the ContainerRetryContext ? can we reuse that retryContext? I wanted to distinguish between the retry policy used to retry a failed container and the policy used to decide failure retries during upgrades. It is possible both can be the same. I just put that argument there in the upgrade() API to make it explicit. The ContainerImpl#ContainerRetryContext is not updated to new value on upgrade. This is fixed in the latest YARN-5620 patch RetryFailureTranstion: it's a bit complicated.. is it possible to simplify it something like below: I refactored it a bit.. let me know if its ok.
        Hide
        jianhe Jian He added a comment -

        Thanks Arun, some comments:

        • Do we need to do something for this condition ? else, it can be removed,
                if (container.reInitContext == null) {
          
                }
          
        • In RollbackContainerTransition: the container.getResourceSet() will return all resources including current and previous version. We should re-request only the previous version's resources, rather than the union of both? we probably need to maintain current and previous versions of (ContainerLaunchContext + ResourceSet) ?
              @Override
              protected ResourceSet getReInitResourceSet(ContainerImpl
                  container, ContainerEvent event) {
                return container.getResourceSet();
              }
          
        • I still have question on the commit API, how does AM use this API in practice ? one implication for this API is that we'll have to persiste the commit state for NM recovery later on.
        • Also, should the rollback API be always be able to rollback ?
        • ContainerLaunchContext already has the ContainerRetryContext ? can we reuse that retryContext?
            public void upgradeContainer(ContainerId containerId,
                ContainerLaunchContext upgradeLaunchContext, boolean autoCommit,
                ContainerRetryContext retryFailureContext) throws YarnException {
          
        • The ContainerImpl#ContainerRetryContext is not updated to new value on upgrade.
        • RetryFailureTranstion: it's a bit complicated.. is it possible to simplify it something like below:
          if (shouldRetry(exitCode, retryContext)) {
          	
          } else if (shouldRollback) {
          	
          } else {
          	// exit
          }
          
        Show
        jianhe Jian He added a comment - Thanks Arun, some comments: Do we need to do something for this condition ? else, it can be removed, if (container.reInitContext == null ) { } In RollbackContainerTransition: the container.getResourceSet() will return all resources including current and previous version. We should re-request only the previous version's resources, rather than the union of both? we probably need to maintain current and previous versions of (ContainerLaunchContext + ResourceSet) ? @Override protected ResourceSet getReInitResourceSet(ContainerImpl container, ContainerEvent event) { return container.getResourceSet(); } I still have question on the commit API, how does AM use this API in practice ? one implication for this API is that we'll have to persiste the commit state for NM recovery later on. Also, should the rollback API be always be able to rollback ? ContainerLaunchContext already has the ContainerRetryContext ? can we reuse that retryContext? public void upgradeContainer(ContainerId containerId, ContainerLaunchContext upgradeLaunchContext, boolean autoCommit, ContainerRetryContext retryFailureContext) throws YarnException { The ContainerImpl#ContainerRetryContext is not updated to new value on upgrade. RetryFailureTranstion: it's a bit complicated.. is it possible to simplify it something like below: if (shouldRetry(exitCode, retryContext)) { } else if (shouldRollback) { } else { // exit }
        Hide
        asuresh Arun Suresh added a comment -

        Uploading initial patch based on the framework laid out in YARN-5620.
        I havn't submitted the patch yet since it depends on YARN-5620.. will do once that goes in.

        Jian He, Varun Vasudev... do let me know what you think.

        Some assumptions:

        • AutoCommit is set to true by default
        • If retryFailurePolicy is NOT specified, then the Container will be terminated immediately if upgrade fails.
        • If retryFailurePolicy is specified, the upgraded Container will be retried/relaunch as per the policy.. and if that fails... then container will be rolledback to previous state.
        • If autoCommit is false and user has called the commitUpgarde API, then no rollback will be allowed after that.
        • If autoCommit is true then no Explicit or implicit (upgraded process fails to start) rollback will be allowed.
        Show
        asuresh Arun Suresh added a comment - Uploading initial patch based on the framework laid out in YARN-5620 . I havn't submitted the patch yet since it depends on YARN-5620 .. will do once that goes in. Jian He , Varun Vasudev ... do let me know what you think. Some assumptions: AutoCommit is set to true by default If retryFailurePolicy is NOT specified, then the Container will be terminated immediately if upgrade fails. If retryFailurePolicy is specified, the upgraded Container will be retried/relaunch as per the policy.. and if that fails... then container will be rolledback to previous state. If autoCommit is false and user has called the commitUpgarde API, then no rollback will be allowed after that. If autoCommit is true then no Explicit or implicit (upgraded process fails to start) rollback will be allowed.

          People

          • Assignee:
            asuresh Arun Suresh
            Reporter:
            asuresh Arun Suresh
          • Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development