Hadoop YARN / YARN-6798

Fix NM startup failure with old state store due to version mismatch

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha4
    • Fix Version/s: 2.9.0, 3.0.0-beta1
    • Component/s: nodemanager
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:

      This fixes the LevelDB state store for the NodeManager. As of this patch, the state store versions now correspond to the following table.

      * Previous Patch: YARN-5049
        * LevelDB Key: queued
        * Hadoop Versions: 2.9.0, 3.0.0-alpha1
        * Corresponding LevelDB Version: 1.2
      * Previous Patch: YARN-6127
        * LevelDB Key: AMRMProxy/NextMasterKey
        * Hadoop Versions: 2.9.0, 3.0.0-alpha4
        * Corresponding LevelDB Version: 1.1

      Description

      YARN-6703 rolled back the state store version number for the RM from 2.0 to 1.4.

      YARN-6127 bumped the version for the NM to 3.0

      private static final Version CURRENT_VERSION_INFO = Version.newInstance(3, 0);

      YARN-5049 bumped the version for the NM to 2.0

      private static final Version CURRENT_VERSION_INFO = Version.newInstance(2, 0);

      All NMs died on startup after a C6 cluster was upgraded from alpha2 to alpha4:

      2017-07-07 15:48:17,259 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
      org.apache.hadoop.service.ServiceStateException: java.io.IOException: Incompatible version for NM state: expecting NM state version 3.0, but loading version 2.0
              at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:246)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:307)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:748)
              at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:809)
      Caused by: java.io.IOException: Incompatible version for NM state: expecting NM state version 3.0, but loading version 2.0
              at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.checkVersion(NMLeveldbStateStoreService.java:1454)
              at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:1308)
              at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:307)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              ... 5 more
      2017-07-07 15:48:17,277 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down NodeManager at xxx.gce.cloudera.com/aa.bb.cc.dd
      ************************************************************/
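The exception above is raised by the version check the NM runs when it opens its state store. The following is a minimal sketch of that kind of check, assuming (as the discussion in this issue suggests) that two versions count as compatible when their major numbers match. `StateVersion` and `VersionCheckSketch` are hypothetical stand-ins for illustration, not the actual Hadoop `Version` record or `NMLeveldbStateStoreService` code.

```java
import java.io.IOException;

/** Hypothetical stand-in for a (major, minor) state store version. */
final class StateVersion {
  final int major;
  final int minor;

  StateVersion(int major, int minor) {
    this.major = major;
    this.minor = minor;
  }

  /** Assumption: versions are compatible when the major numbers match. */
  boolean isCompatibleTo(StateVersion other) {
    return major == other.major;
  }

  @Override
  public String toString() {
    return major + "." + minor;
  }
}

final class VersionCheckSketch {
  /** Throws if the version found on disk cannot be read by the running software. */
  static void checkVersion(StateVersion current, StateVersion loaded) throws IOException {
    if (!loaded.isCompatibleTo(current)) {
      throw new IOException("Incompatible version for NM state: expecting NM state version "
          + current + ", but loading version " + loaded);
    }
  }

  public static void main(String[] args) throws IOException {
    // A minor-only difference (1.0 on disk, 1.2 in the software) passes the check.
    checkVersion(new StateVersion(1, 2), new StateVersion(1, 0));
    try {
      // A major difference (2.0 on disk, 3.0 in the software) fails like the log above.
      checkVersion(new StateVersion(3, 0), new StateVersion(2, 0));
    } catch (IOException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Under this assumption, every major bump makes an existing store unreadable unless a migration step is provided, which is why the discussion below centers on staying within 1.x.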
      
      1. YARN-6798.v1.patch
        1 kB
        Botong Huang
      2. YARN-6798.v2.patch
        1 kB
        Ray Chiang

          Activity

          botong Botong Huang added a comment - - edited

          Thanks Karthik Kambatla for catching it. I've created YARN-7074 to fix the typo.

          kasha Karthik Kambatla added a comment -

          Nitpick on the patch: the second line says 1.2 to 1.2. The intention was likely 1.1 to 1.2?

          subru Subru Krishnan added a comment -

          Backported this to branch-2 based on Karthik Kambatla's feedback here on YARN-6127.

          botong Botong Huang added a comment -

          Thanks Ray Chiang!

          rchiang Ray Chiang added a comment -

          Committed to trunk.

          Thanks Botong Huang for the contribution! Thanks Jason Lowe and Arun Suresh for the comments!

          rchiang Ray Chiang added a comment -

          +1

          I'm going to commit this tomorrow unless I hear otherwise.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          | Vote | Subsystem | Runtime | Comment |
          |---|---|---|---|
          | 0 | reexec | 0m 11s | Docker mode activated. |
          | | | | Prechecks |
          | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
          | -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
          | | | | trunk Compile Tests |
          | +1 | mvninstall | 12m 41s | trunk passed |
          | +1 | compile | 0m 28s | trunk passed |
          | +1 | checkstyle | 0m 18s | trunk passed |
          | +1 | mvnsite | 0m 26s | trunk passed |
          | -1 | findbugs | 0m 41s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager in trunk has 5 extant Findbugs warnings. |
          | +1 | javadoc | 0m 18s | trunk passed |
          | | | | Patch Compile Tests |
          | +1 | mvninstall | 0m 22s | the patch passed |
          | +1 | compile | 0m 25s | the patch passed |
          | +1 | javac | 0m 25s | the patch passed |
          | +1 | checkstyle | 0m 14s | the patch passed |
          | +1 | mvnsite | 0m 23s | the patch passed |
          | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
          | +1 | findbugs | 0m 44s | the patch passed |
          | +1 | javadoc | 0m 14s | the patch passed |
          | | | | Other Tests |
          | +1 | unit | 12m 53s | hadoop-yarn-server-nodemanager in the patch passed. |
          | +1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
          | | | 31m 53s | |



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-6798
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12877635/YARN-6798.v2.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux 1a3c587bbb28 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / b0e78ae
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/16468/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-warnings.html
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/16468/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/16468/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          rchiang Ray Chiang added a comment -

          Updated Botong's patch with the newer version organization.

          botong Botong Huang added a comment -

          Sounds good, thx!

          rchiang Ray Chiang added a comment - - edited

          Updating the version table:

          | Patch | LevelDB Key(s) | Hadoop Versions | Commit Date | NM LevelDB Version |
          |---|---|---|---|---|
          | YARN-5049 | queued | 2.9.0, 3.0.0-alpha1 | May 11, 2016 | 1.2 |
          | YARN-6127 | AMRMProxy/NextMasterKey | 2.9.0, 3.0.0-alpha4 | June 22, 2017 | 1.1 |
          rchiang Ray Chiang added a comment -

          Thanks Arun Suresh! Botong Huang, it looks like we'll use 1.2 as our current version.

          asuresh Arun Suresh added a comment -

          Ray Chiang, I've committed YARN-5049 to branch-2 (cherry-picked and set the version to 1.2)

          rchiang Ray Chiang added a comment -

          Thanks Arun Suresh. That's what I get for relying on JIRA and forgetting to check git.

          asuresh Arun Suresh added a comment -

          I had rolled back YARN-5049 from branch-2 precisely because of the major version bump. Will cherry-pick and update it today with a minor version bump. You can mark this

          rchiang Ray Chiang added a comment -

          Finally got a bit of time to look at the previous patches. I see a minor issue.

          | Patch | LevelDB Key(s) | Hadoop Versions | Commit Date |
          |---|---|---|---|
          | YARN-5049 | queued | 3.0.0-alpha1 | May 11, 2016 |
          | YARN-6127 | AMRMProxy/NextMasterKey | 2.9.0, 3.0.0-alpha4 | June 22, 2017 |

          So, branch-2 has just YARN-6127, while trunk has YARN-5049 and YARN-6127. If we label YARN-5049 as 1.1 and YARN-6127 as 1.2, then branch-2's having a 1.2 version won't quite be accurate. If we do the reverse, we'd be chronologically backward (which seems okay to me, but I'd like a second opinion).

          rchiang Ray Chiang added a comment -

          "It would be helpful to have a release note that calls out the incompatibility with 3.0-alpha releases and that users who are upgrading from one of those releases will need to erase the NM state store on each node before upgrading."

          Agreed. I intend to modify the release notes for this JIRA and the previous two to make this versioning issue clear.

          jlowe Jason Lowe added a comment -

          IMHO we should only need to bump the major version if any of the following are true:

          • Older NM software will explode when it tries to recover the state store
          • Older NM software fails to do something crucial during recovery due to ignoring something in the state store

          otherwise we can keep the major version the same and simply bump the minor version. It looks like the two features were added to the state store in a way that lets us remain on 1.x, but I haven't dug into it deeply to be sure.

          "This will be incompatible with the previous alphas and anyone running directly from branch-2 builds."

          True, but that's the risk of running on unreleased software (as is the case with branch-2). Anyone could check in something that isn't backwards-compatible that needs to be subsequently fixed, and that could break users who happened to deploy in-between. AFAIK we don't make any commitments to compatibility except for official Apache Hadoop releases.

          I would argue the same applies to alpha releases. The whole point of calling it alpha is to convey that APIs may be unstable and could disappear or change in an incompatible way in the next release. It will be annoying to users who expect to do a rolling upgrade from 3.0-alphaX, but given the "alpha" tag I would not expect anyone to have deployed this in a production environment such that they cannot live with a downtime when upgrading to a subsequent release.

          It would be helpful to have a release note that calls out the incompatibility with 3.0-alpha releases and that users who are upgrading from one of those releases will need to erase the NM state store on each node before upgrading.
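The rule of thumb at the top of this comment can be written down as a small decision function. This is only an illustrative sketch: `VersionBumpPolicy`, its method names, and the two boolean inputs are hypothetical and do not exist in Hadoop. They simply encode the two conditions listed above.

```java
/** Hypothetical helper encoding the major-vs-minor rule described in the comment. */
final class VersionBumpPolicy {
  enum Bump { MAJOR, MINOR }

  /**
   * A major bump is needed only if older NM software would fail while recovering
   * the new state, or would silently skip something crucial during recovery.
   */
  static Bump requiredBump(boolean olderNmFailsOnRecovery, boolean olderNmSkipsCrucialState) {
    return (olderNmFailsOnRecovery || olderNmSkipsCrucialState) ? Bump.MAJOR : Bump.MINOR;
  }

  /** Returns {major, minor} after applying the bump to the given version. */
  static int[] next(int major, int minor, Bump bump) {
    return bump == Bump.MAJOR
        ? new int[] {major + 1, 0}
        : new int[] {major, minor + 1};
  }
}
```

With these assumptions, a change that older software can safely ignore moves 1.1 to 1.2 rather than to 2.0.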

          botong Botong Huang added a comment -

          Yeah, I guess we need to decide to go with 1.1 or 2.1.

          rchiang Ray Chiang added a comment -

          The failed unit test looks like YARN-5857.

          I'm okay with this update as it is. This will be incompatible with the previous alphas and anyone running directly from branch-2 builds. Does anyone have any problems with that?

          hadoopqa Hadoop QA added a comment -
          -1 overall



          | Vote | Subsystem | Runtime | Comment |
          |---|---|---|---|
          | 0 | reexec | 0m 15s | Docker mode activated. |
          | | | | Prechecks |
          | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
          | -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
          | | | | trunk Compile Tests |
          | +1 | mvninstall | 14m 17s | trunk passed |
          | +1 | compile | 0m 31s | trunk passed |
          | +1 | checkstyle | 0m 19s | trunk passed |
          | +1 | mvnsite | 0m 33s | trunk passed |
          | -1 | findbugs | 0m 50s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager in trunk has 5 extant Findbugs warnings. |
          | +1 | javadoc | 0m 18s | trunk passed |
          | | | | Patch Compile Tests |
          | +1 | mvninstall | 0m 27s | the patch passed |
          | +1 | compile | 0m 28s | the patch passed |
          | +1 | javac | 0m 28s | the patch passed |
          | +1 | checkstyle | 0m 16s | the patch passed |
          | +1 | mvnsite | 0m 27s | the patch passed |
          | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
          | +1 | findbugs | 0m 52s | the patch passed |
          | +1 | javadoc | 0m 19s | the patch passed |
          | | | | Other Tests |
          | -1 | unit | 12m 53s | hadoop-yarn-server-nodemanager in the patch failed. |
          | +1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
          | | | 34m 18s | |



          Reason Tests
          Failed junit tests hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:14b5c93
          JIRA Issue YARN-6798
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12876737/YARN-6798.v1.patch
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux be6b0927285c 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / d670c3a
          Default Java 1.8.0_131
          findbugs v3.1.0-RC1
          findbugs https://builds.apache.org/job/PreCommit-YARN-Build/16375/artifact/patchprocess/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-warnings.html
          unit https://builds.apache.org/job/PreCommit-YARN-Build/16375/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
          Test Results https://builds.apache.org/job/PreCommit-YARN-Build/16375/testReport/
          modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
          Console output https://builds.apache.org/job/PreCommit-YARN-Build/16375/console
          Powered by Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          botong Botong Huang added a comment -

          Uploaded the v1 patch, which rolls the version back to 1.1, with added notes. What do you guys think?

          asuresh Arun Suresh added a comment -

          I agree with Jason.

          To move forward, I suggest we follow YARN-6703 and roll the version back to 1.2, since AFAIK we are only introducing new container states in the case of YARN-5049 (and the features that need those states are available only in the new releases) and the completely new AMRMProxy state in the case of YARN-6127.

          jlowe Jason Lowe added a comment -

          I don't know the full story behind these various version bumps, but we need to stop the habit of bumping the major version in the state store without providing a migration path for older versions.

          If we really need to bump the state store major version to support a new feature, my preference would be to do this in a lazy fashion as much as possible, i.e. the major version should not be updated in the state store until the new feature is enabled/used. That way we don't lose the ability to roll back the release to the old version if something goes terribly wrong after the upgrade before the new feature is used. If that's not possible for some reason then the code needs to recognize the older state store versions on startup and either do a one-time pass over the data on startup to migrate it to the new schema or otherwise deal with it on the fly during reading for recovery.
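The lazy approach described above might look roughly like the sketch below. Everything here is hypothetical: `LazyVersionedStore`, the `schema-version` key, and the in-memory map stand in for a real key/value store, and the startup check assumes that software supporting a given major version can also read older ones. The one-time migration alternative is not shown.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a versioned key/value state store. */
final class LazyVersionedStore {
  static final String VERSION_KEY = "schema-version"; // hypothetical key name
  static final int BASE_MAJOR = 1;     // schema that older software understands
  static final int FEATURE_MAJOR = 2;  // schema required once the new feature is used

  private final Map<String, String> db = new HashMap<>();

  LazyVersionedStore() {
    db.put(VERSION_KEY, Integer.toString(BASE_MAJOR));
  }

  int storedMajor() {
    return Integer.parseInt(db.get(VERSION_KEY));
  }

  /** Ordinary writes leave the stored version alone, so a rollback stays possible. */
  void putBaseRecord(String key, String value) {
    db.put(key, value);
  }

  /** The first write made by the new feature is what raises the stored major version. */
  void putFeatureRecord(String key, String value) {
    if (storedMajor() < FEATURE_MAJOR) {
      db.put(VERSION_KEY, Integer.toString(FEATURE_MAJOR));
    }
    db.put(key, value);
  }

  /** Startup check as run by software that supports at most supportedMajor. */
  void checkReadableBy(int supportedMajor) throws IOException {
    if (storedMajor() > supportedMajor) {
      throw new IOException("state store major version " + storedMajor()
          + " is newer than supported major version " + supportedMajor);
    }
  }
}
```

In this sketch, upgrading the software alone never changes the stored version, so the old release can still open the store until the new feature writes its first record.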

          rchiang Ray Chiang added a comment -

          Konstantinos Karanasos, Subru Krishnan, Botong Huang, Arun Suresh. It looks like you guys bumped the NM version twice. Is this behavior desirable, or is it preferable to have more compatible state store versions (i.e. 1.0 -> 1.1 -> 1.2 instead of 1.0 -> 2.0 -> 3.0)?

          Plus, anyone else who has thoughts about NM rolling upgrade, please chime in.


            People

            • Assignee:
              botong Botong Huang
              Reporter:
              rchiang Ray Chiang