Hadoop Common
  1. Hadoop Common
  2. HADOOP-8209

Add option to relax build-version check for branch-1

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.1.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In 1.x DNs currently refuse to connect to NNs if their build revision (ie svn revision) do not match. TTs refuse to connect to JTs if their build version (version, revision, user, and source checksum) do not match.

      This prevents rolling upgrades, which is intentional, see the discussion in HADOOP-5203. The primary motivation in that jira was (1) it's difficult to guarantee every build on a large cluster got deployed correctly, builds don't get rolled back to old versions by accident etc, and (2) mixed versions can lead to execution problems that are hard to debug.

      However there are also cases when users know they two builds are compatible, eg when deploying a new build which contains the same contents as the previous one, plus a critical security patch that does not affect compatibility. Currently deploying a 1 line patch requires taking down the entire cluster (or trying to work around the issue by lying about the build revision or checksum, yuck). These users would like to be able to perform a rolling upgrade.

      In order to support this, let's add an option that is off by default, but, when enabled, makes the DN and TT version check just check for an exact version match (eg "1.0.2") but ignore the build revision (DN) and the source checksum (TT). Two builds still need to match the major, minor, and point numbers, but nothing else.

      1. hadoop-8209.txt
        19 kB
        Eli Collins
      2. hadoop-8209.txt
        25 kB
        Eli Collins
      3. hadoop-8209.txt
        24 kB
        Eli Collins

        Issue Links

          Activity

          Hide
          Andrew Purtell added a comment -

          +1

          We use this practice.

          This makes it possible to do a rolling restart of DataNodes without taking down service by bouncing the NameNode. This is most useful when the change scope is restricted to the DN. If HA is backported to branch-1 we could handle most NN changes similarly: Upgrade the NNs one at a time with manual failover for no downtime. One issue remaining is that modification of a NN<->DN interface method requires a kludgy migration over three updates.

          It is also possible to do this with TaskTrackers, but this will fail currently running tasks on the TT. Even so we can still stage in a TT bugfix release, just more slowly. Bouncing the JobTracker remains a big deal, but the maintenance window for that becomes very short if everything else has been rolled out ahead of time. With some "HA JT" option for branch-1 (Corona?) this might also have no downtime.

          Show
          Andrew Purtell added a comment - +1 We use this practice. This makes it possible to do a rolling restart of DataNodes without taking down service by bouncing the NameNode. This is most useful when the change scope is restricted to the DN. If HA is backported to branch-1 we could handle most NN changes similarly: Upgrade the NNs one at a time with manual failover for no downtime. One issue remaining is that modification of a NN<->DN interface method requires a kludgy migration over three updates. It is also possible to do this with TaskTrackers, but this will fail currently running tasks on the TT. Even so we can still stage in a TT bugfix release, just more slowly. Bouncing the JobTracker remains a big deal, but the maintenance window for that becomes very short if everything else has been rolled out ahead of time. With some "HA JT" option for branch-1 (Corona?) this might also have no downtime.
          Hide
          Eli Collins added a comment -

          Patch attached. Adds new config option hadoop.relaxed.worker.version.check to relax the version check to just the version number. Aside from the new DN and TT tests that cover the current/default behavior and the new behavior, I tested on a cluster and verified that (1) DNs/TTs with different revisions can not join by default, and (2) using the new flag they can (and the new log message for this case is appropriate).

          Show
          Eli Collins added a comment - Patch attached. Adds new config option hadoop.relaxed.worker.version.check to relax the version check to just the version number. Aside from the new DN and TT tests that cover the current/default behavior and the new behavior, I tested on a cluster and verified that (1) DNs/TTs with different revisions can not join by default, and (2) using the new flag they can (and the new log message for this case is appropriate).
          Hide
          Eli Collins added a comment -

          Here are test-patch results. This doesn't introduce new findbugs, a null patch has 8 as well (HADOOP-7847).

               [exec] 
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     +1 tests included.  The patch appears to include 4 new or modified tests.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
               [exec] 
               [exec]     -1 findbugs.  The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings.
               [exec] 
          
          Show
          Eli Collins added a comment - Here are test-patch results. This doesn't introduce new findbugs, a null patch has 8 as well ( HADOOP-7847 ). [exec] [exec] -1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 4 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] -1 findbugs. The patch appears to introduce 8 new Findbugs (version 1.3.9) warnings. [exec]
          Hide
          Aaron T. Myers added a comment -

          Patch looks pretty good to me, Eli. Just a few small comments:

          1. Not obvious to me why we have these static version methods in the Storage class, which themselves just delegate to static methods of the VersionInfo class.
          2. Recommend adding additional detail to the AssertionErrors, including the revisions and versions that didn't match.
          3. Recommend adding an explanation to the DN log message about why the communication is being allowed, e.g.: "... because versions match exactly ('" + version + "') and hadoop.relaxed.worker.version.check is enabled." Ditto for TT.
          4. Similarly the log message explaining why communication isn't being allowed might mention whether the check failed because of strict revision checking, or relaxed version checking.
          5. Why call the new method "getInfoVersion" in JobTracker? getVersion, as was done in Storage, seems to make more sense to me.
          6. In TestTaskTrackerVersionCheck#testDefaultVersionCheck, I don't think you actually test that different revisions are still disallowed by default, since you change both the revision and version simultaneously in the test.
          Show
          Aaron T. Myers added a comment - Patch looks pretty good to me, Eli. Just a few small comments: Not obvious to me why we have these static version methods in the Storage class, which themselves just delegate to static methods of the VersionInfo class. Recommend adding additional detail to the AssertionErrors, including the revisions and versions that didn't match. Recommend adding an explanation to the DN log message about why the communication is being allowed, e.g.: "... because versions match exactly ('" + version + "') and hadoop.relaxed.worker.version.check is enabled." Ditto for TT. Similarly the log message explaining why communication isn't being allowed might mention whether the check failed because of strict revision checking, or relaxed version checking. Why call the new method "getInfoVersion" in JobTracker? getVersion, as was done in Storage, seems to make more sense to me. In TestTaskTrackerVersionCheck#testDefaultVersionCheck, I don't think you actually test that different revisions are still disallowed by default, since you change both the revision and version simultaneously in the test.
          Hide
          Eli Collins added a comment -

          Thanks for the review ATM. Updated patch, and re-tested.

          #1 Yea, was following the existing method but better to use VersionInfo directly, done.
          #2-4 Done
          #5 Because JT#getVersion exists (for MXBean#getVersion), in the updated patch I've addressed this via new NN/JT methods getBuildVersion which return the version, I renamed VersionInfo#getBuildVersion to getFullVersion to clear up the distinction between the build's version and what was called the "build version".
          #6 Fixed, took the same assert from testRelaxedVersionCheck and changed the polarity

          Show
          Eli Collins added a comment - Thanks for the review ATM. Updated patch, and re-tested. #1 Yea, was following the existing method but better to use VersionInfo directly, done. #2-4 Done #5 Because JT#getVersion exists (for MXBean#getVersion), in the updated patch I've addressed this via new NN/JT methods getBuildVersion which return the version, I renamed VersionInfo#getBuildVersion to getFullVersion to clear up the distinction between the build's version and what was called the "build version". #6 Fixed, took the same assert from testRelaxedVersionCheck and changed the polarity
          Hide
          Aaron T. Myers added a comment -

          I reviewed the delta, and it largely looks good. One tiny nit: looks like you variously spelled the word "disallow" either as "dissallow" or "dissalow".

          Patch looks good otherwise - +1. Please do also run the branch-1 test suite on the latest patch before committing.

          Show
          Aaron T. Myers added a comment - I reviewed the delta, and it largely looks good. One tiny nit: looks like you variously spelled the word "disallow" either as "dissallow" or "dissalow". Patch looks good otherwise - +1. Please do also run the branch-1 test suite on the latest patch before committing.
          Hide
          Eli Collins added a comment -

          Thanks ATM. Will fix the spelling misstake, running the full suite now.

          Show
          Eli Collins added a comment - Thanks ATM. Will fix the spelling misstake, running the full suite now.
          Hide
          Sanjay Radia added a comment -
          • I thought we were focusing on rolling upgrades for Hadoop 2 not Hadoop 1 given that wire compatibility is only in Hadoop 2.
          • The jira title should be "Add option to relax build-version check for branch-1"
          Show
          Sanjay Radia added a comment - I thought we were focusing on rolling upgrades for Hadoop 2 not Hadoop 1 given that wire compatibility is only in Hadoop 2. The jira title should be "Add option to relax build-version check for branch-1"
          Hide
          Eli Collins added a comment -

          Sanjay,
          See this discussion on HDFS-2983. For v1 this only enables rolling upgrade when there's an exact version match (eg v1.0.2), this is still very useful though as it allows people to perform rolling upgrades for a security patch or an EBF that doesn't affect compatibility (most don't).
          Updated the jira tile.

          Show
          Eli Collins added a comment - Sanjay, See this discussion on HDFS-2983 . For v1 this only enables rolling upgrade when there's an exact version match (eg v1.0.2), this is still very useful though as it allows people to perform rolling upgrades for a security patch or an EBF that doesn't affect compatibility (most don't). Updated the jira tile.
          Hide
          Eli Collins added a comment -

          Ran all the tests. There are 5 MR failures on branch-1, confirmed they all fail on a clean tree and filed MAPREDUCE-4142 for them.

          Show
          Eli Collins added a comment - Ran all the tests. There are 5 MR failures on branch-1, confirmed they all fail on a clean tree and filed MAPREDUCE-4142 for them.
          Hide
          Tom White added a comment -

          > I renamed VersionInfo#getBuildVersion to getFullVersion to clear up the distinction between the build's version and what was called the "build version".

          Rather than renaming, maybe add getFullVersion() and deprecate getBuildVersion()? Also, is this change needed on trunk as well?

          Show
          Tom White added a comment - > I renamed VersionInfo#getBuildVersion to getFullVersion to clear up the distinction between the build's version and what was called the "build version". Rather than renaming, maybe add getFullVersion() and deprecate getBuildVersion()? Also, is this change needed on trunk as well?
          Hide
          Eli Collins added a comment -

          Thanks for the feedback Tom. Looked at this again and I think it's better to leave VersionInfo as is (not rename getBuildVersion to getFullVersion) and just make InterTrackerProtocol match, eg InterTrackerProtocol#getBuildVersion should return VersionInfo#getBuildVersion and add InterTrackerProtocol#getVIVersion that returns VersionInfo#getVersion (there's already a getVersion method for MXBean). Sound good?

          Show
          Eli Collins added a comment - Thanks for the feedback Tom. Looked at this again and I think it's better to leave VersionInfo as is (not rename getBuildVersion to getFullVersion) and just make InterTrackerProtocol match, eg InterTrackerProtocol#getBuildVersion should return VersionInfo#getBuildVersion and add InterTrackerProtocol#getVIVersion that returns VersionInfo#getVersion (there's already a getVersion method for MXBean). Sound good?
          Hide
          Eli Collins added a comment -

          Forgot to mention, leaving VersionInfo as is means we don't need to do anything for trunk.

          Show
          Eli Collins added a comment - Forgot to mention, leaving VersionInfo as is means we don't need to do anything for trunk.
          Hide
          Eli Collins added a comment -

          Patch attached. Minor change from the last one.

          Show
          Eli Collins added a comment - Patch attached. Minor change from the last one.
          Hide
          Tom White added a comment -

          Agree that not changing VersionInfo is better. +1 from me if Jenkins comes back OK.

          Show
          Tom White added a comment - Agree that not changing VersionInfo is better. +1 from me if Jenkins comes back OK.
          Hide
          Eli Collins added a comment -

          Thanks Tom, re-ran test-patch and the tests.

          Show
          Eli Collins added a comment - Thanks Tom, re-ran test-patch and the tests.
          Hide
          Eli Collins added a comment -

          Thanks for the reviews ATM and Tom, I've committed this to branch-1.

          Show
          Eli Collins added a comment - Thanks for the reviews ATM and Tom, I've committed this to branch-1.
          Hide
          Matt Foley added a comment -

          Closed upon release of Hadoop-1.1.0.

          Show
          Matt Foley added a comment - Closed upon release of Hadoop-1.1.0.

            People

            • Assignee:
              Eli Collins
              Reporter:
              Eli Collins
            • Votes:
              0 Vote for this issue
              Watchers:
              17 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development