Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      At start time, the TT checks whether its version is compatible with the JT's.
      The condition is too restrictive.
      The TT will shut itself down if any of the following checks fails:

      • the version numbers must match
      • the revision numbers must match
      • the user IDs who built the jar must match
      • the build times must match

      I think it should check the major part of the version numbers only (thus any version like 0.19.xxxx should be compatible).
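
      For illustration only, here is a minimal, self-contained Java sketch of the lenient comparison proposed above; the class and method names are hypothetical and not part of any attached patch:

      public class VersionCheck {
          // Hypothetical lenient check: treat versions as compatible when their
          // major.minor prefixes match, so 0.19.0 and 0.19.1 pass, 0.20.0 fails.
          static boolean compatibleMajorMinor(String a, String b) {
              String[] pa = a.split("\\.");
              String[] pb = b.split("\\.");
              return pa.length >= 2 && pb.length >= 2
                  && pa[0].equals(pb[0]) && pa[1].equals(pb[1]);
          }

          public static void main(String[] args) {
              System.out.println(compatibleMajorMinor("0.19.0", "0.19.1")); // true
              System.out.println(compatibleMajorMinor("0.19.0", "0.20.0")); // false
          }
      }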

      Attachments

      1. HADOOP-5203.patch
        0.7 kB
        Doug Cutting
      2. HADOOP-5203-md5.patch
        3 kB
        Rick Cox
      3. HADOOP-5203-md5-0.18.3.patch
        3 kB
        Bill Au
      4. HADOOP-5203-md5.patch
        3 kB
        Rick Cox
      5. HADOOP-5203-md5-0.19.2.patch
        3 kB
        Steve Rowe
      6. HADOOP-5203-md5-0.20.1.patch
        3 kB
        Bill Au
      7. HADOOP-5203-md5-1.1.patch
        3 kB
        Matt Foley

          Activity

          Owen O'Malley added a comment -

          This is a feature. One common failure mode is that some nodes fail to pick up a new jar file during an update. Without this feature, they attempt to join the cluster and can create serious problems. Currently, both map/reduce and hdfs require that masters and slaves are from the same build, which is a very good thing.

          Runping Qi added a comment -

          Well, this is a bad feature.

          Here is a situation one may face. Say you have a cluster with different machines: some are 32-bit and others are 64-bit.
          One needs to build Hadoop on different machines, and it is almost impossible to have the builds happen at exactly the same time.
          Thus, with this feature, it is almost impossible to deploy Hadoop in this kind of environment.

          Also, suppose you discover a problem with the TT and you want to go ahead, fix it, and build a release. You don't want to shut down the job tracker. But with this feature, you cannot deploy the new task trackers without shutting down the job tracker.

          Owen O'Malley added a comment -

          If you build the releases like the HowToRelease page describes, you'll get both 32- and 64-bit binaries with a single date stamp.

          Hadoop doesn't support rolling upgrades and patching 1/2 of the TaskTrackers sounds like a recipe for disaster.

          Runping Qi added a comment -

          You may be able to do a compatible build for the 32-bit and 64-bit versions.
          How about environments with different OSes? Different architectures?
          We happen to be in an environment where Hadoop has to be built for different kinds of environments on different machines.

          Owen O'Malley added a comment -

          Assuming you can build in an NFS directory that is cross-mounted on the various hosts, just ssh over to the relevant host and build the native code on each platform. On the last platform, generate the tar ball and you will have a single build with all of the native bits compiled on multiple platforms.

          Rick Cox added a comment -

          That's reasonable if you're manually building Hadoop and only Hadoop, but in many scenarios, Hadoop is just one component, and is automatically built as part of a much larger build system. It would be basically impossible to automate the suggested build procedure using our internal build system. The one OS distro I'm at all familiar with (NetBSD pkgsrc) seems like it would also have trouble building like that; I'm guessing other from-source distributions will as well.

          Doug Cutting added a comment -

          Here's a patch that lets you override the script that writes the version file. Would this work for you?

          Rick Cox added a comment -

          Thank you for the patch. I've applied it and verified that it integrates with our build system, though I still believe that including the date will cause this problem (which now has a supported work-around) to occur for other people using other build systems.

          Doug Cutting added a comment -

          > I still believe that including the date will cause this problem (which now has a supported work-around) to occur for other people using other build systems.

          And Owen feels strongly that the current behavior protects us against real hazards: he's had clusters that only partially upgraded and encountered unexpected, difficult-to-diagnose problems. We need to reach a consensus, so the question is not whether this is the way you would implement this in a vacuum, but whether this is a solution that we can all agree to. Do you have an alternate proposal that you prefer that Owen might support? If not, can you live with the solution I suggested?

          Rick Cox added a comment -

          Does "only tasktrackers and jobtrackers built from exactly the same source files should communicate" accurately describe the desired policy?

          If so, I can try to provide a patch for saveVersion.sh that will md5sum the entire source and use that in the version identifier, as an alternative way to enforce that policy. A single-character source-level change would be declared incompatible, but rebuilding from exactly the same source would be OK.

          Doug Cutting added a comment -

          > I can try to provide a patch for saveVersion.sh that will md5sum the entire source and use that in the version identifier [...]

          That sounds like it could work well to me. The tricky bit might be concisely and reliably determining the set of files to checksum. On one hand we don't want to include svn, git or other generated files, and on the other we don't want to have hard-to-maintain lists of patterns to include or ignore.

          Runping Qi added a comment -

          I would prefer to provide options that allow the user to specify which fields of the build version are compared.
          The default should be the same as it is now (all fields), but the user could, for example, specify that only the Hadoop version numbers and revision numbers be compared.
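
          A minimal sketch of what that option could look like, assuming a hypothetical comma-separated list of field names; this proposal was not adopted and none of these names exist in Hadoop:

          import java.util.Map;
          import java.util.Objects;

          public class FieldSubsetCheck {
              // Hypothetical: compare only the build-version fields named in a
              // config value such as "version,revision", ignoring the rest.
              static boolean compatible(Map<String, String> tt, Map<String, String> jt,
                                        String fieldsToCompare) {
                  for (String field : fieldsToCompare.split(",")) {
                      if (!Objects.equals(tt.get(field), jt.get(field))) return false;
                  }
                  return true;
              }

              public static void main(String[] args) {
                  Map<String, String> tt = Map.of("version", "0.19.1", "revision", "r1", "date", "Mon");
                  Map<String, String> jt = Map.of("version", "0.19.1", "revision", "r1", "date", "Tue");
                  System.out.println(compatible(tt, jt, "version,revision")); // true: date ignored
              }
          }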

          Doug Cutting added a comment -

          > I would prefer to provide options that allow the user to specify which fields of the build version are compared.

          Doesn't my patch let you implement something equivalent, by letting you control what's in the version string?

          Runping Qi added a comment -

          My thinking is that it is preferable for one to be able to specify the subset of buildVersion fields as an argument to a task tracker.
          The task tracker uses that info to check whether it is compatible with the job tracker.
          This way, we don't have to customize saveVersion.sh.

          Doug Cutting added a comment -

          Do you have a patch?

          Runping Qi added a comment -

          Rick and I chatted about this a moment ago.
          We agreed that the checksum-based approach should work better.

          Bill Au added a comment -

          What's the status of the patch for the checksum-based approach? I am using 0.18.3 and am willing to backport such a patch to 0.18.3 if necessary. In the meantime it looks like I will have to use Doug's approach to get around this problem, since I am using a cluster with mixed hardware and OSes.

          Rick Cox added a comment -

          Doug's patch definitely provides a functional work-around. I'm attaching the current state of the checksum patch. The core part of the patch is this shell line, which computes a checksum of the *.java files in src:

          srcChecksum=`find src -name '*.java' | LC_ALL=C sort | xargs md5sum | md5sum | cut -d ' ' -f 1`

          I'd definitely appreciate suggestions for improving that. Some other obvious solutions don't work because they don't produce the same checksums on various platforms (including using a "tar cf - -T -" in place of the "xargs md5sum"). For now, that does produce the same checksum value on Mac OS X and Linux; I'll dig out a Cygwin box and test there next, but don't have access to a Solaris box. For reference, the checksum it computes is 8238dee4c81d21734c34d7e5c420cfe2 for source at svn revision 762607 plus the attached patch.

          I used *.java because it seems pretty unlikely you could produce an incompatible protocol change without modifying any java files, but could expand that if it seems too restrictive. This does allow documentation changes and build file changes, for example.
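
          For readers who want to reproduce the value outside the build, here is a rough Java equivalent of that pipeline. This is a sketch only: the committed patch uses the shell line above, and path-separator or line-ending differences can still change the digest across platforms.

          import java.nio.file.*;
          import java.security.MessageDigest;
          import java.util.stream.Stream;

          public class SrcChecksum {
              public static void main(String[] args) throws Exception {
                  // Hash each .java file under src in sorted path order, then hash
                  // the concatenated "digest  path" lines, mirroring
                  // find | sort | xargs md5sum | md5sum.
                  MessageDigest outer = MessageDigest.getInstance("MD5");
                  try (Stream<Path> files = Files.walk(Paths.get("src"))) {
                      files.filter(p -> p.toString().endsWith(".java"))
                           .sorted()
                           .forEach(p -> outer.update(md5Line(p).getBytes()));
                  }
                  System.out.println(toHex(outer.digest()));
              }

              // One line of md5sum output: "<hex digest>  <path>\n".
              static String md5Line(Path p) {
                  try {
                      MessageDigest md = MessageDigest.getInstance("MD5");
                      md.update(Files.readAllBytes(p));
                      return toHex(md.digest()) + "  " + p + "\n";
                  } catch (Exception e) {
                      throw new RuntimeException(e);
                  }
              }

              static String toHex(byte[] digest) {
                  StringBuilder sb = new StringBuilder();
                  for (byte b : digest) sb.append(String.format("%02x", b));
                  return sb.toString();
              }
          }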

          Doug Cutting added a comment -

          Unless I'm missing something, your patch doesn't actually change how the version is checked, only how it's computed. Did you intend to combine our patches, or is your patch incomplete?

          Rick Cox added a comment -

          It's a bit obfuscated, but VersionInfo.getBuildVersion() is used by the tasktrackers to compare versions:

          TaskTracker.java:1056: if(!VersionInfo.getBuildVersion().equals(jobTrackerBV)) {

          This patch changes the string returned by VersionInfo.getBuildVersion() to include the srcChecksum instead of the date.

          The namenode-datanode compatibility check uses Storage.getBuildVersion(), but that calls VersionInfo.getRevision(), so it didn't have the precise problem addressed by this JIRA (though it might be worth changing it to match).
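
          To make the effect concrete, a small sketch of the strict equality semantics; the exact layout of the build-version string below is illustrative, not the real VersionInfo format:

          public class BuildVersionCheck {
              // Illustrative build-version string: before the patch the build date
              // appeared here; after it, the source checksum does.
              static String buildVersion(String version, String rev, String srcChecksum) {
                  return version + " from " + rev + " source checksum " + srcChecksum;
              }

              public static void main(String[] args) {
                  String tt = buildVersion("0.21.0", "r762607", "8238dee4c81d21734c34d7e5c420cfe2");
                  String jt = buildVersion("0.21.0", "r762607", "8238dee4c81d21734c34d7e5c420cfe2");
                  // The same strict comparison as TaskTracker.java:1056: any field
                  // that differs makes the tasktracker shut itself down.
                  System.out.println(tt.equals(jt) ? "compatible" : "incompatible");
              }
          }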

          Bill Au added a comment -

          I backported Rick's patch to 0.18.3. I have tested it in my cluster, which includes 5 different hardware/OS combinations. All the different TaskTrackers were able to come up without any problem. I am attaching my patch.

          I also want to confirm what Rick has pointed out: all the Datanodes have no problem talking to the Namenode in my cluster of mixed machines.

          Doug Cutting added a comment -

          This looks fine to me. Owen?

          Owen O'Malley added a comment -

          +1

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12404866/HADOOP-5203-md5-0.18.3.patch
          against trunk revision 763107.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/163/console

          This message is automatically generated.

          Bill Au added a comment -

          It looks like the wrong patch was applied to the trunk. HADOOP-5203-md5.patch is for the trunk. HADOOP-5203-md5-0.18.3.patch is for the 0.18 branch.

          Rick Cox added a comment -

          Reattaching the same patch as before to make it the latest patch for Hudson.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12405011/HADOOP-5203-md5.patch
          against trunk revision 763502.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/169/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/169/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/169/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/169/console

          This message is automatically generated.

          Rick Cox added a comment -

          Tests with errors were TestMRServerPorts (appears to have encountered ports already in use) and TestQueueCapacities (capacity-scheduler, appears to have issues with file access and locking), both of which are unlikely to be related to this patch.

          Sharad Agarwal added a comment -

          I committed this. Thanks Rick.

          Steve Rowe added a comment -

          Here is a version of Bill Au's 0.18.3 backported patch, adjusted to apply cleanly to 0.19.2.

          Bill Au added a comment -

          Same patch adjusted to apply cleanly to 0.20.1.

          Matt Foley added a comment -

          Note that the "Fix Version" is indeed correct; this patch was committed to trunk after the 0.20 branch point, so only v0.21 and higher have the fix. HADOOP-7960 ports it to branch-1 so that v1.1 and higher will also have the fix.

          Matt Foley added a comment -

          This is almost identical to Bill Au's version for v0.20.1, so thanks Bill!


            People

             • Assignee: Rick Cox
             • Reporter: Runping Qi
             • Votes: 0
             • Watchers: 5
