Hadoop Map/Reduce
MAPREDUCE-1817

JT and TT should not have to match build versions

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 0.20.1, 0.20.2, 0.20.3, 0.21.0, 0.22.0
    • Fix Version/s: 0.22.0
    • Component/s: tasktracker
    • Labels:
      None

      Description

      TaskTracker#offerService checks for a match with the JT VersionInfo#getBuildVersion, and fails if they are the same version but happen to be built at a different time of day. It seems like the correct test is VersionInfo#getRevision. fwiw the NN and DN do not have to match build versions.
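The check described above can be sketched roughly as follows. This is a minimal illustration, not the TaskTracker code: `BuildInfo`, `buildVersionsMatch`, and `revisionsMatch` are hypothetical names, the version and revision values are made up, and the string format assumes (as this description does) that the build version embeds the compile date.

```java
public class BuildVersionCheckSketch {
    // Hypothetical stand-in for the values VersionInfo reports on one daemon.
    static final class BuildInfo {
        final String version;
        final String revision;
        final String user;
        final String buildDate;

        BuildInfo(String version, String revision, String user, String buildDate) {
            this.version = version;
            this.revision = revision;
            this.user = user;
            this.buildDate = buildDate;
        }

        // Assumed format: a single string that embeds the compile date.
        String buildVersion() {
            return version + " from " + revision + " by " + user + " on " + buildDate;
        }
    }

    // Strict check: the full build version strings must be identical.
    static boolean buildVersionsMatch(BuildInfo jt, BuildInfo tt) {
        return jt.buildVersion().equals(tt.buildVersion());
    }

    // Looser check suggested in the description: compare revisions only.
    static boolean revisionsMatch(BuildInfo jt, BuildInfo tt) {
        return jt.revision.equals(tt.revision);
    }

    public static void main(String[] args) {
        // Same source revision, packaged at two different times (made-up values).
        BuildInfo jt = new BuildInfo("0.20.2", "r12345", "builder", "2010-05-25 09:00");
        BuildInfo tt = new BuildInfo("0.20.2", "r12345", "builder", "2010-05-25 17:30");

        System.out.println("build versions match: " + buildVersionsMatch(jt, tt));
        System.out.println("revisions match: " + revisionsMatch(jt, tt));
    }
}
```

Under these assumptions the strict comparison rejects the TT even though both daemons were built from the same revision, which is the situation the description reports.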

        Activity

        Eli Collins added a comment -

        I see this was already discussed in HADOOP-5203. Would people be willing to allow this as optional behavior, where the default is the current behavior? Users would like to be able to run MR on a cluster with heterogeneous nodes (i.e. same bits but different build times due to different packages).

        Owen O'Malley added a comment -

        That's a feature, not a bug. Trust me, you think you want this behavior, but it will bite you in the tail. The critical issue is that if you make a fix to the framework, you need to know that all of the task trackers have the fix. Otherwise, you are asking for a long and painful debugging session. I've been there and done that, which is why that code is in there. :)

        HADOOP-5203 already switched the check to use the checksum of the source rather than the date of compilation, because that was Runping's use case. Note that you can already handle multiple archs as long as you build in the same directory.
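To illustrate why a source checksum removes the dependence on compile time, here is a small sketch. It is not the Hadoop build logic: the MD5 choice, the `checksum` and `buildVersion` helpers, the string layout, and all sample values are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SourceChecksumSketch {
    // Hex-encoded MD5 of the source text. The real build may hash differently.
    static String checksum(String source) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(source.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Assumed layout: version, revision, user, and source checksum; no build date.
    static String buildVersion(String version, String revision, String user, String source) {
        return version + " from " + revision + " by " + user
                + " source checksum " + checksum(source);
    }

    public static void main(String[] args) {
        String source = "class Framework { /* original */ }";
        String patched = "class Framework { /* with fix */ }";

        // Two builds of identical source, whenever they were compiled.
        String jt = buildVersion("0.20.2", "r12345", "builder", source);
        String tt = buildVersion("0.20.2", "r12345", "builder", source);
        // A build of locally patched source at the same revision.
        String ttPatched = buildVersion("0.20.2", "r12345", "builder", patched);

        System.out.println("same source matches: " + jt.equals(tt));
        System.out.println("patched source matches: " + jt.equals(ttPatched));
    }
}
```

In this sketch the string is a function of the source contents only, so identical source built at different times compares equal, while a framework fix that was recompiled without a new revision still produces a mismatch.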

        Eli Collins added a comment -

        Hey Owen,

        Thanks for chiming in. Seems like there must be a way we can improve the check so that it really tests that the versions of the code are the same without requiring that the bits be built in exactly the same minute. That's my understanding of what getRevision would do. Any fix will bump the revision number; if the JT gets updated with this revision number it should refuse service to an old TT since the revision won't match, right? I.e. any change to the source should bump the revision number. If that's not the case then it seems like the real bug is that we need a better notion of "revision number". If you're recompiling the code w/o checking in a fix to bump the revision number then there be dragons...

        The motivation is that some users don't run the same version of Linux on all the machines in their cluster. For various reasons not everyone standardizes on one install for all machines, even though that's best practice. That means they use different packages, and it's reasonable that different packages are not built at exactly the same time of day. So the options for them today are to either reinstall the OS on a bunch of machines or run two instances of Hadoop, both of which are pretty painful too.

        Thanks,
        Eli

        Owen O'Malley added a comment -

        Sigh

        Look at the patch for HADOOP-5203. After that patch, it does not depend on when the compilation was done.

        Even before the patch for HADOOP-5203 there were workarounds, but it isn't an issue any more.

        Eli Collins added a comment -

        Arg, sorry for the noise! I should have read your previous comment more closely.

        We need to update the javadoc to match reality, since people actually read it. Will do in a separate jira.

           /**
            * Returns the buildVersion which includes version, 
        -   * revision, user and date. 
        +   * revision, user and source checksum.
            */
        

        Btw http://en.wikipedia.org/wiki/Sigh is a funny read. People sigh to cool down their organs. Who knew...


          People

          • Assignee: Unassigned
          • Reporter: Eli Collins
          • Votes: 0
          • Watchers: 1
