Mahout / MAHOUT-994

mahout script shouldn't rely on HADOOP_HOME since that was deprecated in all major Hadoop branches

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: Integration
    • Labels:
      None

      Description

      Mahout should follow the Pig and Hive example and not rely explicitly on HADOOP_HOME and HADOOP_CONF_DIR

      1. MAHOUT-994.patch
        3 kB
        tom pierce
      2. MAHOUT-994.patch.txt
        3 kB
        Roman Shaposhnik

        Activity

        Dmitriy Lyubimov added a comment - - edited

        What should be relied on in the new Hadoop branches to find the Hadoop client libraries and settings?

        Could you please describe the solution?

        Roman Shaposhnik added a comment -

        Sure. In fact, let me use Pig as a good example of how to handle these cases (scroll to line #162):
        http://svn.apache.org/viewvc/pig/trunk/bin/pig?revision=1228547&view=markup

        Given that Pig, just like Mahout, supports a local mode where it doesn't need Hadoop at all, I think
        if we can agree on the approach they are taking, I can submit a patch for you guys, essentially
        replacing lines #202-243 of:
        http://svn.apache.org/viewvc/mahout/trunk/bin/mahout?revision=1186718&view=markup
        with code similar to what Pig has.

        What do you think?

        Dmitriy Lyubimov added a comment -

        Looking at the Pig code and your line references, I am not sure I can see how they add the $HADOOP_HOME/lib files to the classpath. I also see that they still use HADOOP_CONF_DIR to add the default Hadoop configuration resources to the classpath.

        And finally, your reference points to the initialization of HADOOP_BIN, and again I am not sure how they use it further along: I don't see it being used to form the Hadoop classpath or configuration anywhere. I am also not sure 'which hadoop' can be relied upon to deduce libraries and configuration reliably, since different distros tend to place binaries and configurations in different actual locations (such as /usr/bin/hadoop) which are linked into the actual Hadoop home layout somewhere else. That's why I asked you to clarify how you want to do it; I am still dubious that 'which' can be reliably used to deduce HADOOP_HOME.

        Roman Shaposhnik added a comment -

        There's no need to add $HADOOP_HOME/lib to the classpath as long as you're using hadoop itself as a launcher script. In fact, it is dangerous to assume that there's a $HADOOP_HOME/lib to begin with. This is definitely not the case with Hadoop 0.23 and might not be the case with Hadoop 1.X.

        Explicit setting of HADOOP_CONF_DIR needs to be dropped from the Pig script, and I would not recommend keeping it in Mahout. Once again: using the hadoop launcher script itself is the best way to add these things to the proper places in the CP. If you try to do it explicitly, you're essentially second-guessing the Hadoop implementation and your scripts become brittle.

        'which' is definitely quite reliable. In fact, if you look at Pig, Hive and HBase, they have all transitioned to using this technique to get the Hadoop the user wants. The only difference is the defaults (some projects prefer hadoop from the PATH to override anything else, some prefer it the other way). Here's another example of how HBase does it (scroll to line #219):
        http://svn.apache.org/viewvc/hbase/trunk/bin/hbase?revision=1296661&view=markup

        Once again: if you try to deduce the libraries yourself, your scripts become brittle. You can only rely on the Hadoop implementation to set these things up for you. This is an uncomfortable truth of at least 3 different incompatible Hadoop implementations being actively deployed in the wild.

        Does it make sense?

        Dmitriy Lyubimov added a comment - - edited

        I am not sure we use the hadoop executable to launch Mahout stuff. There are definitely cases where we don't use Hadoop at all (but the code generally does assume hadoop library availability on the classpath, otherwise some classes and strategies may break with ClassNotFoundException during load).

        I mean, we are not bound by having to launch any MR stuff.

        What is your suggestion to handle "hadoopless" cases and avoid ClassNotFound?

        Dmitriy Lyubimov added a comment -

        But you are right, it looks like we are not guessing the Hadoop libraries for Mahout local cases. I guess if what I've described really happens, it will fail.

        Dmitriy Lyubimov added a comment -

        I think it's a little bit disconcerting, as we have a lot of utility classes that may use Hadoop classes even locally (e.g. reading or writing local sequence files). Can you suggest a strategy that keeps Mahout local mode able to bind to the Hadoop I/O classes?

        Lance Norskog added a comment -

        Don't the job jars pack up Hadoop libs like all the other dependencies?

        tom pierce added a comment -

        It would make a lot of sense to me to unify with other Hadoop projects and start using the hadoop launcher script whenever we can - no need to maintain a custom reinvented wheel there.

        As far as the job jar is concerned, I don't think it's customary to include Hadoop libs - seems redundant, and potentially dangerous when there are mismatched versions of classes in the classpath. I know there are ways to specify/determine which versions of things to use and make this kind of thing work, but it has been my experience that it always becomes overcomplicated after a while - it is just bad juju.

        Looks like the Hadoop classes in examples/target/dependency/hadoop-core-0.20.204.0.jar are tacked onto the classpath inside bin/mahout, which is why standalone mode currently 'just works' on a box with no hadoop install. This seems less than ideal, though I can see why it is done.
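
To illustrate what Tom describes, here is a hypothetical sketch of how a launcher script can glob a bundled hadoop-core jar onto the classpath; the function name and the use of MAHOUT_HOME as the root are assumptions for illustration, not the actual bin/mahout code.

```shell
# Hypothetical sketch: append any bundled hadoop-core jar(s) found under
# MAHOUT_HOME to CLASSPATH, so standalone mode works on a box with no
# hadoop install. The directory layout mirrors the examples/target/dependency
# path mentioned above; the function name is illustrative.
append_bundled_hadoop() {
  local jar
  for jar in "${MAHOUT_HOME:-.}"/examples/target/dependency/hadoop-core-*.jar; do
    [ -e "$jar" ] || continue   # skip the literal pattern when nothing matches
    CLASSPATH="$CLASSPATH:$jar"
  done
  echo "$CLASSPATH"
}
```

The `[ -e "$jar" ]` guard matters: with no matching jar, the glob stays unexpanded and would otherwise put a literal `hadoop-core-*.jar` pattern on the classpath.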

        Roman Shaposhnik added a comment -

        @Dmitriy

        I am not sure we use hadoop executable to launch Mahout stuff.

        Seems like you do:

            (*)
              echo "MAHOUT-JOB: $MAHOUT_JOB"
              export HADOOP_CLASSPATH=$MAHOUT_CONF_DIR:${HADOOP_CLASSPATH}
              exec "$HADOOP_HOME/bin/hadoop" --config $HADOOP_CONF_DIR jar $MAHOUT_JOB $CLASS "$@"
        

        I mean, we are not bound by having to launch any MR stuff.

        Understood. Pig also has a local mode where it doesn't need any access to Hadoop at all.

        I think it's a little bit disconcerting as we have a lot of utility classes that may be using hadoop classes even locally (e.g. reading or writing local sequence files). Can you suggest a strategy for Mahout local mode and still be able to bind to hadoop I/O classes?

        I believe there are two issues at play here:

        1. what to do for the local mode
        2. what to do for the MapReduce mode

        For both cases you'll have to make sure that you segregate the bundled hadoop
        dependencies into a separate location in your installation tree so that
        you can easily add/remove them from your CLASSPATH. Let's call it $HADOOP_DEPS_DIR
        (it must be different from your usual lib subdir).

        1. Local mode:

        The very fact that you're running in local mode will be governed by $MAHOUT_LOCAL
        (regardless of whether HADOOP_HOME is empty or not).

        When $MAHOUT_LOCAL is set, you'll try to see whether you can run 'hadoop classpath'
        (see how HADOOP_HOME/HADOOP_PREFIX factors into this below). If you can, you'll add
        the value it returns to the CLASSPATH. If it is not available, you'll add $HADOOP_DEPS_DIR
        to the classpath.
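
A minimal sketch of the local-mode logic just described, assuming the $HADOOP_DEPS_DIR convention from above; the function name is an illustration, not the final patch.

```shell
# Sketch of the proposed local-mode resolution: prefer 'hadoop classpath'
# when a working hadoop executable is available, otherwise fall back to the
# bundled dependencies in HADOOP_DEPS_DIR. Function name is illustrative.
resolve_local_classpath() {
  local deps_dir="$1" hadoop_bin
  hadoop_bin=$(command -v hadoop 2>/dev/null)
  if [ -n "$hadoop_bin" ] && "$hadoop_bin" classpath >/dev/null 2>&1; then
    "$hadoop_bin" classpath        # let hadoop report its own client classpath
  else
    echo "$deps_dir/*"             # bundled jars for a hadoop-less box
  fi
}
```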

        2. MapReduce mode

        For the MapReduce mode, I would argue that the best option is to follow suit with Pig/Hive/HBase, etc. and rely on HADOOP_HOME only as a fallback value if hadoop is not resolved from the PATH by default. E.g.

        HADOOP_LAUNCHER=$(PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null)
        

        Then you'll need to replace all the explicit calls to $HADOOP_HOME/bin/... with $HADOOP_LAUNCHER. I will also strongly recommend against guessing the value of HADOOP_CONF_DIR, since I don't think it buys you anything at all (the hadoop launcher script, regardless of where it is coming from, honors HADOOP_CONF_DIR as much as it does --config).

        That will take care of the MapReduce case.
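
Putting the pieces together, a hedged sketch of how the launch could look after the change; only the PATH-override one-liner comes from the comment above, while the function names and error handling are assumptions. Note that the one-liner prepends the HADOOP_HOME (or HADOOP_PREFIX) bin directory, so an explicitly set HADOOP_HOME still wins over the PATH.

```shell
# Resolve the hadoop launcher: prepend HADOOP_HOME (or HADOOP_PREFIX) bin to
# PATH for the lookup, then take whatever 'which' finds.
resolve_hadoop_launcher() {
  PATH="${HADOOP_HOME:-${HADOOP_PREFIX}}/bin:$PATH" which hadoop 2>/dev/null
}

# Launch a Mahout job via the resolved launcher instead of a hard-coded
# "$HADOOP_HOME/bin/hadoop"; MAHOUT_JOB, CLASS and MAHOUT_CONF_DIR stand in
# for the existing bin/mahout variables.
launch_mahout_job() {
  local launcher
  launcher=$(resolve_hadoop_launcher)
  if [ -z "$launcher" ]; then
    echo "hadoop not found on PATH or under HADOOP_HOME/HADOOP_PREFIX" >&2
    return 1
  fi
  export HADOOP_CLASSPATH="$MAHOUT_CONF_DIR:$HADOOP_CLASSPATH"
  exec "$launcher" jar "$MAHOUT_JOB" "$CLASS" "$@"
}
```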

        If you agree with the proposal I can supply a patch shortly.

        Dmitriy Lyubimov added a comment -

        Sounds good to me.

        The only concern I have is to verify that the local stuff is not broken when run from the command line. I don't think this is verified by the build's unit tests. The history of local mode in Mahout is a little bit out of my scope; I will need another person to ascertain the change for the local mode.

        But what you are saying sounds very benign to me. Thank you for doing this.
        When you submit a patch, do you think you could also create a review request for Mahout? Hopefully this will encourage another pair of eyes to look at it. If nobody responds, let me know and I will try to locate another pair of eyes for it.

        Again, thank you for doing this; Hadoop best practices may sometimes be hard to figure out.

        Dmitriy Lyubimov added a comment -

        PS: by review request I mean one at reviews.apache.org.

        Dmitriy Lyubimov added a comment -

        @Tom,

        perhaps you could be such a second reviewer on this issue? Thanks.

        I agree attaching hadoop libraries to the job file is dangerous. Hadoop should be a compile-time dependency only; the runtime scope should be "provided".

        tom pierce added a comment -

        You bet - happy to review.

        Roman Shaposhnik added a comment -

        Sorry, it took me a bit longer than anticipated.

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4632/
        -----------------------------------------------------------

        Review request for mahout and tom pierce.

        Summary
        -------

        Patch according to the JIRA discussion.

        This addresses bug MAHOUT-994.
        https://issues.apache.org/jira/browse/MAHOUT-994

        Diffs


        trunk/bin/mahout 1309189
        trunk/distribution/src/main/assembly/bin.xml 1309189

        Diff: https://reviews.apache.org/r/4632/diff

        Testing
        -------

        build and execution in various modes

        Thanks,

        Roman

        jiraposter@reviews.apache.org added a comment -

        -----------------------------------------------------------
        This is an automatically generated e-mail. To reply, visit:
        https://reviews.apache.org/r/4632/#review7247
        -----------------------------------------------------------

        Looks good aside from the one issue below.

        trunk/bin/mahout
        <https://reviews.apache.org/r/4632/#comment16017>

        Think we need to omit HADOOP_BINARY_CLASSPATH here; otherwise I get an error due to my cluster and Mahout HDFS libs having different API versions when I run with MAHOUT_LOCAL set.

        • tom

        On 2012-04-04 00:26:36, Roman Shaposhnik wrote:

        -----------------------------------------------------------

        This is an automatically generated e-mail. To reply, visit:

        https://reviews.apache.org/r/4632/

        -----------------------------------------------------------

        (Updated 2012-04-04 00:26:36)

        Review request for mahout and tom pierce.

        Summary

        -------

        Patch according to the JIRA discussion.

        This addresses bug MAHOUT-994.

        https://issues.apache.org/jira/browse/MAHOUT-994

        Diffs

        -----

        trunk/bin/mahout 1309189

        trunk/distribution/src/main/assembly/bin.xml 1309189

        Diff: https://reviews.apache.org/r/4632/diff

        Testing

        -------

        build and execution in various modes

        Thanks,

        Roman

        tom pierce added a comment -

        Anyone object to the most recent patch? It's a tiny edit off Roman's submission (as described on Reviewboard). I will plan to commit early next week if nobody pipes up.

        Roman Shaposhnik added a comment -

        Tom, thanks a million for taking care of this! I'd say the patch looks good (although I don't have a binding +1 mojo).

        Dmitriy Lyubimov added a comment -

        Not having tested it, it looks good. One question, though: it looks like the hadoop dependency is included in the assembly? Is it used only in local mode, or does it also make its way onto the classpath in distributed mode?

        I am just worried if it may interfere with running Mahout with a third-party hadoop distribution.

        Roman Shaposhnik added a comment -

        Dmitriy, the bundled hadoop dependencies are only supposed to be used in local mode.

        Dmitriy Lyubimov added a comment -

        Sounds good then.

        Hudson added a comment -

        Integrated in Mahout-Quality #1456 (See https://builds.apache.org/job/Mahout-Quality/1456/)
        MAHOUT-994 - bin/mahout should not rely on HADOOP_HOME (Revision 1333154)

        Result = SUCCESS
        tcp : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1333154
        Files :

        • /mahout/trunk/bin/mahout
        • /mahout/trunk/distribution/src/main/assembly/bin.xml

          People

          • Assignee:
            tom pierce
            Reporter:
            Roman Shaposhnik
          • Votes: 0
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
