Mahout / MAHOUT-301

Improve command-line shell script by allowing default properties files

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.3
    • Component/s: Integration
    • Labels:
      None

      Description

      Snippet from javadoc gives the idea:

      /**
       * General-purpose driver class for Mahout programs.  Utilizes org.apache.hadoop.util.ProgramDriver to run
       * main methods of other classes, but first loads up default properties from a properties file.
       *
       * Usage: run on Hadoop like so:
       *
       * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver [classes.props file] shortJobName \
       *   [default.props file for this class] [over-ride options, all specified in long form: --input, --jarFile, etc]
       *
       * TODO: set the Main-Class to just be MahoutDriver, so that this option isn't needed?
       *
       * (note: using the current shell script, this could be modified to be just
       * $MAHOUT_HOME/bin/mahout [classes.props file] shortJobName [default.props file] [over-ride options]
       * )
       *
       * Works like this: by default, the file "core/src/main/resources/driver.classes.prop" is loaded, which
       * defines a mapping between short names like "VectorDumper" and fully qualified class names.  This file may
       * instead be overridden on the command line by having the first argument be some string of the form *classes.props.
       *
       * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
       * driver.classes.props file).  After this, if the next argument ends in ".props" / ".properties", it is taken to
       * be the file to use as the default properties file for this execution, and key-value pairs are built up from that:
       * if the file contains
       *
       * input=/path/to/my/input
       * output=/path/to/my/output
       *
       * Then the class which will be run will have its main called with
       *
       *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
       *
       * After all the "default" properties are loaded from the file, any further command-line arguments are taken in,
       * and over-ride the defaults.
       */
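The defaults-file behavior described in the javadoc above can be sketched roughly as follows. This is illustrative Java, not the MAHOUT-301 patch itself; the class and method names are made up:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Illustrative sketch: each key=value pair in the default .props file becomes
// a "--key value" argument pair, which is what the class's main() receives.
public class DefaultPropsSketch {

  static String[] toArgs(Properties props) {
    List<String> argList = new ArrayList<String>();
    // Properties iteration order is unspecified, so sort for a stable result.
    for (String key : new TreeSet<String>(props.stringPropertyNames())) {
      argList.add("--" + key);
      argList.add(props.getProperty(key));
    }
    return argList.toArray(new String[0]);
  }

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.load(new StringReader("input=/path/to/my/input\noutput=/path/to/my/output\n"));
    String joined = String.join(" ", toArgs(props));
    if (!joined.equals("--input /path/to/my/input --output /path/to/my/output")) {
      throw new AssertionError(joined);
    }
    System.out.println("ok");
  }
}
```

Command-line arguments given after the props file would then simply be appended, overriding these defaults.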
      

      Could be cleaned up, as it's kinda ugly with the whole "file named in .props" business, but it gives the idea. It really helps cut down on repetitive long command lines, and lets defaults be put in props files instead of being locked into the code.

      1. MAHOUT-301.patch
        6 kB
        Jake Mannix
      2. MAHOUT-301.patch
        9 kB
        Jake Mannix
      3. MAHOUT-301.patch
        9 kB
        Jake Mannix
      4. MAHOUT-301-drew.patch
        11 kB
        Drew Farris
      5. MAHOUT-301.patch
        13 kB
        Jake Mannix
      6. MAHOUT-301.patch
        14 kB
        Jake Mannix
      7. MAHOUT-301-drew.patch
        14 kB
        Drew Farris
      8. MAHOUT-301.patch
        20 kB
        Jake Mannix

        Issue Links

        There are no Sub-Tasks for this issue.

          Activity

          Drew Farris added a comment -

          This is pretty nice. It gets to the point where relying on shell history or ad-hoc mechanisms to manage command lines kills me, and this is a nice solution.

          I've quickly skimmed the patch but I haven't tried it out. I see the TODO in there regarding short vs. long arguments. Do you have any thoughts on how to support single-dash arguments? Perhaps the arguments supported by GenericOptionsParser could be set in the properties file too.

          Jake Mannix added a comment -

          The TODO refers to an issue that I think exists, but am not sure about: what does GenericOptionsParser do if you have a command line input like this:

          programName --input foo.txt -i bar.txt

          where --input is the long argument name for -i as short name? Which one wins? Is it deterministic?

          Ted Dunning added a comment -

          This also helps non-command-line usage, actually. I can imagine a workflow solution where setting all parameters on every step gets onerous.

          Drew Farris added a comment -
          > What does GenericOptionsParser do if you have a command line input like this:
          >
          > programName --input foo.txt -i bar.txt
          >
          > where --input is the long argument name for -i as short name? Which one wins? Is it deterministic?

          In most cases it really depends on the implementation; sometimes GenericOptionsParser isn't even being used. In Mahout's case it's likely to be commons-cli2 that's actually doing the parsing, and I don't know how it would behave in this case. I'll take a look.

          GenericOptionsParser simply handles things like -conf and -Dprop=value that control Hadoop configurations, job settings and the like, and then hands the rest back to the caller. In many cases in Mahout, GenericOptionsParser isn't used at all, which reduces the control one has over a job's behavior. IIRC, Sean and Robin have made some progress towards eliminating these cases with the AbstractJob class.

          Jake Mannix added a comment -

          So this current patch will totally take -conf / -Dprop=value type stuff and pass it directly on into the program in the usual way; the only difference is that these arguments could also live in a properties file, as long as they use the exact same form, which would make for ugly props files as is:

          if you wanted to not have to type:

          $MAHOUT_HOME/bin/mahout myClassShortName -DmyProp=value

          You would currently need to have, in your props file:

          DmyProp = value

          which looks kinda silly, but would work. Oh wait, no it wouldn't: it would end up with a command line doing "-DmyProp value", not "-DmyProp=value". To get the latter, we'd need an even uglier thing with the current patch:

          "DmyProp=value"=

          which would get interpolated into -DmyProp=value on the internal command line. Super ugly.

          I've got a modified version of this I can upload in a bit which takes care of the short-name/long-name arguments thing by a bit of a kludge, with props files which would look like this:

          i | input = foo/path

          which is to be interpreted as: if on the command line the user says "-i bar/path" OR "--input baz/path", they override the "foo/path" in the props file. If the line in the props file has no "|" separating two options, the single name given is assumed to serve as both the short and long form.

          Still doesn't remove the ugliness of -Dprop=value though. Not sure how best to handle that one. What kind of props file syntax would tell it "take these key-value pairs and do '-key value', and these other ones as '-Dkey=value'"? I guess just having the 'D' there would be a good signal? It could then just take

          i | input = foo/path
          DmyProp = propValue

          and translate that into a command line like: progName -i foo/path -DmyProp=propValue

          That would work and be not completely horribly ugly. Not great though.
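The convention being floated here could look like the sketch below (hypothetical; the helper and its names are illustrative, not committed code): a props key starting with 'D' becomes a single "-Dkey=value" token, while a "short | long" key becomes a "-short value" pair. Note this assumes a custom line parser rather than java.util.Properties, since spaces and '|' inside keys would not survive Properties.load.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the proposed props-line convention from the discussion above.
public class PropLineSketch {

  // key is either "DsomeProp", "short | long", or a single name.
  static List<String> lineToArgs(String key, String value) {
    key = key.trim();
    value = value.trim();
    if (key.startsWith("D")) {
      // The 'D' signal discussed above; a kludge, since a real option name
      // that happens to start with 'D' would be misread as a system property.
      return Arrays.asList("-" + key + "=" + value);
    }
    int bar = key.indexOf('|');
    String shortName = (bar >= 0) ? key.substring(0, bar).trim() : key;
    return Arrays.asList("-" + shortName, value);
  }

  public static void main(String[] args) {
    if (!lineToArgs("i | input", "foo/path").equals(Arrays.asList("-i", "foo/path"))
        || !lineToArgs("DmyProp", "propValue").equals(Arrays.asList("-DmyProp=propValue"))) {
      throw new AssertionError("unexpected translation");
    }
    System.out.println("ok");
  }
}
```

So "i | input = foo/path" and "DmyProp = propValue" would come out as "-i foo/path -DmyProp=propValue", matching the command line sketched above.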

          Robin Anil added a comment -

          Looks great. In parallel we need to convert all main classes to extend AbstractJob and clean up the stuff there at MAHOUT-294

          Jake Mannix added a comment -

          Better version. Javadocs updated in the patch to reflect the way it works:

          /**
           * General-purpose driver class for Mahout programs.  Utilizes org.apache.hadoop.util.ProgramDriver to run
           * main methods of other classes, but first loads up default properties from a properties file.
           *
           * Usage: run on Hadoop like so:
           *
           * $HADOOP_HOME/bin/hadoop -jar path/to/job org.apache.mahout.driver.MahoutDriver \
           *   [--classesFile|-cf <file>] [--defaultsFile|-df <file>] shortJobName [over-ride opts]
           *
           * or for local running:
           *
           * $MAHOUT_HOME/bin/mahout run [--classesFile|-cf <file>] [--defaultsFile|-df <file>] shortJobName [over-ride opts]
           *
           * Works like this: by default, the file "core/src/main/resources/driver.classes.props" is loaded, which
           * defines a mapping between short names like "VectorDumper" and fully qualified class names.  This file may
           * instead be overridden on the command line by specifying --classesFile|-cf <classesFile>.
           *
           * The default properties to be applied to the program run are pulled, by default, from
           * "core/src/main/resources/<shortJobName>.props", unless --defaultsFile|-df <file> is specified on the command line.
           * The format of the default properties files is as follows:
           *
           * i|input = /path/to/my/input
           * o|output = /path/to/my/output
           * m|jarFile = /path/to/jarFile
           * # etc - each line is shortArg|longArg = value
           *
           * The next argument to the Driver is supposed to be the short name of the class to be run (as defined in the
           * driver.classes.props file).
           *
           * Then the class which will be run will have its main called with
           *
           *   main(new String[] { "--input", "/path/to/my/input", "--output", "/path/to/my/output" });
           *
           * After all the "default" properties are loaded from the file, any further command-line arguments are taken in,
           * and over-ride the defaults.
           *
           * So if your core/src/main/resources/driver.classes.props looks like so:
           *
           * org.apache.mahout.utils.vectors.VectorDumper = "vecDump"
           *
           * and you have a file core/src/main/resources/vecDump.props which looks like
           *
           * o|output = /tmp/vectorOut
           * s|seqFile = /my/vector/sequenceFile
           *
           * And you execute the command-line:
           *
           * $MAHOUT_HOME/bin/mahout run vecDump -s /my/otherVector/sequenceFile
           *
           * Then org.apache.mahout.utils.vectors.VectorDumper.main() will be called with arguments:
           *   {"--output", "/tmp/vectorOut", "-s", "/my/otherVector/sequenceFile"}
           */
          
          Jake Mannix added a comment -

          This patch modifies the mahout shell script to add the "run" command, which invokes this driver class.

          It also more nicely takes shortName definitions from either core/src/main/resources/driver.classes.props or the "-cf configFile" location, and runs the class specified by shortName using props specified in core/src/main/resources/shortName.props or whatever is "-df defaultpropsFile".

          Also takes options in the file of the form "DsomeOpt = optionVal" and passes those into the program as "-DsomeOpt=optionVal" as well.

          Not sure how well it works on Hadoop yet, but the command line seems to work for the one class I've got a props file for (TestClassifier).

          Jake Mannix added a comment -

          Fancy new version. Run as follows:

          Set your $MAHOUT_CONF_DIR to a directory where you will have your own overrides (or, if unset, defaults to ./core/src/main/resources).

          In that directory, there should be a file called "driver.classes.props" with contents like so:

          org.apache.mahout.utils.vectors.VectorDumper="vecDump"
          org.apache.mahout.utils.clustering.ClusterDumper="clusty"
          org.apache.mahout.utils.SequenceFileDumper="seqDump"
          org.apache.mahout.clustering.kmeans.KMeansDriver="kmeans"
          org.apache.mahout.clustering.canopy.CanopyDriver="canopy"
          org.apache.mahout.utils.vectors.lucene.Driver="luceneVecs"
          org.apache.mahout.text.SequenceFilesFromDirectory="dirToSeq"
          org.apache.mahout.text.WikipediaToSequenceFile="wikToSeq"
          org.apache.mahout.classifier.bayes.TestClassifier="TestClassifier"
          

          Etc. The right hand side can be whatever you want, but whatever it is determines where MahoutDriver will look for a default properties file. For example:

          $MAHOUT_HOME/bin/mahout run wikToSeq
          

          would look for the file $MAHOUT_CONF_DIR/wikToSeq.props and in that file, take each line and transform it into command line arguments for WikipediaToSequenceFile, using the logic as follows:

          on each line of wikToSeq.props, there is a key-value pair:

          i | input = my/wiki/input/path
          o | output = my/output/path
          c | categories = my/wikiCategories/file
          e | exactMatch = true
          all = true
          

          The part of the key before the vertical bar is the short-name of the argument to pass, and the second part is the long name. If there is only one, they are assumed to be the same.

          You can also pass Hadoop options here, like

          Djava.io.tmpdir = /var/tmp/mahout 
          

          which would lead to the program being called with "-Djava.io.tmpdir=/var/tmp/mahout" passed in.
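Putting this description together, the line-to-argument logic (pipe-separated short/long names, with command-line overrides winning over props-file defaults) might be sketched like so. This is assumed behavior, not the actual MahoutDriver code; it normalizes everything to long-form arguments for simplicity and leaves the Hadoop-style D-keys out for brevity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: parse "short|long = value" default lines, then let command-line
// pairs like "-s value" or "--seqFile value" override the defaults.
public class DriverArgsSketch {

  static String[] buildArgs(List<String> propsLines, String... cliArgs) {
    Map<String, String> longToValue = new LinkedHashMap<String, String>();
    Map<String, String> shortToLong = new HashMap<String, String>();
    for (String line : propsLines) {
      line = line.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      int eq = line.indexOf('=');
      String key = line.substring(0, eq).trim();
      String value = line.substring(eq + 1).trim();
      int bar = key.indexOf('|');
      String shortName = (bar >= 0) ? key.substring(0, bar).trim() : key;
      String longName = (bar >= 0) ? key.substring(bar + 1).trim() : key;
      shortToLong.put(shortName, longName);
      longToValue.put(longName, value);
    }
    // Command-line overrides: strip leading dashes, resolve short -> long.
    for (int i = 0; i + 1 < cliArgs.length; i += 2) {
      String name = cliArgs[i].replaceFirst("^--?", "");
      longToValue.put(shortToLong.getOrDefault(name, name), cliArgs[i + 1]);
    }
    List<String> args = new ArrayList<String>();
    for (Map.Entry<String, String> e : longToValue.entrySet()) {
      args.add("--" + e.getKey());
      args.add(e.getValue());
    }
    return args.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // Defaults taken from the vecDump example earlier in the thread.
    List<String> props = Arrays.asList(
        "o|output = /tmp/vectorOut",
        "s|seqFile = /my/vector/sequenceFile");
    String joined = String.join(" ", buildArgs(props, "-s", "/my/otherVector/sequenceFile"));
    if (!joined.equals("--output /tmp/vectorOut --seqFile /my/otherVector/sequenceFile")) {
      throw new AssertionError(joined);
    }
    System.out.println("ok");
  }
}
```

The real driver would read the lines from $MAHOUT_CONF_DIR/<shortName>.props rather than an in-memory list.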

          Jake Mannix added a comment -

          Oh, I forgot to finish my sentence which began "run as follows..."

          Once you've got default property files in your $MAHOUT_CONF_DIR, you can run like so:

          $MAHOUT_HOME/bin/mahout run wikToSeq
          

          and that's it. If you want to override the options in your wikToSeq.props file, just pass them in on that same command line above, and they override as desired.

          If this can be tested out and debugged, this patch is ready for committing, and significantly improves the command line experience.

          Robin Anil added a comment -

          The help comments are missing from the bin/mahout script. Scroll up that file and you will see a pretty-printed help string. Just add the MahoutDriver description and possibly a wiki link there. Otherwise looks good to commit. I haven't checked the full functionality yet; if anyone else wants to take a look, please do so quickly.

          Drew Farris added a comment -

          Did some testing, here's a patch to clean some of these things up + a couple questions:

          Could we load the default driver.classes.props from the classpath? If it was loaded that way the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch)

          Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:

          ./mahout vectordump
          Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
          Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException
          

          (fixed in patch)

          Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release – this is the way it should work, right?

          Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch)

          Also added a help message for the 'run' argument.

          Does executing './mahout run --help' hang for anyone else or is it something specific to my environment? (didn't track this one down)

          Robin Anil added a comment -

          Including the job jar is much cleaner than adding all the deps. Plus there is nothing more to configure to execute it on top of Hadoop.

          BTW, how is Hadoop execution done using the shell script? i.e.

          hadoop jar mahout-examples-0.3.job o.a.m...DictionaryVectorizer --input ..... args

          Drew Farris added a comment -

          > including the job jar is much cleaner than adding all deps. Plus there is nothing more to configure to execute it on top of hadoop..

          The job files work fine with 'hadoop jar', but putting the job files on the classpath will not automatically include the dependencies they contain (e.g. commons-cli2) on the classpath: the dependencies need to be added separately (see the ClassNotFoundException case described above).

          > BTW. How is hadoop execution done using shell script?

          If the HADOOP_CONF_DIR is set, it should be picked up by the jobs, but I don't think that means jar/jobfile execution works properly. I suspect this needs modifications to make that possible.

          Drew Farris added a comment -

          > BTW. How is hadoop execution done using shell script? i.e.

          It looks like something like the following would do the trick

          /bin/mahout -core org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
          

          we could probably provide a 'runjob' case that appends 'org.apache.hadoop.util.RunJar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver', but perhaps this could be used in every case that 'run' is called?

          Jake Mannix added a comment -

          Hey Drew, thanks for looking at this. Problems you saw are probably what are known as "bugs".

          > Did some testing, here's a patch to clean some of these things up + a couple questions:
          > Could we load the default driver.classes.props from the classpath? If it was loaded that way the default would work regardless of where the mahout script is run from (it currently only works if ./bin/mahout is run, not ./mahout for example) and regardless of whether we're running from a binary release or the dev environment. (included in patch)

          YES! We should indeed load from classpath. My most recent version of this patch (which isn't posted, because it conflicts with yours, I'm trying to resolve that now) changes it so that you just supply a single directory in which driver.classes.props and the shortNames.props files are located.

          > Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breaks the binary release in that it can't run anything, e.g.:
          >
          > ./mahout vectordump
          > Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/cli2/OptionException
          > Caused by: java.lang.ClassNotFoundException: org.apache.commons.cli2.OptionException
          >
          > (fixed in patch)

          
          This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself?

          Using -core in the context of a dev build should work properly, but leaving out -core will cause the script to error unless run in the context of a release - this is the way it should work, right?

          
          What is the -core option for?  I've never used it, how does it work?
          
          

          Also added a help message for the 'run' argument.

          
          Where did you add that?
          
          

          Does executing './mahout run --help' hang for anyone else or is it something specific to my environment? (didn't track this one down)

          
          

          The --help option I didn't have in there, you added it, do you know where it's hanging?

          Jake Mannix added a comment -

          Ok, Drew, got your patch in diff mode against mine finally.

          So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable "MAHOUT_CONF_DIR" which is used for all these props files), we could just have the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory) and then it would work that way.

          New patch merging yours with mine forthcoming.

          Drew Farris added a comment -

          This wasn't a problem with my patch, right? That was an issue of the mahout script in trunk itself?

          Yes, it was a problem with the script in trunk. I believe this was due to the fact that the job files were on the classpath instead of all of the dependency jars. Adding the job files to the classpath does not add the dependency jars they contain to the classpath as well. So, no, you didn't add this, but it should be fixed (and is in the patch).
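          A hedged sketch of that classpath fix (the directory layout below is a stand-in, following the hadoop/nutch launcher pattern): each jar under lib/ is appended to CLASSPATH individually, since jars nested inside a .job file are not visible to the JVM classloader.

```shell
# Sketch only: build CLASSPATH from $MAHOUT_HOME/lib/*.jar, the way the
# hadoop and nutch launch scripts do. The layout here is a temp-dir stand-in.
MAHOUT_HOME="$(mktemp -d)"
mkdir -p "$MAHOUT_HOME/lib"
touch "$MAHOUT_HOME/lib/commons-cli-2.0.jar" "$MAHOUT_HOME/lib/uncommons-maths-1.2.jar"

CLASSPATH="$MAHOUT_HOME/conf"
for f in "$MAHOUT_HOME"/lib/*.jar; do
  CLASSPATH="$CLASSPATH:$f"   # each jar listed explicitly; nested jars in a .job are not expanded
done
echo "$CLASSPATH"
```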

          What is the -core option for? I've never used it, how does it work?

          When you're running bin/mahout in the context of a build, the -core option tells it to use the build classpath instead of the classpath used for a binary release. This just follows the pattern established (by Doug?) in the hadoop and nutch launch scripts.

          Also added a help message for the 'run' argument.

          near line 72 in bin/mahout:
          (this is different from the --help question I had)

            echo "  seq2sparse            generate sparse vectors from a sequence file"
            echo "  vectordump            dump vectors from a sequence file"
            echo "  run                   run mahout tasks using the MahoutDriver, see: http://cwiki.apache.org/MAHOUT/mahoutdriver.html"
          

          So you already added the ability to load via classpath, right? If we merge that way of thinking with what I'm currently working on (having a configurable "MAHOUT_CONF_DIR" which is used for all these props files), we could just have the mahout shell script just add MAHOUT_CONF_DIR to the classpath (the way you already have it adding the hardwired core/src/main/resources directory) and then it would work that way.

          Yep, that should do it, as long as MAHOUT_CONF_DIR appears before src/main/resources, we should be good to go. It should be added outside of the section of the script that determines if -core has been specified on the command-line.
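          The ordering Drew describes might look like this sketch (paths are examples, not the real script):

```shell
# Sketch: build the classpath with the conf dir first so its props files
# shadow the copies under core/src/main/resources. Paths are placeholders.
MAHOUT_HOME="/opt/mahout"                # example install location
MAHOUT_CONF_DIR="$MAHOUT_HOME/conf"      # the real script would honor an env override
CLASSPATH="$MAHOUT_CONF_DIR:$MAHOUT_HOME/core/src/main/resources"
echo "$CLASSPATH"
```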

          Jake Mannix added a comment -

          Something else I noticed is that the 'mahout' script doesn't add the classes in $MAHOUT_HOME/lib/*.jar to the classpath. This breakes the binary release in that it can't run anything, e.g:

          Also wondering what the purpose of adding the job jars to the classpath is? (removed in patch)

          When I run locally now, not using -core, I get this failure:

          ./bin/mahout vectordump -s wiki-sparse-vectors-out/vectors/part-00000
          Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/utils/vectors/VectorDumper
          

          This appears to be because your patch has CLASSPATH set to add on things like $MAHOUT_HOME/mahout-*.jar, which doesn't exist after I've done "mvn install". Is there another maven target I need to use to generate the release jars in $MAHOUT_HOME?

          Drew Farris added a comment -

          Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release.

          The binary release, built using mvn -Prelease, lands in target/mahout-0.3-SNAPSHOT.tar.gz; untar that and try running bin/mahout from the directory that's created, and that should work fine without -core.

          Jake Mannix added a comment -

          Jake, the basic idea is that you would always use -core when executing from within a build, but you would not use -core when executing in the context of a binary release.

          Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use.

          So far, I've always run by the process of

          • make code/config changes
          • run mvn clean install (sometimes with -DskipTests if I'm doing rapid iterations)
          • run "mahout <comand> args" OR
          • hadoop jar examples/target/mahout-examples- {version}

            .job <classname> args

          The last step, as you've noted, is because I'm not sure that the script properly lets HADOOP_CONF_DIR get passed through the mahout shell script when actually running on the hadoop cluster, but maybe that's just a config issue in my case? It also means that the default properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath.

          Maybe a kludgey way to do it would be for the script to grab the properties files from the MAHOUT_CONF_DIR, unzip the release job jar, push them into it, and re-jar it back up and then give it to hadoop, and now those files will be available on the classpath of the running job on the remote cluster?

          What is the right way to run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to "hadoop" that I'm forgetting?

          Drew Farris added a comment -

          Hmm... ok. I'm a little reticent about running -core when testing, because I'm not really testing what the release run will be like - I like the idea of having a single set of dependencies (jars, not classes directories) which are used locally, and the .job when hitting a remote hadoop cluster. Maybe I'm just not familiar with the -core option and its use.

          Ahh, I see where you're coming from, so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these, they won't be present in the release build anyway.

          The last step, as you've noted, is because I'm not sure that the script actually properly lets HADOOP_CONF_DIR properly get passed through the mahout shell script to actually running on the hadoop cluster, but maybe that's just a config issue in my case? Also means that in fact the default properties idea still doesn't work on hadoop, unless the default properties files are pushed to the classpath.

          Are any of the default properties files used beyond the MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like:

          ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
          

          I suspect we could easily modify the mahout script and shorten this to:

          ./bin/mahout runjob TestClassifier
          

          I can look at this a little closer tonight, so if you have an updated patch for me to work on/test in a few hours, definitely post it. I'd be happy to make any changes you're interested in.

          What is the right way run a job with some additional (runtime) files added to the job's classpath? Is there some cmdline arg to "hadoop" that I'm forgetting?

          FWIW, [GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html] provides a way to do this with -files, -libjars and -archives

          Jake Mannix added a comment -

          Ahh, I see where you're coming from, so without -core, you're suggesting that mahout pick up the jar files in the target directories if they exist? I think it is fine to modify the non-core classpath to include these, they won't be present in the release build anyway.

          Cool, yeah, that makes sense.

          Are any of the default properties files used beyond the MahoutDriver, which executes locally and sets up the job? Do these files need to be distributed to the rest of the cluster? As noted above, I think the proper way to run MahoutDriver in the context of a distributed job is to do something like:

          ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
          

          I suspect we could easilly modify the mahout script and shorten this to:

          ./bin/mahout runjob TestClassifier
          

          Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do "runjob" as described, if it's not, do "run" to do locally.

          FWIW, [GenericOptionsParser|http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/GenericOptionsParser.html] provides a way to do this with -files, -libjars and -archives

          Now of course, I guess I don't really need the files to get onto the job's classpath on the cluster - it just needs to be on the classpath of the locally running jvm which is invoking MahoutDriver.main(). So I was doing more work than was necessary. This is easy to do, just add MAHOUT_CONF_DIR to the classpath and we're good to go.

          Drew Farris added a comment -

          Cool, so why not just check to see if $HADOOP_CONF_DIR is set - if it is, do "runjob" as described, if it's not, do "run" to do locally.

          Yes, ok – that should work because I believe you can use RunJar to launch anything even if it isn't a mapreduce job, no need for classpath setup in this case either – all you need to do is point to the examples job. Might be able to take advantage of this elsewhere.

          Drew Farris added a comment -

          It doesn't appear that the following command works as intended:

          ./bin/mahout org.apache.hadoop.util.RunJar /path/to/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
          

          The following seems to be the appropriate way to achieve what we're trying to do here:

          hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.driver.MahoutDriver TestClassifier
          

          Any thoughts on whether it makes sense to attempt to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that.

          Jake Mannix added a comment -

          Ok, new patch.

          This one works in one of two ways. If you have $MAHOUT_CONF_DIR defined (there are some dummy files living in the newly created directory "conf" at the top level, moving away from core/src/main/resources), then you can just run:

          $MAHOUT_HOME/bin/mahout run svd
          

          and it should read your properties in $MAHOUT_CONF_DIR/svd.props and run (locally).

          The other way it can work (and actually does, at least on my setup) is running on hadoop:

          $HADOOP_HOME/bin/hadoop jar path/to/mahout.job org.apache.mahout.driver.MahoutDriver svd 
          

          And again, $MAHOUT_CONF_DIR/svd.props is read locally before being launched off to the hadoop cluster.

          I have not yet been able to get the shell script to automagically issue RunJar as the command, passing MahoutDriver and the remaining args after it, so that you would never need to run hadoop's shell script at all, although that would be great to have working.

          Also not yet in this patch: actually defaulting MAHOUT_CONF_DIR to the correct place in both dev mode and release mode; and I haven't modified the pom to package up the new conf dir and put it in the distribution.

          Jake Mannix added a comment -

          Our comments crossed in the ether!

          Any thoughts on whether it makes sense to attempt to work the latter form into the mahout script? It won't pull the necessary config files for MahoutDriver in from a path outside of the job file unless HADOOP_CLASSPATH is set to include those directories, but I haven't had a chance to verify that.

          You're right - I did indeed set my HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR, which allowed this to work, otherwise it would not. This should be done by the script.

          Ideally, yes - it's ugly, but if $MAHOUT_HOME/bin/mahout just sets $HADOOP_CLASSPATH to include $MAHOUT_CONF_DIR (or $MAHOUT_HOME/conf if that variable is not set) and then executes $HADOOP_HOME/bin/hadoop jar ..., it should work.
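          A minimal sketch of that hand-off (variable values are placeholders; the final hadoop invocation is commented out since it needs a real install):

```shell
# Sketch: bin/mahout prepends the conf dir to HADOOP_CLASSPATH, then hands
# off to hadoop. Values below are placeholders, not the real script.
MAHOUT_HOME="/opt/mahout"
MAHOUT_CONF_DIR="$MAHOUT_HOME/conf"
export HADOOP_CLASSPATH="$MAHOUT_CONF_DIR${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
echo "$HADOOP_CLASSPATH"
# exec "$HADOOP_HOME/bin/hadoop" jar "$MAHOUT_JOB" \
#   org.apache.mahout.driver.MahoutDriver "$@"   # commented out: needs a hadoop install
```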

          Jake Mannix added a comment -

          Ok, now we're getting somewhere. This one has the ability to properly handle "mahout run -h" or "mahout run --help", helpfully spitting out the list of classes with shortNames which MahoutDriver has been told about in the driver.classes.props, and more importantly, it can, both in a release environment and in a dev environment, do:

          ./bin/mahout run kmeans [options]
          

          If $MAHOUT_CONF_DIR is set, and points to a place with the right files, then the default properties are loaded from there (overridden by [options] given above).

          If both $HADOOP_HOME and $HADOOP_CONF_DIR are set, then this sets $HADOOP_CLASSPATH to be prepended with $MAHOUT_CONF_DIR so that the following is run:

          $HADOOP_HOME/bin/hadoop jar [path to examples.job] o.a.m.driver.MahoutDriver kmeans [options]
          

          which actually works and gets the default properties loaded and overridden as necessary, running your job on the hadoop cluster.

          If one of those variables is not set (TODO: if $HADOOP_HOME is specified but $HADOOP_CONF_DIR is not, guess a default of $HADOOP_HOME/conf, I suppose), then the assumption is to run locally.
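          That TODO could be a couple of lines in the script; a sketch with placeholder paths:

```shell
# Sketch: default HADOOP_CONF_DIR from HADOOP_HOME when only the latter is set.
HADOOP_HOME="/opt/hadoop"     # example value
unset HADOOP_CONF_DIR         # simulate the unset case
if [ -n "$HADOOP_HOME" ] && [ -z "$HADOOP_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_HOME/conf"
fi
echo "$HADOOP_CONF_DIR"
```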

          Previous behavior still works, from what I can tell - you can still do:

          $MAHOUT_HOME/bin/mahout kmeans --output kmeans/out --input input/vecs -k 13 --clusters tmp/foobar
          

          and we're backwards compatible with the old way.

          Now the question is: do we want to be? Or do we want to trim down the shell script to just always use MahoutDriver, and get rid of all of the 'elif [ "$COMMAND" =' stuff and just have $CLASS be MahoutDriver, passing it $COMMAND as the first argument?

          Then the command line would be exactly the same as before, except you could also load up your $MAHOUT_CONF_DIR/<shortName>.props files with whatever defaults you wanted to use.
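          For illustration, a hypothetical $MAHOUT_CONF_DIR/kmeans.props carrying the defaults from the kmeans command-line example above (the key names are an assumption; the exact format is whatever MahoutDriver parses):

```
# hypothetical $MAHOUT_CONF_DIR/kmeans.props -- values copied from the
# command-line example above; key names are an assumption
input = input/vecs
output = kmeans/out
k = 13
clusters = tmp/foobar
```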

          Drew Farris added a comment -

          This sounds great. I will take it for a spin when I am in front of a computer. My take is that the old if/else blocks in the script are now redundant. As long as one can use MahoutDriver to run both classes that have been aliased to short names and classes specified using the full name, I say let's get rid of them.

          Drew Farris added a comment -

          Jake, this is looking really great.

          Here's a partial patch that includes modifications to bin/mahout and MahoutDriver:

          It removes the separate 'command' option from the original script and delegates everything to MahoutDriver, so things like the following work:

          ./mahout testclassifier
          ./mahout --help
          

          Also will set MAHOUT_CONF_DIR to MAHOUT_HOME/conf if MAHOUT_CONF_DIR is not set.

          If no args are specified, will print same output as --help.

          One potential TODO from this would be to launch arbitrary classes if no matching program name is specified, but I need to dig into ProgramDriver to understand how it works before I can contribute something like that.

          Hope this is helpful.
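          The no-args and --help handling Drew describes could be sketched like this (a hypothetical illustration; the class and method names here are illustrative, not taken from the actual patch):

```java
public class HelpGuard {
  // With no arguments, or with -h/--help as the program name, the driver
  // should print the same listing of known short names in both cases.
  static boolean shouldPrintHelp(String[] args) {
    return args.length == 0
        || "--help".equals(args[0])
        || "-h".equals(args[0]);
  }

  public static void main(String[] args) {
    System.out.println(shouldPrintHelp(new String[0]));             // true
    System.out.println(shouldPrintHelp(new String[] {"--help"}));   // true
    System.out.println(shouldPrintHelp(new String[] {"kmeans"}));   // false
  }
}
```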

          Jake Mannix added a comment -

          Awesome Drew, I'll check it out.

          One potential TODO from this would be to launch arbitrary classes if no matching program name is specified, but I need to dig into ProgramDriver to understand how it works before I can contribute something like that.

          Yeah, I was thinking about that over breakfast - an easy hack to do this: while the driver.classes.props file is being read, keep track of whether you've found an exact match on args[0], and once all of driver.classes.props has been read and you haven't found a match, just do a Class.forName(args[0]) and add it to the ProgramDriver with its full name as the "shortName", and the rest of the program will work (and would even still work with default properties files! If you put com.mycompany.MyClass.props in $MAHOUT_CONF_DIR, it'll read that for defaults).

          I'll see if I can add that to your patch later today. I think if that's working, we should be looking good to commit and see who else wants to play with it and test it out.
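          The fallback described above could be sketched roughly like this (hypothetical: the exact property-file format and method names are assumptions, and the real driver would register the resolved class with ProgramDriver rather than just return its name):

```java
import java.io.StringReader;
import java.util.Properties;

public class ShortNameResolver {
  // Resolve the requested program name against driver.classes.props; if no
  // short name matches, fall back to treating it as a fully qualified class
  // name, with Class.forName verifying that such a class actually exists.
  static String resolve(Properties driverClasses, String programName)
      throws ClassNotFoundException {
    for (String fqcn : driverClasses.stringPropertyNames()) {
      // Assumed entry format: fully.qualified.ClassName = shortName
      if (driverClasses.getProperty(fqcn).equals(programName)) {
        return fqcn;
      }
    }
    Class.forName(programName); // throws if the arbitrary class doesn't exist
    return programName;
  }

  public static void main(String[] args) throws Exception {
    Properties p = new Properties();
    p.load(new StringReader("org.example.KMeansDriver=kmeans\n"));
    System.out.println(resolve(p, "kmeans"));              // org.example.KMeansDriver
    System.out.println(resolve(p, "java.util.ArrayList")); // java.util.ArrayList
  }
}
```

          Registering the resolved class under the name it was invoked with would also keep the default-properties lookup working, since the .props file is found by that name.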

          Jake Mannix added a comment -

          Ok, new patch, with the modification that you can indeed just run "$MAHOUT_HOME/bin/mahout <classname> [args]" and it still works. And if <classname>.props exists on the classpath, it'll get used for defaults. w00t, as the kids say.

          I've added the conf directory to the patch (you'd not kept it in your patch, Drew), and there are a bunch of empty files in there, except some of them have commented-out properties in the right format:

          cleaneigen.props :

          #ci|corpusInput =
          #ei|eigenInput =
          #o|output =
          

          These show users what they can store in the file, and in what format.
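          The way such a .props file could feed into the argument list might look like this (a sketch under the assumption that each "shortFlag|longFlag = value" entry becomes a "--longFlag value" pair unless the user already passed that flag on the command line; the class and method names are illustrative):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DefaultProps {
  // Expand default properties into long-form arguments, letting explicit
  // command-line flags win over the file's defaults.
  static List<String> mergeDefaults(Properties defaults, List<String> cliArgs) {
    List<String> merged = new ArrayList<>(cliArgs);
    for (String key : defaults.stringPropertyNames()) {
      // Assumed key format: shortFlag|longFlag (e.g. "o|output")
      String longFlag = "--" + key.substring(key.indexOf('|') + 1);
      if (!merged.contains(longFlag)) {
        merged.add(longFlag);
        merged.add(defaults.getProperty(key));
      }
    }
    return merged;
  }

  public static void main(String[] args) throws Exception {
    Properties defaults = new Properties();
    defaults.load(new StringReader("o|output=/tmp/out\nci|corpusInput=/tmp/in\n"));
    // --output on the command line overrides the file's default:
    System.out.println(mergeDefaults(defaults,
        Arrays.asList("--output", "kmeans/out")));
    // → [--output, kmeans/out, --corpusInput, /tmp/in]
  }
}
```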

          Jake Mannix added a comment -

          Let's release this. Others want to try it out?

          We need documentation for it too, obviously, but let's see how it runs on other jobs. It should work on Hadoop too, as this ticket's comment thread indicates.

          Drew Farris added a comment -

          Had a chance to take this out for a spin tonight. It is working very well. I ran some k-means using the script, starting with the 20newsgroups collection as text files, both locally and on a cluster. I think it is good to go; can we commit? I'd be happy to handle it if we have sufficient consensus.

          There are a couple of modifications I've made to the Maven assemblies to include all of this in the binary and source releases properly (adding the conf directory, setting the executable bit on the mahout script, etc.). While I was at it, I cleaned up the bin assembly process so that the releases should build faster too. Should I commit those, open another issue, or re-post them as part of this patch?

          Robin Anil added a comment -

          +1 for committing this.

          Can you upload the patch for the Maven configs? Maybe as a separate issue, marked for 0.3.

          Jake Mannix added a comment -

          Drew, do you have a patch with your last changes? If I can try them out too to verify that they work on more than one system, we can commit this I think.

          Should I commit those, open another issue, or re-post them as part of this patch?

          I'd say that should be in a separate issue, that should be small enough to mark for 0.3 and commit separately.

          Grant Ingersoll added a comment -

          Just capturing something longer term here, no need to block anything. One of the things I'd love to have is some basic "experiment management" capabilities. I can imagine in this mode that things like input parameters, etc. are all written into files and organized along with the output, etc. such that it is easy to keep track of all the different ways things get run over time. Seems like this script w/ default property files, etc. could be part of that solution.

          Drew Farris added a comment -

          Can you upload the patch for the Maven configs? Maybe as a separate issue, marked for 0.3.

          See: MAHOUT-311

          Jake Mannix added a comment -

          Checked in a version of this which works; not sure if it had the most updated stuff from Drew in it. I'll check out the MAHOUT-311 patch to see if there's a bit more of the assembly stuff to get in too.


            People

            • Assignee: Jake Mannix
            • Reporter: Jake Mannix
            • Votes: 0
            • Watchers: 0
