Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7
    • Component/s: None
    • Labels:
      None

      Description

      I've put together a few handy additions to seqdumper:

      • Ability to dump all sequence files in a directory.
      • A quiet flag to attenuate the non-data output.
      • A flag to toggle name-only printing for NamedVector values.
      • An option to only print the N highest-valued elements in WeightedVector values

      Seems like others will probably find some of these to be helpful.

      1. MAHOUT-947.patch
        10 kB
        tom pierce
      2. MAHOUT-947-2.patch
        25 kB
        tom pierce
      3. MAHOUT-947.patch
        25 kB
        tom pierce
      4. MAHOUT-947.patch
        95 kB
        Grant Ingersoll

        Activity

        Hide
        Lance Norskog added a comment -

        There is a sequencefile utility code pattern that handles files and directories and supplies an iterator. There's no need to have separate args.

        Show
        Lance Norskog added a comment - There is a sequencefile utility code pattern that handles files and directories and supplies an iterator. There's no need to have separate args.
        Hide
        Grant Ingersoll added a comment -

        I wasn't suggesting supporting multiple args, just quoted globs - since HDFS FileSystem supports strings with glob patterns in them...

        Makes sense. I can update. In any case, I changed Tom's patch to use our standard --input flag for both and then just check to see whether it is a directory or not. We could just as well check to see if it is a glob.

        Show
        Grant Ingersoll added a comment - I wasn't suggesting supporting multiple args, just quoted globs - since HDFS FileSystem supports strings with glob patterns in them... Makes sense. I can update. In any case, I changed Tom's patch to use our standard --input flag for both and then just check to see whether it is a directory or not. We could just as well check to see if it is a glob.
        Hide
        Grant Ingersoll added a comment -

        Hmm, should be a getOptions in there, but maybe my patch is messed up:

        /**
           * Options can occur multiple times, so return the list
           * @param optionName The unadorned (no "--" prefixing it) option name
           * @return The values, else null.  If the option is present, but has no values, then the result will be an empty list (Collections.emptyList())
           */
          public List<String> getOptions(String optionName){
            return argMap.get(keyFor(optionName));
          }
        

        Or do you mean we should just have one or the other, but not both? That could work. I did both as it seems like one knows when one only wants want arg and when one wants multiples, so getOption() is really just a convenience method.

        Perhaps the parsedArgs is still useful if one wants to iterate over them or something? There also is at least one place where we use them to pass through to other jobs that aren't necessarily AbstractJobs (DistributedConjugateGradientSolver)

        Show
        Grant Ingersoll added a comment - Hmm, should be a getOptions in there, but maybe my patch is messed up: /** * Options can occur multiple times, so return the list * @param optionName The unadorned (no "--" prefixing it) option name * @ return The values, else null . If the option is present, but has no values, then the result will be an empty list (Collections.emptyList()) */ public List< String > getOptions( String optionName){ return argMap.get(keyFor(optionName)); } Or do you mean we should just have one or the other, but not both? That could work. I did both as it seems like one knows when one only wants want arg and when one wants multiples, so getOption() is really just a convenience method. Perhaps the parsedArgs is still useful if one wants to iterate over them or something? There also is at least one place where we use them to pass through to other jobs that aren't necessarily AbstractJobs (DistributedConjugateGradientSolver)
        Hide
        Sean Owen added a comment -

        My only issue with this is that this has brought in a second way to access args – getOption(). It seems like it should be one way or the other. The 'parsedArgs' thing is no longer useful then and could be removed. You could have a getOptions() method to return a multiple-valued flag.

        Show
        Sean Owen added a comment - My only issue with this is that this has brought in a second way to access args – getOption(). It seems like it should be one way or the other. The 'parsedArgs' thing is no longer useful then and could be removed. You could have a getOptions() method to return a multiple-valued flag.
        Hide
        Jake Mannix added a comment -

        I wasn't suggesting supporting multiple args, just quoted globs - since HDFS FileSystem supports strings with glob patterns in them...

        Show
        Jake Mannix added a comment - I wasn't suggesting supporting multiple args, just quoted globs - since HDFS FileSystem supports strings with glob patterns in them...
        Hide
        Grant Ingersoll added a comment -

        This patch is quite a bit bigger than Tom's b/c as I was digging in converting to use AbstractJob, I realized that AbstractJob only supported single value arguments. This patch fixes this and also goes through and standardizes a whole slew of files that use AbstractJob

        Show
        Grant Ingersoll added a comment - This patch is quite a bit bigger than Tom's b/c as I was digging in converting to use AbstractJob, I realized that AbstractJob only supported single value arguments. This patch fixes this and also goes through and standardizes a whole slew of files that use AbstractJob
        Hide
        Grant Ingersoll added a comment -

        I have a patch for this that cleans this up and switches it to the standard command line processing. Will post soon

        Show
        Grant Ingersoll added a comment - I have a patch for this that cleans this up and switches it to the standard command line processing. Will post soon
        Hide
        Jake Mannix added a comment -

        so one comment: instead of --seqDirectory vs --seqFile, why not just support glob paths?

        vectordumper s "/path/to/my/dir/part*"

        ?

        Do we need separate flags for if it's a directory vs a file? Makes the usage messier, IMO.

        Show
        Jake Mannix added a comment - so one comment: instead of --seqDirectory vs --seqFile, why not just support glob paths? vectordumper s "/path/to/my/dir/part *" ? Do we need separate flags for if it's a directory vs a file? Makes the usage messier, IMO.
        Hide
        Lance Norskog added a comment -

        I won't be able to try it. The patch looks clean.

        Show
        Lance Norskog added a comment - I won't be able to try it. The patch looks clean.
        Hide
        Grant Ingersoll added a comment -

        I'm close to committing

        Show
        Grant Ingersoll added a comment - I'm close to committing
        Hide
        Jeff Hammerbacher added a comment -

        Hey Lance,

        Any more changes required for this patch?

        Thanks,
        Jeff

        Show
        Jeff Hammerbacher added a comment - Hey Lance, Any more changes required for this patch? Thanks, Jeff
        Hide
        tom pierce added a comment -

        Dropped the cluster dumping addition to VectorDumper.

        Show
        tom pierce added a comment - Dropped the cluster dumping addition to VectorDumper.
        Hide
        tom pierce added a comment -

        Hah - I agree on everything needing a quiet option (or better, quiet by default and a verbose flag!).

        There was at least one println/write that I didn't want to see in one of these (and the bin/mahout wrapper adds some output too). Turning up the loglevel would be good, though I think I remember this being tricky to do programatically with slf4j (though maybe I'm confusing slf4j with another Java logger).

        Any objection to VectorDumper having the ability to dump vectors from clusters? It looks like ClusterDumper likes to read things into core, which can sometimes be troublesome.

        Show
        tom pierce added a comment - Hah - I agree on everything needing a quiet option (or better, quiet by default and a verbose flag!). There was at least one println/write that I didn't want to see in one of these (and the bin/mahout wrapper adds some output too). Turning up the loglevel would be good, though I think I remember this being tricky to do programatically with slf4j (though maybe I'm confusing slf4j with another Java logger). Any objection to VectorDumper having the ability to dump vectors from clusters? It looks like ClusterDumper likes to read things into core, which can sometimes be troublesome.
        Hide
        Lance Norskog added a comment -

        mahout/src/conf/driver.classes.props lists all of the dumper classes.

        Everything needs the "quiet" option In fact, could it just change the log level from INFO to WARN?

        Show
        Lance Norskog added a comment - mahout/src/conf/driver.classes.props lists all of the dumper classes. Everything needs the "quiet" option In fact, could it just change the log level from INFO to WARN?
        Hide
        tom pierce added a comment -

        Adjusted to put vector options in VectorDumper. Also add ability to dump vectors from AbstractClusters and WeightedPropertyVectorWritables.

        Show
        tom pierce added a comment - Adjusted to put vector options in VectorDumper. Also add ability to dump vectors from AbstractClusters and WeightedPropertyVectorWritables.
        Hide
        tom pierce added a comment -

        Oh nice- I hadn't seen VectorDumper before. Looks like they could both use the directory and quiet options, and the other things should move over. Will adjust patch.

        Show
        tom pierce added a comment - Oh nice- I hadn't seen VectorDumper before. Looks like they could both use the directory and quiet options, and the other things should move over. Will adjust patch.
        Hide
        Lance Norskog added a comment -

        VectorDumper is a custom class just for vectors; should most of these be there?

        Show
        Lance Norskog added a comment - VectorDumper is a custom class just for vectors; should most of these be there?

          People

          • Assignee:
            Grant Ingersoll
            Reporter:
            tom pierce
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development