Mahout
  1. Mahout
  2. MAHOUT-695

Have LDADriver determine numWords from input vectors

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

      It bugged me that you needed to specify the number of words directly to the LDADriver
      eg ./bin/mahout lda \
      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
      -o ./examples/bin/work/reuters-lda -k 20 -v 50000 -ow -x 20

      with this patch the ldadriver just checks a vector from the input to determine the size
      eg ./bin/mahout lda \
      -i ./examples/bin/work/reuters-out-seqdir-sparse/tf-vectors \
      -o ./examples/bin/work/reuters-lda -k 20 -ow -x 20

      1. mahout-695.patch
        3 kB
        Mat Kelcey
      2. mahout-695.patch
        4 kB
        Jake Mannix

        Activity

        Hide
        Mat Kelcey added a comment -

        is this patch formatted correctly?
        am using a cloned git repo from github
        and generated it with 'git format-patch -p'

        also any, and all, feedback welcome.
        i'd love to get my hands dirtier.

        Show
        Mat Kelcey added a comment - is this patch formatted correctly? am using a cloned git repo from github and generated it with 'git format-patch -p' also any, and all, feedback welcome. i'd love to get my hands dirtier.
        Hide
        Jake Mannix added a comment -

        Mat, make sure in the future, to generate your diffs in a git repo simply like this: "git diff --no-prefix", and it'll format great for application by a simple "patch -p0 < patchfile".

        Attached is your patch, regenerated to be like this.

        Show
        Jake Mannix added a comment - Mat, make sure in the future, to generate your diffs in a git repo simply like this: "git diff --no-prefix", and it'll format great for application by a simple "patch -p0 < patchfile". Attached is your patch, regenerated to be like this.
        Hide
        Jake Mannix added a comment -

        But awesome work, thanks, this is great, I've also been often annoyed by this missing feature. What would be even cooler? If in case there was no dictionary, we just sniff the first vector in the data set, and ask for its getSize()!

        Show
        Jake Mannix added a comment - But awesome work, thanks, this is great, I've also been often annoyed by this missing feature. What would be even cooler? If in case there was no dictionary, we just sniff the first vector in the data set, and ask for its getSize()!
        Hide
        Mat Kelcey added a comment -

        yeah right, i didn't think of checking the first vector...
        is there any case when sniffing the dataset wouldn't work?
        if so it could just do that in all cases in which the --numWords or --dict parameters wouldn't be needed at all.

        Show
        Mat Kelcey added a comment - yeah right, i didn't think of checking the first vector... is there any case when sniffing the dataset wouldn't work? if so it could just do that in all cases in which the --numWords or --dict parameters wouldn't be needed at all.
        Hide
        Jake Mannix added a comment -

        Hmm... the dataset is always there, and for algorithms running on vectors, will always indeed consist of at least on vector, naturally. So I'm guessing that we should be able to always do this, and in the grand scheme of things, will be not much slower than just supplying the param, so yeah, I agree, let's swap it out so that we just sniff and use the info we get from that!

        Show
        Jake Mannix added a comment - Hmm... the dataset is always there, and for algorithms running on vectors, will always indeed consist of at least on vector, naturally. So I'm guessing that we should be able to always do this, and in the grand scheme of things, will be not much slower than just supplying the param, so yeah, I agree, let's swap it out so that we just sniff and use the info we get from that!
        Hide
        Mat Kelcey added a comment -

        cool, i'll give it a crack tomorrow and see how i go.

        Show
        Mat Kelcey added a comment - cool, i'll give it a crack tomorrow and see how i go.
        Hide
        Mat Kelcey added a comment -

        here's another patch for determining the num words from the first vector.

        i've left numwords option in though as a form of deprecation so a warning can be given. the alternate of taking the option out would fail at startup complaining about the unknown arg. so depending on how much backwards compatibility you're after this might not be needed...

        Show
        Mat Kelcey added a comment - here's another patch for determining the num words from the first vector. i've left numwords option in though as a form of deprecation so a warning can be given. the alternate of taking the option out would fail at startup complaining about the unknown arg. so depending on how much backwards compatibility you're after this might not be needed...
        Hide
        Mat Kelcey added a comment -

        a new version to include changes from
        https://issues.apache.org/jira/browse/MAHOUT-694

        Show
        Mat Kelcey added a comment - a new version to include changes from https://issues.apache.org/jira/browse/MAHOUT-694
        Hide
        Mat Kelcey added a comment -

        oops, patched from my wrong branch. i've delete them and will resubmit a patch tomorrow.

        Show
        Mat Kelcey added a comment - oops, patched from my wrong branch. i've delete them and will resubmit a patch tomorrow.
        Hide
        Mat Kelcey added a comment -

        Have removed NUM_WORDS option completely which will break existing callers since it makes it an unknown parameter. (Not sure if backwards compability is an issue at this stage) Am happy to reinclude code to ignore it with a warning message that it's deprecated.

        Show
        Mat Kelcey added a comment - Have removed NUM_WORDS option completely which will break existing callers since it makes it an unknown parameter. (Not sure if backwards compability is an issue at this stage) Am happy to reinclude code to ignore it with a warning message that it's deprecated.
        Hide
        Sean Owen added a comment -

        Hey Mat, looks like a good patch, and Jake likes it. I think it's OK to break callers in this regard. Is it probably OK to commit this, as far as you are concerned, or are there any other changes / concerns that should be addressed?

        Show
        Sean Owen added a comment - Hey Mat, looks like a good patch, and Jake likes it. I think it's OK to break callers in this regard. Is it probably OK to commit this, as far as you are concerned, or are there any other changes / concerns that should be addressed?
        Hide
        Mat Kelcey added a comment -

        I'll recheck the patch applies cleanly over the weekend, it's been awhile and I wouldn't be surprised if bin/build-reuters.sh has changed a bit.

        Show
        Mat Kelcey added a comment - I'll recheck the patch applies cleanly over the weekend, it's been awhile and I wouldn't be surprised if bin/build-reuters.sh has changed a bit.
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1072 (See https://builds.apache.org/job/Mahout-Quality/1072/)
        MAHOUT-695 Compute number of terms rather than specify it

        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177616
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
        • /mahout/trunk/examples/bin/build-reuters.sh
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1072 (See https://builds.apache.org/job/Mahout-Quality/1072/ ) MAHOUT-695 Compute number of terms rather than specify it srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1177616 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java /mahout/trunk/examples/bin/build-reuters.sh

          People

          • Assignee:
            Jake Mannix
            Reporter:
            Mat Kelcey
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development