Mahout / MAHOUT-397

SparseVectorsFromSequenceFiles only outputs a single vector file

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Integration
    • Labels: None

      Description

      When running LDA via build-reuters.sh on a 3-node Hadoop cluster, I noticed that the utility preprocessing steps produce only a single vector file. This means LDA (and other clustering jobs too) can use only a single mapper, no matter how large the cluster is. On investigation, it seems that the program argument (-nr) for setting the number of reducers, and hence the number of output files, is not propagated to the final stages where the output vectors are created.
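
The link between reducer count and output file count described above can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: the class and method names (ReducerOutputSketch, partitionFor, assignToPartitions) and the document ids are hypothetical and not taken from Mahout. The partition formula mirrors the one used by Hadoop's default HashPartitioner, and each bucket stands in for one reducer output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, Hadoop-free sketch: not Mahout code.
public class ReducerOutputSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Creates one bucket per reducer; each bucket stands in for one output file.
    static List<List<String>> assignToPartitions(List<String> keys, int numReducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, numReducers)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up document ids, used only for illustration.
        List<String> docIds = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6");
        for (int numReducers : new int[] {1, 3}) {
            List<List<String>> parts = assignToPartitions(docIds, numReducers);
            System.out.println(numReducers + " reducer(s) -> " + parts.size() + " output file(s)");
        }
    }
}
```

With one reducer there is exactly one bucket, so a downstream job reading that small output would typically be given a single mapper, which matches the behavior reported here.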

      1. MAHOUT-397.patch
        13 kB
        Jeff Eastman

        Activity

        Transition                    Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Patch Available       13m 42s                 1                 Jeff Eastman    20/May/10 02:43
        Patch Available -> Resolved   125d 6h 1m              1                 Sean Owen       22/Sep/10 08:44
        Resolved -> Closed            39d 8h 5m               1                 Sean Owen       31/Oct/10 15:49
        Sean Owen made changes:
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Sean Owen made changes:
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: (none) -> Fixed [ 1 ]
        Sean Owen added a comment -

        Sounds like Jeff fixed it.

        Jeff Eastman made changes:
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Jeff Eastman added a comment -

        Patch submitted; it runs on r946508.

        Jeff Eastman made changes:
        Attachment: (none) -> MAHOUT-397.patch [ 12445011 ]
        Jeff Eastman added a comment -

        This patch seems to resolve the issue by propagating the number-of-reducers argument through to the back-end processing steps where the actual output vectors are produced. It also makes a slight modification to SequenceFilesFromDirectory, removing the chunk-size upsizing to 64 MB, which allows the Reuters data to be split into 3 smaller files to improve processing. All unit tests run.

        Files modified:
        M core/src/main/java/org/apache/mahout/clustering/lda/LDADriver.java
        M utils/src/test/java/org/apache/mahout/utils/vectors/text/DictionaryVectorizerTest.java
        M utils/src/main/java/org/apache/mahout/utils/vectors/text/DictionaryVectorizer.java
        M utils/src/main/java/org/apache/mahout/utils/vectors/common/PartialVectorMerger.java
        M utils/src/main/java/org/apache/mahout/utils/vectors/tfidf/TFIDFConverter.java
        M utils/src/main/java/org/apache/mahout/text/SparseVectorsFromSequenceFiles.java
        M examples/src/main/java/org/apache/mahout/text/SequenceFilesFromDirectory.java
        M examples/bin/build-reuters.sh

        With the attached build-reuters.sh, LDA iterations run in about 1.5 min on a 3-node cluster, using 3 mappers and 2-3 reducers for the vectorization, vs. 5.5 min with a single vector file. I will commit it in a day or so, but I want some more eyeballs on it first since this is new code for me.
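
The propagation fix described in this comment can be illustrated with a small, self-contained Java sketch. It is not the patch itself: the class PropagationSketch, its methods, and the stage names are hypothetical, and the pipeline is simplified so that only the last stage writes the output vectors. The one grounded assumption is that a Hadoop job falls back to a single reduce task when no count is set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a multi-stage vectorization pipeline: not Mahout code.
public class PropagationSketch {

    // Hadoop's default number of reduce tasks when none is configured.
    static final int DEFAULT_REDUCERS = 1;

    // Illustrative stage names only; the last stage writes the output vectors.
    static final String[] STAGES = {"tokenize", "count-terms", "partial-vectors", "merge-vectors"};

    // Returns the reducer count each stage would run with.
    // propagateToAllStages = false models the reported bug (final stage ignores the argument);
    // propagateToAllStages = true models the behavior after the fix.
    static Map<String, Integer> planStages(int requestedReducers, boolean propagateToAllStages) {
        Map<String, Integer> plan = new LinkedHashMap<>();
        for (int i = 0; i < STAGES.length; i++) {
            boolean isFinalStage = (i == STAGES.length - 1);
            int reducers = (propagateToAllStages || !isFinalStage) ? requestedReducers : DEFAULT_REDUCERS;
            plan.put(STAGES[i], reducers);
        }
        return plan;
    }

    // The number of vector files equals the reducer count of the final stage.
    static int finalOutputFiles(Map<String, Integer> plan) {
        return plan.get(STAGES[STAGES.length - 1]);
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + finalOutputFiles(planStages(3, false)) + " vector file(s)");
        System.out.println("after fix:  " + finalOutputFiles(planStages(3, true)) + " vector file(s)");
    }
}
```

In this model, requesting 3 reducers yields 1 vector file when the final stage ignores the argument and 3 when the value reaches every stage, which is the effect the patch aims for.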

        Jeff Eastman created issue

          People

          • Assignee:
            Jeff Eastman
            Reporter:
            Jeff Eastman
          • Votes:
            0
            Watchers:
            0
