Mahout / MAHOUT-398

Seq2sparse outputs final vectors to different directories depending upon the TF/TFIDF weight switch. This is confusing to users.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Integration
    • Labels:
      None

      Description

      In TF mode, seq2sparse puts the output vectors into <output>/vectors. In TFIDF mode, however, it puts the output vectors into <output>/tfidf/vectors. This happens because the IDF calculation, if it is selected, happens after TF and uses the TF vectors as its input.

      It seems both modes ought to output to a consistent directory structure, so that changing the switch does not change the final output location. This could be as simple as changing TF to output to <output>/tf/vectors, so that the contents of both directories, when present, are obvious from their names.
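
      To make the inconsistency concrete, here is a minimal sketch of the behavior described above. It is an illustration only, not Mahout's code: the class, enum, and method names are hypothetical, and java.nio.file.Path is assumed as a stand-in for whatever path type the job actually uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the layout described in this issue (before the fix).
public class LegacyVectorLocation {

    // Stand-in for the TF/TFIDF weight switch.
    enum Weight { TF, TFIDF }

    // Returns the directory a downstream user would have to read the final vectors from.
    static Path finalVectors(Path output, Weight weight) {
        if (weight == Weight.TFIDF) {
            // The IDF step runs after TF, so its result lands in its own subdirectory.
            return output.resolve("tfidf").resolve("vectors");
        }
        // TF mode: the vectors sit directly under the output directory.
        return output.resolve("vectors");
    }

    public static void main(String[] args) {
        Path output = Paths.get("out");
        // The two printed locations differ in depth as well as in name.
        System.out.println(finalVectors(output, Weight.TF));
        System.out.println(finalVectors(output, Weight.TFIDF));
    }
}
```

      Flipping the switch changes the depth of the final location, which is what forces downstream changes.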

        Activity

        Transition                     Time In Source Status   Execution Times   Last Executer   Last Execution Date
        Open -> Patch Available        6d 10h 39m              1                 Drew Farris     28/May/10 12:16
        Patch Available -> Resolved    2d 14h 49m              1                 Drew Farris     31/May/10 03:06
        Resolved -> Closed             153d 13h 43m            1                 Sean Owen       31/Oct/10 15:49
        Sean Owen made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]
        Hudson added a comment -

        Integrated in Mahout-Quality #38 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/38/)
        MAHOUT-398: eliminated separate tfidf directory for tfidf vector output.

        Drew Farris made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Drew Farris made changes -
        Assignee: Drew Farris [ drew.farris ]
        Drew Farris added a comment -

        Committed revision 949649. eliminated separate tfidf directory for tfidf vector output.

        Drew Farris made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Drew Farris added a comment -

        Any objections to this? I would like to commit.

        Drew Farris made changes -
        Attachment: MAHOUT-398.patch [ 12445276 ]
        Drew Farris added a comment -

        Jeff, I agree, it makes sense to make this a bit more consistent across outputs.

        After the minor changes you propose, the output produced by the reuters example when constructing tfidf vectors looks like this:

        .../reuters-out-seqdir-sparse/dictionary.file-0
        .../reuters-out-seqdir-sparse/tfidf
        .../reuters-out-seqdir-sparse/tfidf/frequency.file-0
        .../reuters-out-seqdir-sparse/tfidf/df-count
        .../reuters-out-seqdir-sparse/tfidf/df-count/part-00000
        .../reuters-out-seqdir-sparse/tfidf/tfidf-vectors
        .../reuters-out-seqdir-sparse/tfidf/tfidf-vectors/part-00000
        .../reuters-out-seqdir-sparse/tf-vectors
        .../reuters-out-seqdir-sparse/tf-vectors/part-00000
        .../reuters-out-seqdir-sparse/tokenized-documents
        .../reuters-out-seqdir-sparse/tokenized-documents/part-00000
        .../reuters-out-seqdir-sparse/wordcount
        .../reuters-out-seqdir-sparse/wordcount/part-00000
        

        How about we put the tfidf-vectors and tf-vectors output directories at the same level? It seems that putting frequency.file and dictionary.file at the same level might make some sense too. I know there's been some talk about standardizing input, working and output directory creation for jobs, but I haven't followed it. Might that provide some suggestion about what to do here?

        Here's a patch that includes Jeff's changes and pushes the tfidf stuff up a level. The output is:

        reuters-out-seqdir-sparse/dictionary.file-0
        reuters-out-seqdir-sparse/frequency.file-0
        reuters-out-seqdir-sparse/tf-vectors
        reuters-out-seqdir-sparse/tf-vectors/part-00000
        reuters-out-seqdir-sparse/tokenized-documents
        reuters-out-seqdir-sparse/tokenized-documents/part-00000
        reuters-out-seqdir-sparse/df-count
        reuters-out-seqdir-sparse/df-count/part-00000
        reuters-out-seqdir-sparse/tfidf-vectors
        reuters-out-seqdir-sparse/tfidf-vectors/part-00000
        reuters-out-seqdir-sparse/wordcount
        reuters-out-seqdir-sparse/wordcount/part-00000
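
        As a rough sketch of what the flattened layout means for downstream code: both vector directories are now siblings under the output directory, and only the folder name depends on the weight switch. The folder names are taken from the listing above; the class and method names are hypothetical, and java.nio.file.Path is assumed as a stand-in for the path type the job actually uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not the code in the attached patch.
public class FlatVectorLocation {

    // Folder names as they appear in the output listing above.
    static final String TF_FOLDER = "tf-vectors";
    static final String TFIDF_FOLDER = "tfidf-vectors";

    // Returns the directory holding the final vectors for the chosen weighting.
    static Path finalVectors(Path output, boolean tfidf) {
        // Both results are direct children of the output directory.
        return output.resolve(tfidf ? TFIDF_FOLDER : TF_FOLDER);
    }

    public static void main(String[] args) {
        Path output = Paths.get("reuters-out-seqdir-sparse");
        System.out.println(finalVectors(output, false));
        System.out.println(finalVectors(output, true));
    }
}
```

        A consumer only has to switch the folder name; the depth of the path no longer changes.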
        
        Jeff Eastman added a comment -

        Here's a very minimal fix that, imho, reduces the ambiguity and makes the contents of the vector directories much more obvious:

        // In DictionaryVectorizer:
        public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER = "tf-vectors";

        // In TFIDFConverter:
        private static final String DOCUMENT_VECTOR_OUTPUT_FOLDER = "tfidf-vectors";

        Jeff Eastman made changes -
        Field: Description

        Original Value: In TF mode, seq2sparse puts the output vectors into <output>vectors. In TFIDF mode; however, it puts the output vectors into <output>/tfidf/vectors. Even worse, in TFIDF mode the TFIDF converter reuses the <output>/vector/ directory for its intermediate calculations. Seems like both modes ought to output to the same directory so changing the switch does not cause downstream user changes that are error-prone and confusing.

        New Value: In TF mode, seq2sparse puts the output vectors into <output>/vectors. In TFIDF mode; however, it puts the output vectors into <output>/tfidf/vectors. This happens because the IDF calculation - if it is selected - happens after TF and uses the TF vectors for its input.

        Seems like both modes ought to output to a consistent directory structure so changing the switch does not change the final output location: perhaps as simple as changing TF to output to <output>/tf/vectors so that the contents of both directories when present are more obvious from their nomenclature.
        Jeff Eastman created issue -

          People

          • Assignee:
            Drew Farris
            Reporter:
            Jeff Eastman
          • Votes:
            0
            Watchers:
            0
