Mahout / MAHOUT-398

Seq2sparse outputs final vectors to different directories depending upon the TF/TFIDF weight switch. This is confusing to users.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.4
    • Component/s: Integration
    • Labels:
      None

      Description

      In TF mode, seq2sparse puts the output vectors into <output>/vectors. In TFIDF mode, however, it puts the output vectors into <output>/tfidf/vectors. This happens because the IDF calculation, if selected, runs after TF and uses the TF vectors as its input.

      It seems both modes ought to write to a consistent directory structure, so that changing the switch does not change the final output location. This could be as simple as changing TF to output to <output>/tf/vectors, so that the contents of both directories, when present, are obvious from their names.
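
      The inconsistency can be sketched as follows. This is only an illustration of the two locations reported above: the class `LegacyLayout` and the method `finalVectorDir` are hypothetical names, not Mahout APIs, and the boolean parameter stands in for the TF/TFIDF weight switch.

```java
// Hypothetical illustration; not part of Mahout.
class LegacyLayout {
    // Returns the directory that holds the final vectors before the fix,
    // as described in this report. 'tfidf' stands in for the weight switch.
    static String finalVectorDir(String output, boolean tfidf) {
        return tfidf ? output + "/tfidf/vectors" : output + "/vectors";
    }

    public static void main(String[] args) {
        // Downstream code has to know which mode was used to find the result.
        System.out.println(finalVectorDir("out", false));
        System.out.println(finalVectorDir("out", true));
    }
}
```

      Any consumer of the output therefore needs to branch on the weighting mode, which is the source of the confusion reported here.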

        Activity

        Jeff Eastman created issue -
        Jeff Eastman made changes -
        Field: Description
        Original Value: In TF mode, seq2sparse puts the output vectors into <output>vectors. In TFIDF mode; however, it puts the output vectors into <output>/tfidf/vectors. Even worse, in TFIDF mode the TFIDF converter reuses the <output>/vector/ directory for its intermediate calculations. Seems like both modes ought to output to the same directory so changing the switch does not cause downstream user changes that are error-prone and confusing.
        New Value: In TF mode, seq2sparse puts the output vectors into <output>/vectors. In TFIDF mode; however, it puts the output vectors into <output>/tfidf/vectors. This happens because the IDF calculation - if it is selected - happens after TF and uses the TF vectors for its input.

        Seems like both modes ought to output to a consistent directory structure so changing the switch does not change the final output location: perhaps as simple as changing TF to output to <output>/tf/vectors so that the contents of both directories when present are more obvious from their nomenclature.
        Jeff Eastman added a comment -

        Here's a very minimal fix that, imho, reduces the ambiguity and makes the contents of the vector directories much more obvious:

        DictionaryVectorizer {
            public static final String DOCUMENT_VECTOR_OUTPUT_FOLDER = "tf-vectors";
        }

        TFIDFConverter {
            private static final String DOCUMENT_VECTOR_OUTPUT_FOLDER = "tfidf-vectors";
        }

        Drew Farris added a comment -

        Jeff, I agree, it makes sense to make this a bit more consistent across outputs.

        After the minor changes you propose, the output produced by the reuters example when constructing tfidf vectors looks like this:

        .../reuters-out-seqdir-sparse/dictionary.file-0
        .../reuters-out-seqdir-sparse/tfidf
        .../reuters-out-seqdir-sparse/tfidf/frequency.file-0
        .../reuters-out-seqdir-sparse/tfidf/df-count
        .../reuters-out-seqdir-sparse/tfidf/df-count/part-00000
        .../reuters-out-seqdir-sparse/tfidf/tfidf-vectors
        .../reuters-out-seqdir-sparse/tfidf/tfidf-vectors/part-00000
        .../reuters-out-seqdir-sparse/tf-vectors
        .../reuters-out-seqdir-sparse/tf-vectors/part-00000
        .../reuters-out-seqdir-sparse/tokenized-documents
        .../reuters-out-seqdir-sparse/tokenized-documents/part-00000
        .../reuters-out-seqdir-sparse/wordcount
        .../reuters-out-seqdir-sparse/wordcount/part-00000
        

        How about we put the tfidf-vectors and tf-vectors output directories at the same level? It seems that putting frequency.file and dictionary.file at the same level might make some sense too. I know there's been some talk about standardizing input, working and output directory creation for jobs, but I haven't followed it – might that provide some suggestion for what to do here?

        Here's a patch that includes Jeff's changes and pushes the tfidf stuff up a level. The output is:

        reuters-out-seqdir-sparse/dictionary.file-0
        reuters-out-seqdir-sparse/frequency.file-0
        reuters-out-seqdir-sparse/tf-vectors
        reuters-out-seqdir-sparse/tf-vectors/part-00000
        reuters-out-seqdir-sparse/tokenized-documents
        reuters-out-seqdir-sparse/tokenized-documents/part-00000
        reuters-out-seqdir-sparse/df-count
        reuters-out-seqdir-sparse/df-count/part-00000
        reuters-out-seqdir-sparse/tfidf-vectors
        reuters-out-seqdir-sparse/tfidf-vectors/part-00000
        reuters-out-seqdir-sparse/wordcount
        reuters-out-seqdir-sparse/wordcount/part-00000
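
        With this flattened layout, locating the final vectors could look roughly like the sketch below. It is an illustration under assumptions: `FlatLayout` and `finalVectorDir` are hypothetical names and not Mahout APIs; only the folder names `tf-vectors` and `tfidf-vectors` and the example output directory come from this issue.

```java
// Hypothetical illustration; only the two folder names come from this issue.
class FlatLayout {
    static final String TF_FOLDER = "tf-vectors";
    static final String TFIDF_FOLDER = "tfidf-vectors";

    // Both vector folders sit directly under the output directory,
    // so the folder name alone says which weighting produced the vectors.
    static String finalVectorDir(String output, boolean tfidf) {
        return output + "/" + (tfidf ? TFIDF_FOLDER : TF_FOLDER);
    }

    public static void main(String[] args) {
        System.out.println(finalVectorDir("reuters-out-seqdir-sparse", false));
        System.out.println(finalVectorDir("reuters-out-seqdir-sparse", true));
    }
}
```

        The caller still has to know which weighting it wants, but the two results now sit side by side under self-describing names.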
        
        Drew Farris made changes -
        Attachment MAHOUT-398.patch [ 12445276 ]
        Drew Farris added a comment -

        Any objections to this? I would like to commit.

        Drew Farris made changes -
        Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Drew Farris added a comment -

        Committed revision 949649. Eliminated the separate tfidf directory for tfidf vector output.

        Drew Farris made changes -
        Assignee Drew Farris [ drew.farris ]
        Drew Farris made changes -
        Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
        Resolution: Fixed [ 1 ]
        Hudson added a comment -

        Integrated in Mahout-Quality #38 (See http://hudson.zones.apache.org/hudson/job/Mahout-Quality/38/)
        MAHOUT-398: eliminated separate tfidf directory for tfidf vector output.

        Sean Owen made changes -
        Status: Resolved [ 5 ] -> Closed [ 6 ]

          People

          • Assignee: Drew Farris
          • Reporter: Jeff Eastman
          • Votes: 0
          • Watchers: 0
