Mahout
  1. Mahout
  2. MAHOUT-973

SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.7
    • Component/s: None
    • Labels:
      None

      Description

      Although I'm using a little bit different the TFIDFConverter, but the problem will occur the same way with SparseVectorsFromSequenceFiles when somebody wants to create a TFIDF vectors for their documents.

      Basically if maxDFSigma is not set then because of SparseVectorsFromSequenceFiles.java:281
      long maxDF = maxDFPercent;

      maxDF will be 99. which is then passed to TFIDFConvert.processTfIdf function as an argument, where it is interpreted as "The max percentage of vectors for the DF." Partial vectors will be created with TFIDFPartialVectorReducer.class and because of TFIDFPartialVectorReducer.java:81 as maxDF = 99 if (df > maxDF) the term will be ignored.

      the problem here is that two different quantities are compared. df value is the number of documents which contains the given term, and it's not normalized by the document number, i.e. it's not a percentage! see TermDocumentCountReducer.java for details. while maxDF is interpreted as a percentage, see above. Thus, as soon as the df count gets higher than 99, or in the best case 100, meaning the given term occurs in more than 99 or 100 different documents, it'll be ignored... and this is not what we would like it to do.

      I.e. there's a bug in TFIDFPartialVectorReducer.java at line 81.

      I've attached a possible fix for this problem.

      the bug was introduced a61e5ff8 commit (git) or rev 1210994 in svn:
      @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
      continue;
      }
      long df = dictionary.get(e.index());

      • if (df * 100.0 / vectorCount > maxDfPercent) {
        + if (maxDf > -1 && df > maxDf) { continue; }

        if (df < minDf) {

        Activity

        Sean Owen made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1427 (See https://builds.apache.org/job/Mahout-Quality/1427/)
        MAHOUT-973 one more file needed for fix to compute maxDF as a percent of total count (Revision 1310357)

        Result = SUCCESS
        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310357
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1427 (See https://builds.apache.org/job/Mahout-Quality/1427/ ) MAHOUT-973 one more file needed for fix to compute maxDF as a percent of total count (Revision 1310357) Result = SUCCESS srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310357 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
        Hide
        Hudson added a comment -

        Integrated in Mahout-Quality #1426 (See https://builds.apache.org/job/Mahout-Quality/1426/)
        MAHOUT-973 fix treatment of value as percentage (Revision 1310302)

        Result = FAILURE
        srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310302
        Files :

        • /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java
        Show
        Hudson added a comment - Integrated in Mahout-Quality #1426 (See https://builds.apache.org/job/Mahout-Quality/1426/ ) MAHOUT-973 fix treatment of value as percentage (Revision 1310302) Result = FAILURE srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310302 Files : /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java
        Sean Owen made changes -
        Resolution Fixed [ 1 ]
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Grant Ingersoll [ gsingers ] Sean Owen [ srowen ]
        Hide
        Sean Owen added a comment -

        Grant never heard back here but I think it's right enough to fix this one way at least, so am committing.

        Show
        Sean Owen added a comment - Grant never heard back here but I think it's right enough to fix this one way at least, so am committing.
        Sean Owen made changes -
        Assignee Grant Ingersoll [ gsingers ]
        Fix Version/s 0.7 [ 12319261 ]
        Affects Version/s 0.7 [ 12319261 ]
        Hide
        Sean Owen added a comment -

        Grant that was your rev. He's right that it seems incorrect afterwards, but don't know whether the intent was to keep maxDf a percentage, or to make it an absolute value, since another part of the change removes "percentage" as part of the key under which this is transferred.

        Show
        Sean Owen added a comment - Grant that was your rev. He's right that it seems incorrect afterwards, but don't know whether the intent was to keep maxDf a percentage, or to make it an absolute value, since another part of the change removes "percentage" as part of the key under which this is transferred.
        Viktor Gal made changes -
        Affects Version/s 0.6 [ 12316364 ]
        Viktor Gal made changes -
        Attachment fix-TFIDFPartialVectorReducer.patch [ 12513442 ]
        Viktor Gal made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Viktor Gal made changes -
        Field Original Value New Value
        Status Open [ 1 ] Patch Available [ 10002 ]
        Viktor Gal created issue -

          People

          • Assignee:
            Sean Owen
            Reporter:
            Viktor Gal
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development