Uploaded image for project: 'Mahout'
  1. Mahout
  2. MAHOUT-957

term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6
    • Fix Version/s: 0.6
    • Component/s: Clustering
    • Labels:
      None

      Description

      The SparseVectorsFromSequenceFiles throws an exception when you want term frequency vectors output, with the maxDFSigma filtering option.

      Basically the if / else if section shown below, will skip calling DictionaryVectorizer.createTermFrequencyVectors when have that combination. The condition will create vectors when you want tf vectors without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering, but if you want tf vectors with maxDFSigma filtering, it totally skips over the call to createTermFrequencyVectors, and later on throws an exception because the vector input path doesn't exist.

      For example, the following cmd line will reproduce this situation:
      bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2 --minDF 2 --maxDFSigma 3 -seq

      //the suspect code at line ~267 in DictionaryVectorizer.createTermFrequencyVectors
      if (!processIdf && !shouldPrune) {
      DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, norm, logNormalize, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
      } else if (processIdf) {
      DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
      }

        Attachments

        1. MAHOUT-957.patch
          6 kB
          Grant Ingersoll

          Activity

            People

            • Assignee:
              gsingers Grant Ingersoll
              Reporter:
              jconwell John Conwell
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: