Mahout / MAHOUT-237

Map/Reduce Implementation of Document Vectorizer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.3
    • Component/s: None
    • Labels: None

      Description

      The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
      Ted is working on a hash-based Vectorizer, which maps features into vectors of a fixed size and sums them to get the document vector.
      This issue proposes a pure bag-of-words Vectorizer written in Map/Reduce.

      The input documents are in a SequenceFile<Text,Text>, with key = docid and value = content.
      First: a Map/Reduce pass over the document collection generates the feature counts.
      Second: a sequential pass reads the output of that Map/Reduce job and converts it to a SequenceFile<Text,LongWritable>, where key = feature and value = unique id.
      The second stage should also split the features into shards of a given size.
      Third: a Map/Reduce pass over the document collection, run once per shard, creates partial SparseVectors (each containing only the features of the given shard).
      Fourth: a Map/Reduce pass over the partial vectors groups them by docid and creates the full document vector.
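
The four stages above can be sketched in plain Java, without Hadoop, to show how the data flows. This is only an illustration under several assumptions and is not the code in the attached patches: in-memory maps stand in for the SequenceFiles and Map/Reduce jobs, the whitespace tokenizer and raw term-frequency weights are placeholders, and every class and method name (DictionaryVectorizerSketch, countFeatures, buildShards, partialVectors, mergePartials, vectorize) is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the four-stage pipeline; not Mahout code.
class DictionaryVectorizerSketch {

  // Assumed tokenizer: lowercase and split on whitespace.
  static List<String> tokenize(String content) {
    List<String> tokens = new ArrayList<>();
    for (String t : content.toLowerCase().split("\\s+")) {
      if (!t.isEmpty()) tokens.add(t);
    }
    return tokens;
  }

  // Stage 1: count every feature across the document collection.
  static Map<String, Long> countFeatures(Map<String, String> docs) {
    Map<String, Long> counts = new TreeMap<>();
    for (String content : docs.values()) {
      for (String t : tokenize(content)) counts.merge(t, 1L, Long::sum);
    }
    return counts;
  }

  // Stage 2: sequential pass that assigns a unique id to each feature
  // and cuts the dictionary into shards of at most shardSize features.
  static List<Map<String, Long>> buildShards(Map<String, Long> counts, int shardSize) {
    List<Map<String, Long>> shards = new ArrayList<>();
    Map<String, Long> current = new LinkedHashMap<>();
    long nextId = 0;
    for (String feature : counts.keySet()) {
      if (current.size() == shardSize) {
        shards.add(current);
        current = new LinkedHashMap<>();
      }
      current.put(feature, nextId++);
    }
    if (!current.isEmpty()) shards.add(current);
    return shards;
  }

  // Stage 3: for one shard, build a partial sparse vector per document
  // (feature id -> term frequency), using only that shard's features.
  static Map<String, Map<Long, Double>> partialVectors(
      Map<String, String> docs, Map<String, Long> shard) {
    Map<String, Map<Long, Double>> out = new TreeMap<>();
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      Map<Long, Double> vec = new TreeMap<>();
      for (String t : tokenize(doc.getValue())) {
        Long id = shard.get(t);
        if (id != null) vec.merge(id, 1.0, Double::sum);
      }
      if (!vec.isEmpty()) out.put(doc.getKey(), vec);
    }
    return out;
  }

  // Stage 4: group partial vectors by docid and merge them.
  // Shards hold disjoint ids, so putAll never overwrites an entry.
  static Map<String, Map<Long, Double>> mergePartials(
      List<Map<String, Map<Long, Double>>> partials) {
    Map<String, Map<Long, Double>> full = new TreeMap<>();
    for (Map<String, Map<Long, Double>> partial : partials) {
      for (Map.Entry<String, Map<Long, Double>> e : partial.entrySet()) {
        full.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).putAll(e.getValue());
      }
    }
    return full;
  }

  // Runs all four stages in order.
  static Map<String, Map<Long, Double>> vectorize(Map<String, String> docs, int shardSize) {
    List<Map<String, Map<Long, Double>>> partials = new ArrayList<>();
    for (Map<String, Long> shard : buildShards(countFeatures(docs), shardSize)) {
      partials.add(partialVectors(docs, shard));
    }
    return mergePartials(partials);
  }
}
```

Calling vectorize(docs, shardSize) with a docid-to-content map returns one sparse vector per docid. Sharding presumably exists so that each third-stage task only has to hold one slice of the dictionary in memory; the sketch keeps that shape even though everything here fits in memory anyway.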

        Attachments

        1. DictionaryVectorizer.patch
          64 kB
          Robin Anil
        2. DictionaryVectorizer.patch
          50 kB
          Robin Anil
        3. DictionaryVectorizer.patch
          49 kB
          Robin Anil
        4. DictionaryVectorizer.patch
          49 kB
          Robin Anil
        5. DictionaryVectorizer.patch
          24 kB
          Robin Anil
        6. MAHOUT-237-tfidf.patch
          95 kB
          Robin Anil
        7. MAHOUT-237-tfidf.patch
          36 kB
          Robin Anil
        8. SparseVector-VIntWritable.patch
          2 kB
          Robin Anil

            People

            • Assignee: Robin Anil (robinanil)
            • Reporter: Robin Anil (robinanil)