Mahout / MAHOUT-237

Map/Reduce Implementation of Document Vectorizer

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.3
    • Fix Version/s: 0.3
    • Component/s: None
    • Labels: None

    Description

      The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
      Ted is working on a hash-based Vectorizer that maps features into Vectors of fixed size and sums them to get the document Vector.
      This issue proposes a pure bag-of-words Vectorizer written in Map/Reduce.

      The input documents are stored in a SequenceFile<Text,Text>, with key = docid and value = content.
      First: a Map/Reduce pass over the document collection generates the feature counts.
      Second: a sequential pass reads the output of that Map/Reduce and converts it to a SequenceFile<Text, LongWritable>, where key = feature and value = unique id.
      The second stage should also split the features into shards of a given split size.
      Third: a Map/Reduce pass over the document collection, run once per shard, creates partial SparseVectors (each containing only the features of the given shard).
      Fourth: a Map/Reduce pass over the partial vectors groups them by docid and creates the full document Vector.
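
      The four stages above can be illustrated with a small in-memory sketch. This is not the code from the attached patches: it uses plain Java collections in place of Hadoop jobs and SequenceFiles, int ids instead of LongWritable, and a whitespace tokenizer. The class name DictionaryVectorizerSketch and all method names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DictionaryVectorizerSketch {

  // Assumed tokenizer: lowercase, split on whitespace. The issue does not specify one.
  public static String[] tokenize(String content) {
    String trimmed = content.trim().toLowerCase();
    return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  // Stage 1: count every feature across the whole collection (docid -> content).
  public static Map<String, Long> countFeatures(Map<String, String> docs) {
    Map<String, Long> counts = new TreeMap<>();
    for (String content : docs.values()) {
      for (String token : tokenize(content)) {
        counts.merge(token, 1L, Long::sum);
      }
    }
    return counts;
  }

  // Stage 2: sequential pass that assigns a unique id to each feature
  // and cuts the dictionary into shards of at most shardSize features.
  public static List<Map<String, Integer>> buildShards(Map<String, Long> counts, int shardSize) {
    if (shardSize < 1) {
      throw new IllegalArgumentException("shardSize must be at least 1");
    }
    List<Map<String, Integer>> shards = new ArrayList<>();
    Map<String, Integer> current = new LinkedHashMap<>();
    int nextId = 0;
    for (String feature : counts.keySet()) {
      if (current.size() == shardSize) {
        shards.add(current);
        current = new LinkedHashMap<>();
      }
      current.put(feature, nextId++);
    }
    if (!current.isEmpty()) {
      shards.add(current);
    }
    return shards;
  }

  // Stage 3: for one shard, build a partial sparse vector (id -> term count) per document.
  public static Map<String, Map<Integer, Double>> partialVectors(
      Map<String, String> docs, Map<String, Integer> shard) {
    Map<String, Map<Integer, Double>> partials = new TreeMap<>();
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      Map<Integer, Double> vector = new TreeMap<>();
      for (String token : tokenize(doc.getValue())) {
        Integer id = shard.get(token);
        if (id != null) {
          vector.merge(id, 1.0, Double::sum);
        }
      }
      partials.put(doc.getKey(), vector);
    }
    return partials;
  }

  // Stage 4: group the partial vectors by docid and merge them into full vectors.
  // Shards hold disjoint ids, so putAll never overwrites an existing entry.
  public static Map<String, Map<Integer, Double>> vectorize(Map<String, String> docs, int shardSize) {
    Map<String, Map<Integer, Double>> full = new TreeMap<>();
    for (Map<String, Integer> shard : buildShards(countFeatures(docs), shardSize)) {
      for (Map.Entry<String, Map<Integer, Double>> partial : partialVectors(docs, shard).entrySet()) {
        full.computeIfAbsent(partial.getKey(), k -> new TreeMap<>()).putAll(partial.getValue());
      }
    }
    return full;
  }

  public static void main(String[] args) {
    Map<String, String> docs = new LinkedHashMap<>();
    docs.put("doc1", "the quick brown fox");
    docs.put("doc2", "the lazy dog the end");
    System.out.println(vectorize(docs, 2));
  }
}
```

      In this sketch the shard size only bounds how much of the dictionary is held in memory while scanning the documents. The merged vectors are the same for any shard size, which is the property the third and fourth stages rely on.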

      Attachments

        1. DictionaryVectorizer.patch
          64 kB
          Robin Anil
        2. DictionaryVectorizer.patch
          50 kB
          Robin Anil
        3. DictionaryVectorizer.patch
          49 kB
          Robin Anil
        4. DictionaryVectorizer.patch
          49 kB
          Robin Anil
        5. DictionaryVectorizer.patch
          24 kB
          Robin Anil
        6. MAHOUT-237-tfidf.patch
          95 kB
          Robin Anil
        7. MAHOUT-237-tfidf.patch
          36 kB
          Robin Anil
        8. SparseVector-VIntWritable.patch
          2 kB
          Robin Anil

        Activity

          People

            Assignee: Robin Anil
            Reporter: Robin Anil
            Votes: 0
            Watchers: 0
