[MAHOUT-237] Map/Reduce Implementation of Document Vectorizer - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.3
Fix Version/s: 0.3
Component/s: None
Labels: None

Description

The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
Ted is working on a hash-based Vectorizer, which maps features into vectors of a fixed size and sums them to get the document vector.
The Vectorizer in this issue is a pure bag-of-words Vectorizer written in Map/Reduce.
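
For contrast with the dictionary approach, here is a minimal sketch of the hash-based idea mentioned above: each feature is hashed into a slot of a fixed-size vector, and the slots are summed to give the document vector. This is not Ted's implementation; the function name `hashed_vector`, the whitespace tokenizer, the CRC32 hash, and the default size are all assumptions made for illustration.

```python
import zlib


def hashed_vector(text, size=16):
    """Map each token to one of `size` slots by hashing, and sum the counts.

    Hypothetical helper: the tokenizer (lowercase + whitespace split) and the
    hash function (CRC32) are illustrative choices, not taken from the issue.
    """
    vector = [0.0] * size
    for token in text.lower().split():
        # A stable hash keeps the slot for a token the same across runs.
        index = zlib.crc32(token.encode("utf-8")) % size
        vector[index] += 1.0
    return vector
```

No dictionary is needed because the slot comes straight from the hash, but distinct features can collide in the same slot. The dictionary-based Vectorizer described in this issue avoids collisions by assigning each feature a unique id.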

The input documents are in a SequenceFile<Text,Text>, with key = docid and value = content.
First, a Map/Reduce pass over the document collection generates the feature counts.
Second, a sequential pass reads the output of that Map/Reduce and converts it to a SequenceFile<Text, LongWritable>, where key = feature and value = unique id.
The second stage should also create shards of features of a given split size.
Third, a Map/Reduce pass over the document collection, using each shard, creates partial SparseVectors (containing only the features of the given shard).
Fourth, a Map/Reduce pass over the partial vectors groups them by docid and creates the full document vector.
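
The four stages above can be sketched in memory as follows. This is an illustrative Python model, not the code in the attached patches: plain dicts stand in for SequenceFiles, the split size is assumed to be a number of features per shard, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def tokenize(content):
    # Assumed tokenizer: lowercase and split on whitespace.
    return content.lower().split()


def count_features(docs):
    """Stage 1: count every feature across the document collection."""
    counts = Counter()
    for content in docs.values():
        counts.update(tokenize(content))
    return counts


def build_dictionary_shards(counts, shard_size):
    """Stage 2: assign each feature a unique id and split into shards."""
    shards, current = [], {}
    for feature_id, feature in enumerate(sorted(counts)):
        current[feature] = feature_id
        if len(current) == shard_size:
            shards.append(current)
            current = {}
    if current:
        shards.append(current)
    return shards


def partial_vectors(docs, shard):
    """Stage 3: per document, keep only the features found in this shard."""
    for docid, content in docs.items():
        partial = Counter()
        for token in tokenize(content):
            if token in shard:
                partial[shard[token]] += 1
        if partial:
            yield docid, dict(partial)


def merge_partials(partials):
    """Stage 4: group partial vectors by docid into full sparse vectors."""
    full = defaultdict(dict)
    for docid, partial in partials:
        # Shards hold disjoint feature ids, so updates never overwrite.
        full[docid].update(partial)
    return dict(full)


def vectorize(docs, shard_size=2):
    counts = count_features(docs)
    shards = build_dictionary_shards(counts, shard_size)
    partials = (pair for shard in shards for pair in partial_vectors(docs, shard))
    return shards, merge_partials(partials)
```

Sharding the dictionary means that each pass in the third stage only needs one shard of the feature-to-id mapping in memory at a time, which is the apparent reason for the split size.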

Attachments

DictionaryVectorizer.patch, 12/Jan/10 20:15, 64 kB, Robin Anil
DictionaryVectorizer.patch, 11/Jan/10 16:37, 50 kB, Robin Anil
DictionaryVectorizer.patch, 10/Jan/10 12:09, 49 kB, Robin Anil
DictionaryVectorizer.patch, 10/Jan/10 07:20, 49 kB, Robin Anil
DictionaryVectorizer.patch, 05/Jan/10 02:47, 24 kB, Robin Anil
MAHOUT-237-tfidf.patch, 05/Feb/10 09:04, 95 kB, Robin Anil
MAHOUT-237-tfidf.patch, 02/Feb/10 21:03, 36 kB, Robin Anil
SparseVector-VIntWritable.patch, 11/Jan/10 16:37, 2 kB, Robin Anil

People

Assignee: Robin Anil

Reporter: Robin Anil

Votes: 0

Watchers: 0

Dates

Created: 05/Jan/10 02:45

Updated: 21/May/11 03:23

Resolved: 05/Feb/10 09:31