[MAHOUT-126] Prepare document vectors from the text - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.2
Fix Version/s: 0.2
Component/s: None
Labels: None

Description

Clustering algorithms currently take document vectors as input. Generating these vectors from text can be broken into two tasks:

1. Create a Lucene index of the input plain-text documents.
2. From the index, generate sparse document vectors whose weights are the TF-IDF values of the terms. With a Lucene index, these values are easy to compute.
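
The two tasks above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from the attached patches: a toy in-memory inverted index stands in for the Lucene index, the class and method names (`TfIdfSketch`, `buildIndex`, `toVectors`) are hypothetical, and the weight is assumed to be `tf * ln(N / df)`, which is not the formula Lucene uses for scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the two utilities; not the code from the patches. */
final class TfIdfSketch {

    /** Task 1 stand-in: term -> (document id -> term frequency). */
    static Map<String, Map<Integer, Integer>> buildIndex(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Naive tokenization: lowercase, then split on anything that is
            // not an ASCII letter or digit.
            String text = docs.get(docId).toLowerCase(Locale.ROOT);
            for (String token : text.split("[^a-z0-9]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(token, t -> new HashMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    /** Task 2: one sparse vector per document, term index -> TF-IDF weight. */
    static List<Map<Integer, Double>> toVectors(
            Map<String, Map<Integer, Integer>> index, int numDocs) {
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) {
            vectors.add(new TreeMap<>());
        }
        int termIndex = 0; // terms are numbered in the index's sorted order
        for (Map<Integer, Integer> postings : index.values()) {
            // Assumed weighting: tf * ln(N / df), where df is the number of
            // documents containing the term.
            double idf = Math.log((double) numDocs / postings.size());
            if (idf != 0.0) {
                // A term present in every document has weight 0 and is
                // left out to keep the vectors sparse.
                for (Map.Entry<Integer, Integer> p : postings.entrySet()) {
                    vectors.get(p.getKey()).put(termIndex, p.getValue() * idf);
                }
            }
            termIndex++;
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat", "the dog sat down", "a cat and a dog");
        System.out.println(toVectors(buildIndex(docs), docs.size()));
    }
}
```

Keeping the index step separate from the vector step mirrors the split described above: the document frequency of each term is known once the index is complete, so the second pass needs only the index and the document count.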

So far I have written two separate utilities, which could be invoked from another class.

Attachments

File | Uploaded | Size | Author
MAHOUT-126.patch | 29/May/09 08:08 | 10 kB | Shashikant Kore
mahout-126-benson.patch | 29/May/09 12:27 | 11 kB | Benson Margulies
MAHOUT-126.patch | 09/Jun/09 16:34 | 41 kB | Grant Ingersoll
MAHOUT-126.patch | 16/Jun/09 19:23 | 50 kB | Grant Ingersoll
MAHOUT-126.patch | 16/Jun/09 21:37 | 41 kB | Grant Ingersoll
MAHOUT-126-no-normalization.patch | 17/Jun/09 20:32 | 2 kB | David Leo Wright Hall
MAHOUT-126-no-normalization.patch | 17/Jun/09 21:08 | 1 kB | David Leo Wright Hall
MAHOUT-126-TF.patch | 17/Jun/09 21:30 | 4 kB | David Leo Wright Hall
MAHOUT-126-null-entry.patch | 18/Jun/09 05:22 | 0.8 kB | David Leo Wright Hall

Issue Links

is blocked by: MAHOUT-65 Add Element Labels to Vectors and Matrices (Closed)

relates to: MAHOUT-61 Text problem matrix builder (Closed)

People

Assignee: Grant Ingersoll
Reporter: Shashikant Kore
Votes: 0
Watchers: 1

Dates

Created: 29/May/09 07:51
Updated: 18/Nov/09 14:05
Resolved: 29/Jun/09 12:45