[MAHOUT-1598] extend seq2sparse to handle multiple text blocks of same document - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9
Fix Version/s: 0.10.0
Component/s: None
Labels:
- legacy

Description

Currently seq2sparse, or more precisely org.apache.mahout.vectorizer.DictionaryVectorizer, requires exactly one text block per document as input.

I stumbled on this because I have a use case where one document represents a ticket, which can contain several text blocks in different languages.

My idea was that org.apache.mahout.vectorizer.DocumentProcessor should tokenize each text block on its own, so that I can use language-specific features in our Lucene Analyzer.

Unfortunately, the current implementation doesn't support this, but only minor changes are needed to make it possible.

The only thing that has to change is org.apache.mahout.vectorizer.term.TFPartialVectorReducer, which must handle all values of the iterable, not just the first one.
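To illustrate the idea, here is a minimal sketch in plain Java, deliberately written without the Hadoop or Mahout APIs. The class `MultiBlockTermCounter` and the method `countTerms` are hypothetical names made up for this example and are not part of Mahout. The sketch assumes each text block of a document has already been tokenized into a list of terms, and it counts term frequencies over every block in the iterable instead of stopping after the first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiBlockTermCounter {

    // Hypothetical helper, not Mahout code: counts term frequencies across
    // all tokenized text blocks that belong to the same document key.
    public static Map<String, Integer> countTerms(Iterable<List<String>> tokenizedBlocks) {
        Map<String, Integer> termFrequencies = new HashMap<>();
        // Iterate over every block, not only the first element of the iterable.
        for (List<String> block : tokenizedBlocks) {
            for (String term : block) {
                // Add 1 to the running count, starting at 1 for unseen terms.
                termFrequencies.merge(term, 1, Integer::sum);
            }
        }
        return termFrequencies;
    }

    public static void main(String[] args) {
        // One ticket with two text blocks in different languages.
        List<List<String>> ticket = Arrays.asList(
                Arrays.asList("printer", "broken"),
                Arrays.asList("drucker", "defekt", "printer"));
        System.out.println(countTerms(ticket));
    }
}
```

In a real reducer the same loop would run over the `Iterable` of values passed to `reduce`, with the per-term counts written into the partial term-frequency vector for that document.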

An alternative would be to change this Reducer into a Mapper. I don't understand why it was implemented as a reducer in the first place. Is there any benefit to this?

I will provide a PR via GitHub.

Please have a look at this and tell me if any of my assumptions are wrong.

People

Assignee: Andrew Musselman

Reporter: Wolfgang Buchner

Votes: 1

Watchers: 5

Dates

Created: 25/Jul/14 11:28

Updated: 13/Apr/15 09:57

Resolved: 30/Mar/15 17:28