[MAHOUT-242] LLR Collocation Identifier - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.3
Fix Version/s: 0.3
Component/s: None
Labels: None

Description

Identifies interesting Collocations in text using ngrams scored via the LogLikelihoodRatio calculation.
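For readers unfamiliar with the scoring step, the following is a minimal sketch of a log-likelihood ratio over a 2x2 contingency table, written in the entropy form commonly used for Dunning's test. It is not taken from the attached patch: the class name `LlrSketch` and the mapping of bigram counts onto the table (k11 = count of "A B", k12 = count of A followed by something other than B, k21 = count of B preceded by something other than A, k22 = all remaining bigrams) are assumptions made for illustration.

```java
// Hypothetical illustration; not the code in the attached patch.
final class LlrSketch {
    private LlrSketch() {}

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a set of counts: N ln N - sum(c ln c).
    static double entropy(long... counts) {
        long sum = 0;
        double sumXLogX = 0.0;
        for (long c : counts) {
            sumXLogX += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - sumXLogX;
    }

    // LLR = 2 * (H(rows) + H(columns) - H(matrix)) for the 2x2 table
    // [k11 k12; k21 k22]. A value near 0 means the two words look independent.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against floating-point round-off
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }
}
```

Under these assumptions, a bigram whose words almost always occur together yields a large score, while a table with evenly spread counts scores close to zero.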

As discussed in:

The current form is a tarball of a Maven project that depends on Mahout. Build it as usual with 'mvn clean install'; it can then be executed using:

mvn -e exec:java  -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"

Output is written to target/output and can be viewed in descending numeric order of the first field using:

sort -rn -k1 target/output/part-00000

Includes rudimentary unit tests. Please review and comment. More work is needed to get this into patch state and to integrate it with Robin's document vectorizer work in MAHOUT-237.

Some basic TODOs/FIXMEs include:

- Use Mahout math's ObjectInt map implementation when available.
- Make the analyzer configurable.
- Better input validation and negative unit tests.
- More flexible ways to generate the units of analysis, the (n-1)grams.
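To make the last item concrete, here is a small sketch of generating n-grams from a token list and splitting each into its leading and trailing (n-1)-gram. The class `NGramSketch`, the method names, and the reading of "(n-1)grams" as the head and tail of each n-gram are assumptions for illustration only; the patch may define its units of analysis differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the code in the attached patch.
final class NGramSketch {
    private NGramSketch() {}

    // All contiguous n-grams of the token list, tokens joined by one space.
    static List<String> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return out;
    }

    // Leading (n-1)-gram: every token of the n-gram except the last.
    static String head(List<String> ngram) {
        return String.join(" ", ngram.subList(0, ngram.size() - 1));
    }

    // Trailing (n-1)-gram: every token of the n-gram except the first.
    static String tail(List<String> ngram) {
        return String.join(" ", ngram.subList(1, ngram.size()));
    }
}
```

Counting how often each n-gram, head, and tail occurs would give the frequencies needed to score a candidate collocation.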

Attachments

- mahout-colloc.tar.gz, 11/Jan/10 04:36, 12 kB, Drew Farris
- mahout-colloc.tar.gz, 14/Jan/10 05:54, 14 kB, Drew Farris
- MAHOUT-242.patch, 16/Jan/10 20:50, 51 kB, Drew Farris
- MAHOUT-242.patch, 22/Jan/10 21:58, 52 kB, Drew Farris
- MAHOUT-242.patch, 03/Feb/10 03:11, 49 kB, Drew Farris
- MAHOUT-242.patch, 08/Feb/10 15:11, 52 kB, Drew Farris
- MAHOUT-242.patch, 09/Feb/10 04:25, 52 kB, Drew Farris

People

Assignee: Drew Farris
Reporter: Drew Farris
Votes: 0
Watchers: 0

Dates

Created: 11/Jan/10 04:35
Updated: 11/Mar/10 02:14
Resolved: 09/Feb/10 17:03