Identifies interesting collocations in text using n-grams scored via the LogLikelihoodRatio calculation.
As discussed in:
The current form is a tarball of a Maven project that depends on Mahout. Build as usual with 'mvn clean install'; it can then be executed using:
Output will be placed in target/output and can be viewed nicely using:
Includes rudimentary unit tests. Please review and comment. More work is needed to get this into patch state and to integrate it with Robin's document vectorizer work in
Some basic TODOs/FIXMEs include:
- use Mahout math's ObjectInt map implementation when available
- make the analyzer configurable
- better input validation and negative unit tests
- more flexible ways to generate units of analysis (n-1)grams
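
For readers unfamiliar with the scoring step mentioned at the top, here is a minimal standalone sketch of a log-likelihood ratio score for a bigram, computed from a 2x2 contingency table in the entropy form. It is not the code in the attached tarball: the class name CollocationLlrSketch and the helper scoreBigram are hypothetical, and the mapping from raw counts to table cells is an assumption about how such counts might be arranged.

```java
// Standalone sketch only; names here are hypothetical, not from the attachment.
public final class CollocationLlrSketch {

    private CollocationLlrSketch() {}

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // k11: A and B together, k12: A without B,
    // k21: B without A,      k22: neither A nor B.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed count layout: pairCount = occurrences of the bigram "A B",
    // headCount = bigrams starting with A, tailCount = bigrams ending with B,
    // totalBigrams = all bigrams seen in the corpus.
    static double scoreBigram(long pairCount, long headCount,
                              long tailCount, long totalBigrams) {
        long k11 = pairCount;
        long k12 = headCount - pairCount;
        long k21 = tailCount - pairCount;
        long k22 = totalBigrams - headCount - tailCount + pairCount;
        return logLikelihoodRatio(k11, k12, k21, k22);
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration.
        System.out.println(scoreBigram(30, 40, 50, 10_000));
    }
}
```

A higher score means the two words co-occur more often than their individual frequencies would predict; a table whose rows and columns are independent scores approximately zero.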