[OPENNLP-1307] Incorrect code example for Document Categorization (9.3) - ASF JIRA

Details

Type: Documentation
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.9.3
Fix Version/s: 2.1.1
Component/s: Doccat
Labels:
- DocumentCategorizerME
- documentation
Environment:
N/A

Flags:

Important

Description

In https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api, the code example passes a String to DocumentCategorizerME.categorize(), but the method takes an array. I flagged the priority as Major because this was a killer. It is obviously a self-documenting bug when you run it, but I made the mistake of assuming that the required array would be an array of documents. Instead, it needs to be an array of tokens from a single document, i.e. one needs to split() the document on whitespace. I lost 24 hours experimenting with algorithms (maxent vs. naive_bayes) and parameters (cutoff, iterations, etc.) before figuring this out.

Current (wrong) version:

String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);

Should be more like:

String inputText = ... // sanitized document to be categorized
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" ")); // pass the tokens of one document, not the raw String
String category = myCategorizer.getBestCategory(outcomes);
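
One caveat with split(" ") is that leading or repeated spaces produce empty tokens, and tabs or newlines are not treated as separators. The sketch below shows one way to build the token array using only the JDK. The class DoccatTokens and its tokenize helper are hypothetical names introduced for illustration and are not part of OpenNLP; the resulting array is what would be handed to categorize(). OpenNLP's own WhitespaceTokenizer.INSTANCE.tokenize(...) is probably the more natural choice when the library is already on the classpath.

```java
import java.util.Arrays;

class DoccatTokens {

    // Hypothetical helper (not part of OpenNLP): turns ONE document into the
    // token array that DocumentCategorizerME.categorize(String[]) expects.
    static String[] tokenize(String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") would return a single empty token, so handle it here.
            return new String[0];
        }
        // Split on any run of whitespace (spaces, tabs, newlines).
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String inputText = "  free   money\tnow ";

        // Naive split: contains empty strings and leaves the tab inside a token.
        System.out.println(Arrays.toString(inputText.split(" ")));

        // Whitespace-aware split: prints [free, money, now]
        System.out.println(Arrays.toString(tokenize(inputText)));

        // With a loaded model m (assumed, as in the snippets above):
        // double[] outcomes = myCategorizer.categorize(tokenize(inputText));
    }
}
```

Whichever tokenization is used here should match the one applied to the training data, otherwise the features seen at classification time may differ from those the model was trained on.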

People

Assignee: Martin Wiesner

Reporter: John Slocum

Votes: 0

Watchers: 2

Dates

Created: 25/Aug/20 18:10

Updated: 10/Dec/22 14:20

Resolved: 10/Dec/22 14:20

Time Tracking

Estimated: Not Specified
Remaining: Not Specified
Logged: Not Specified