Details
Documentation
Status: Closed
Major
Resolution: Fixed
1.9.3
N/A
Important
Description
In https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api, the code example passes a String to DocumentCategorizerME.categorize(). The method itself takes an array. I flagged the priority as Major because this one was a killer. The bug is obvious as soon as you try the example, but I then made the mistake of assuming that the array should be an array of documents. Instead it needs to be an array of tokens from a single document, i.e. one needs to split() the doc on whitespace. I lost 24 hours experimenting with algorithms (maxent vs. naive_bayes) and params (cutoff, iterations, etc.) before figuring this one out.
Current (wrong) version:
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
Should be more like:
String inputText = ... // sanitized document to be categorized
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" "));
String category = myCategorizer.getBestCategory(outcomes);
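One caveat on the split itself: split(" ") breaks only on single spaces, so double spaces yield empty tokens and tabs or newlines stay glued to their neighbours. Below is a minimal sketch of a more forgiving tokenization step using only the JDK. The class name TokenizeForDoccat and the helper tokenize() are hypothetical names made up for this illustration; they are not part of OpenNLP. The resulting array is what I would expect to pass to categorize(), but the categorizer call is left as a comment since it needs a trained model.

```java
import java.util.Arrays;

public class TokenizeForDoccat {

    // Hypothetical helper: turns ONE document into an array of tokens,
    // splitting on any run of whitespace (spaces, tabs, newlines).
    static String[] tokenize(String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            // "".split("\\s+") would return one empty token, so guard against it.
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String inputText = "The  quick brown\tfox\njumps";

        String[] tokens = tokenize(inputText);
        System.out.println(Arrays.toString(tokens));
        // prints [The, quick, brown, fox, jumps]

        // With a trained model loaded, the tokens of this single document
        // would then be handed to the categorizer, e.g.:
        //   double[] outcomes = myCategorizer.categorize(tokens);
        //   String category = myCategorizer.getBestCategory(outcomes);
    }
}
```

Whatever tokenization is used here should presumably match the one used when building the training samples, otherwise the features seen at classification time will differ from those the model was trained on.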