
Incorrect code example for Document Categorization (9.3)


Details

    • Important

    Description

      In https://opennlp.apache.org/docs/1.9.3/manual/opennlp.html#tools.doccat.classifying.api, the code example passes a String to DocumentCategorizerME.categorize(), but the method takes an array.

      I flagged the priority as Major because this one was a killer. The bug is self-documenting once you run the code, but I made the mistake of assuming that the array should be an array of documents. Instead, it must be an array of tokens from a single document, i.e. one needs to split() the document on whitespace. I lost 24 hours experimenting with algorithms (maxent vs. naive_bayes) and parameters (cutoff, iterations, etc.) before figuring this out.

       

      Current (wrong) version:

       

      String inputText = ...
      DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
      double[] outcomes = myCategorizer.categorize(inputText);
      String category = myCategorizer.getBestCategory(outcomes);
      

       

      Should be more like:

       

      String inputText = ... // sanitized document to be categorized
      DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
      // categorize() expects the tokens of a single document, not the raw String
      double[] outcomes = myCategorizer.categorize(inputText.split(" "));
      String category = myCategorizer.getBestCategory(outcomes);
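
      One caveat with split(" "): it breaks on single space characters only, so repeated spaces produce empty tokens, and tabs or newlines stay inside tokens. A minimal sketch of a more forgiving split, using only the Java standard library, might look like the following. The class DoccatTokens and the method toTokens are hypothetical names for illustration and are not part of OpenNLP; the resulting array is what would be handed to categorize() in the snippet above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of OpenNLP.
final class DoccatTokens {
    private DoccatTokens() {}

    // Splits one document into tokens on any run of whitespace.
    // Assumption: plain whitespace tokenization is adequate for the model,
    // i.e. it matches how the training data was tokenized.
    static String[] toTokens(String document) {
        String trimmed = document.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // no tokens for a blank document
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String inputText = "free   money\tnow";
        // Naive split: yields empty strings and keeps the tab inside a token.
        System.out.println(Arrays.toString(inputText.split(" ")));
        // Whitespace-run split: one entry per word.
        System.out.println(Arrays.toString(toTokens(inputText)));
    }
}
```

      Whichever tokenization is used at classification time should match the one used when the model was trained; otherwise the features seen by the categorizer will differ from those it learned.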
      

       


          People

            mawiesne Martin Wiesner
            Orolo John Slocum

              Time Tracking

                Original Estimate: 2m
                Remaining Estimate: 2m
                Time Spent: Not Specified