OpenNLP / OPENNLP-711

SentenceDetectorME::sentPosDetect() with useTokenEnd=false

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • 1.6.0
    • 1.6.0
    • Sentence Detector
    • None

    Description

      I trained the SentenceModel with a German corpus and was surprised by the results for the following input (the marks indicate the expected splits):

      "I am hungry.Ich bin Mr. Bean.Ein guter Satz."
                   ^                ^
      

      The result was 3 sentences, which is correct, but the splits were not at the eosChar. Each split came after the whole token containing the eosChar: "I am hungry.Ich", "bin Mr. Bean.Ein", ...

      After some debugging I found that I have to set useTokenEnd=false in the SentenceDetectorFactory constructor.
      That led me to a small bug in SentenceDetectorME, where the split position is calculated:

        public Span[] sentPosDetect(String s) {
      ...
            if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
              if (index != cint) {
                if (useTokenEnd) {
                  positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
                }
                else {
                  positions.add(getFirstNonWS(s,cint)); // bug: should be positions.add(getFirstNonWS(s,cint + 1));
                }
                sentProbs.add(probs[model.getIndex(bestOutcome)]);
              }
              index = cint + 1;
            }
      ...
      

      This change only affects models with useTokenEnd=false.
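
The three candidate positions can be compared on the example input with a minimal sketch. It is not the OpenNLP source: the class `SplitPositionSketch` and its method names are hypothetical, and `getFirstWS` / `getFirstNonWS` are stand-ins that assume the library helpers simply skip forward to the next whitespace or non-whitespace character. `cint` is taken to be the index of the first eosChar.

```java
// Sketch only: stand-in helpers written for illustration, not the OpenNLP code.
final class SplitPositionSketch {

    // Assumed behaviour: advance from pos to the first whitespace character.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Assumed behaviour: advance from pos to the first non-whitespace character.
    static int getFirstNonWS(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // useTokenEnd=true: split after the whole token that contains the eosChar.
    static int tokenEndSplit(String s, int cint) {
        return getFirstNonWS(s, getFirstWS(s, cint + 1));
    }

    // useTokenEnd=false as reported: the eosChar is not whitespace,
    // so the position stays on the eosChar itself.
    static int reportedSplit(String s, int cint) {
        return getFirstNonWS(s, cint);
    }

    // useTokenEnd=false with the proposed change: start looking after the eosChar.
    static int proposedSplit(String s, int cint) {
        return getFirstNonWS(s, cint + 1);
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // first end-of-sentence candidate
        System.out.println("useTokenEnd=true : " + tokenEndSplit(s, cint));
        System.out.println("reported         : " + reportedSplit(s, cint));
        System.out.println("proposed         : " + proposedSplit(s, cint));
    }
}
```

Under these assumptions the first eosChar sits at index 11. The useTokenEnd=true branch gives 16 (the "b" of "bin"), which matches the observed "I am hungry.Ich" split. The reported useTokenEnd=false branch gives 11 (the period itself), and the proposed change gives 12 (the "I" of "Ich").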

          People

            joern Jörn Kottmann
            eugenh Eugen Hanussek