[OPENNLP-711] SentenceDetectorME::sentPosDetect() with useTokenEnd=false - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.6.0
Component/s: Sentence Detector
Labels: None

Description

I trained the SentenceModel with a German corpus and was surprised by the results for the following input (each mark indicates an expected split):

"I am hungry.Ich bin Mr. Bean.Ein guter Satz."
             ^                ^

The detector returned 3 sentences, as expected, but the splits were not at the eosChar. Instead, each split came after the whole token containing the eosChar: "I am hungry.Ich" , "bin Mr. Bean.Ein", ...

After some debugging I found that I have to set useTokenEnd=false in the SentenceDetectorFactory constructor.
I then found a small bug in SentenceDetectorME where the span is calculated:

  public Span[] sentPosDetect(String s) {
...
      if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
        if (index != cint) {
          if (useTokenEnd) {
            positions.add(getFirstNonWS(s, getFirstWS(s,cint + 1)));
          }
          else {
            positions.add(getFirstNonWS(s,cint)); // this should be positions.add(getFirstNonWS(s,cint + 1)); 
          }
          sentProbs.add(probs[model.getIndex(bestOutcome)]);
        }
        index = cint + 1;
      }
...

This change only affects models with useTokenEnd=false.
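The difference between the three position calculations can be checked in isolation. The sketch below is not the OpenNLP source: `SplitPositionDemo` is a hypothetical class, and its `getFirstNonWS` and `getFirstWS` are stand-ins that assume the private helpers in SentenceDetectorME simply skip forward over whitespace or non-whitespace characters. It takes the first eosChar of the example input as `cint` and prints where each variant would place the split.

```java
public class SplitPositionDemo {

    // Assumed behavior of the private helper: index of the first
    // non-whitespace character at or after pos.
    static int getFirstNonWS(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Assumed behavior of the private helper: index of the first
    // whitespace character at or after pos.
    static int getFirstWS(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // index of the first eosChar

        // useTokenEnd=true: skip to the end of the token, then to the next token.
        int tokenEnd = getFirstNonWS(s, getFirstWS(s, cint + 1));
        // useTokenEnd=false as reported: the eosChar itself is non-whitespace,
        // so the position stays on the eosChar.
        int reported = getFirstNonWS(s, cint);
        // useTokenEnd=false as proposed: start looking one past the eosChar.
        int proposed = getFirstNonWS(s, cint + 1);

        System.out.println("[" + s.substring(0, tokenEnd) + "]"); // [I am hungry.Ich ]
        System.out.println("[" + s.substring(0, reported) + "]"); // [I am hungry]
        System.out.println("[" + s.substring(0, proposed) + "]"); // [I am hungry.]
    }
}
```

Under these assumptions, the reported variant ends the first sentence before the eosChar, so the period would open the next sentence, while the proposed `cint + 1` keeps the period with the sentence it terminates. The demo only computes raw split positions; it does not reproduce the model's split decisions or any later trimming of the spans.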

People

Assignee: Jörn Kottmann

Reporter: Eugen Hanussek

Votes: 0

Watchers: 2

Dates

Created: 07/Aug/14 11:38

Updated: 20/Nov/14 16:31

Resolved: 20/Oct/14 22:11