Details
Bug
Status: Closed
Minor
Resolution: Fixed
1.6.0
None
Description
I trained the SentenceModel with a German corpus and was surprised by the results for the following input (a mark indicates the expected split):
"I am hungry.Ich bin Mr. Bean.Ein guter Satz."
            ^                ^
The result was 3 sentences, which is the right count, but the splits were not at the eosChar. Each split fell after the whole token containing the eosChar: "I am hungry.Ich" , "bin Mr. Bean.Ein", ...
After some debugging I found out that I have to set useTokenEnd=false in the SentenceDetectorFactory constructor.
Then I found a small bug in SentenceDetectorME where the split position is calculated:
public Span[] sentPosDetect(String s) {
    ...
    if (bestOutcome.equals(SPLIT) && isAcceptableBreak(s, index, cint)) {
        if (index != cint) {
            if (useTokenEnd) {
                positions.add(getFirstNonWS(s, getFirstWS(s, cint + 1)));
            } else {
                positions.add(getFirstNonWS(s, cint)); // this should be positions.add(getFirstNonWS(s, cint + 1));
            }
            sentProbs.add(probs[model.getIndex(bestOutcome)]);
        }
        index = cint + 1;
    }
    ...
This change only affects models with useTokenEnd=false.
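To illustrate the off-by-one, here is a minimal standalone sketch that does not use OpenNLP at all. The helpers firstWs and firstNonWs are hypothetical stand-ins for getFirstWS and getFirstNonWS from the snippet above, written with Character.isWhitespace, and cint is assumed to be the index of the candidate eosChar. The class name and the three position methods are made up for this illustration.

```java
class SplitPositionSketch {

    // Stand-in for getFirstWS (assumption): index of the first whitespace
    // character at or after pos, or s.length() if there is none.
    static int firstWs(String s, int pos) {
        while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // Stand-in for getFirstNonWS (assumption): index of the first
    // non-whitespace character at or after pos, or s.length() if there is none.
    static int firstNonWs(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    // useTokenEnd == true: skip to the end of the token that contains the eosChar.
    static int tokenEndPosition(String s, int cint) {
        return firstNonWs(s, firstWs(s, cint + 1));
    }

    // useTokenEnd == false, as in the reported code: starts the scan at the eosChar itself.
    static int reportedPosition(String s, int cint) {
        return firstNonWs(s, cint);
    }

    // useTokenEnd == false, with the proposed change: starts the scan after the eosChar.
    static int proposedPosition(String s, int cint) {
        return firstNonWs(s, cint + 1);
    }

    public static void main(String[] args) {
        String s = "I am hungry.Ich bin Mr. Bean.Ein guter Satz.";
        int cint = s.indexOf('.'); // first candidate eosChar

        System.out.println("[" + s.substring(0, tokenEndPosition(s, cint)).trim() + "]");
        System.out.println("[" + s.substring(0, reportedPosition(s, cint)) + "]");
        System.out.println("[" + s.substring(0, proposedPosition(s, cint)) + "]");
    }
}
```

Under these assumptions the token-end variant yields "I am hungry.Ich" as the first sentence, which matches the behaviour described above. The reported non-token-end variant stops on the eosChar, so the period is left out of the first sentence, while the proposed variant ends the sentence right after the period.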