Description
Today the KoreanTokenizer keeps track of the whitespace that appears before a known term in order to apply a space penalty factor. This whitespace is considered part of the next term, so the backtrace gap limit is not applied to it.
As a result, the position buffer can grow up to the length of the longest run of consecutive whitespace in the input. This is problematic because the buffer is reused on reset(), so we should ensure that the max backtrace gap limit is applied consistently to consecutive whitespace.
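
To make the growth concrete, here is a minimal toy sketch in Java. It is not the KoreanTokenizer code: the class WhitespaceGapSketch, the method peakBufferedPositions, and the rule that every non-whitespace character ends a term are all assumptions made for illustration, and the gap value of 1024 is only an example parameter. It counts how many positions stay buffered between flushes, with and without a gap limit applied to whitespace.

```java
public class WhitespaceGapSketch {

    /**
     * Toy model (hypothetical, not Lucene's implementation): returns the peak
     * number of positions buffered while scanning the input.
     *
     * Assumptions of this sketch:
     *  - every non-whitespace character ends a term and flushes the buffer;
     *  - whitespace is held back as part of the next term;
     *  - when limitWhitespace is true, a whitespace run is also flushed once
     *    maxGap positions have been buffered.
     */
    public static int peakBufferedPositions(String input, int maxGap, boolean limitWhitespace) {
        int buffered = 0;
        int peak = 0;
        for (int i = 0; i < input.length(); i++) {
            buffered++;
            peak = Math.max(peak, buffered);
            boolean isWhitespace = Character.isWhitespace(input.charAt(i));
            if (!isWhitespace) {
                // A term ended: the buffered positions can be released.
                buffered = 0;
            } else if (limitWhitespace && buffered >= maxGap) {
                // Gap limit reached inside a whitespace run: force a flush.
                buffered = 0;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Build "a", then 5000 spaces, then "b".
        StringBuilder sb = new StringBuilder("a");
        for (int i = 0; i < 5000; i++) {
            sb.append(' ');
        }
        sb.append('b');
        String input = sb.toString();

        // Without the limit, the peak tracks the length of the whitespace run.
        System.out.println(peakBufferedPositions(input, 1024, false));
        // With the limit, the peak is bounded by maxGap.
        System.out.println(peakBufferedPositions(input, 1024, true));
    }
}
```

In this toy model the unbounded variant peaks at roughly the length of the whitespace run, while the bounded variant never exceeds the gap limit, which is the behavior this issue asks for.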