[LUCENE-10081] KoreanTokenizer should check the max backtrace gap on whitespaces - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.10
Component/s: None
Labels: None
Lucene Fields: New

Description

Today the KoreanTokenizer keeps track of the whitespace characters that appear before a known term in order to apply a space penalty factor. Because this whitespace is considered part of the next term, the backtrace gap limit is not applied to it.
As a result, the position buffer can grow as large as the longest run of consecutive whitespace in the input. This is problematic because the buffer is reused on reset(), so an oversized buffer is retained across inputs. We should ensure that the max backtrace gap limit is applied consistently to consecutive whitespace.
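
The effect described above can be illustrated with a toy model. This is a minimal sketch and not the Lucene code: the class name, the method, and the small MAX_BACKTRACE_GAP value are all hypothetical, and the "buffer" is reduced to a counter of pending positions. It only shows how skipping the gap check on whitespace lets the pending count follow the length of the whitespace run, while applying the check keeps it bounded.

```java
import java.util.Arrays;

/** Toy model for illustration only; not the KoreanTokenizer implementation. */
final class BacktraceGapSketch {
  // Hypothetical small limit so the effect is visible on short inputs.
  static final int MAX_BACKTRACE_GAP = 8;

  /**
   * Scans the input and returns the peak number of positions pending since the
   * last simulated backtrace. When checkWhitespace is false, the gap check is
   * skipped for whitespace characters (the behaviour the issue describes).
   */
  static int peakPendingPositions(String input, boolean checkWhitespace) {
    int pending = 0;
    int peak = 0;
    for (int i = 0; i < input.length(); i++) {
      boolean isWhitespace = Character.isWhitespace(input.charAt(i));
      pending++;
      peak = Math.max(peak, pending);
      boolean gapExceeded = pending >= MAX_BACKTRACE_GAP;
      // A simulated backtrace releases all pending positions.
      if (gapExceeded && (!isWhitespace || checkWhitespace)) {
        pending = 0;
      }
    }
    return peak;
  }

  public static void main(String[] args) {
    char[] spaces = new char[100];
    Arrays.fill(spaces, ' ');
    String input = new String(spaces) + "가";
    System.out.println(peakPendingPositions(input, false)); // prints 101
    System.out.println(peakPendingPositions(input, true));  // prints 8
  }
}
```

In this sketch the unchecked variant peaks at the full length of the whitespace run plus the following character, whereas the checked variant never exceeds the configured gap, which is the consistency the issue asks for.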

Issue Links

links to

GitHub Pull Request #272

People

Assignee: Unassigned
Reporter: Jim Ferenczi
Votes: 0
Watchers: 4

Dates

Created: 31/Aug/21 21:36
Updated: 28/Aug/22 16:25
Resolved: 06/Sep/21 07:05

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 0.5h