  1. Lucene - Core
  2. LUCENE-10081

KoreanTokenizer should check the max backtrace gap on whitespaces

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.10
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Today the KoreanTokenizer keeps track of the whitespace that appears before a known term in order to apply a space penalty factor. This whitespace is considered part of the next term, so the backtrace gap limit is not applied to it.
      As a result, the position buffer can grow as large as the longest run of consecutive whitespace in the input. This is problematic because the buffer is reused on reset(), so any growth persists across subsequent inputs. We should ensure that the max backtrace gap limit is applied consistently to consecutive whitespace.
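
      The effect can be illustrated with a small model. This is a simplified sketch, not the KoreanTokenizer code: the class WhitespaceGapSketch, the method peakBufferedPositions, and the constant value 1024 are all assumptions made for illustration. It counts how many positions stay buffered when the gap limit is checked only on non-whitespace characters, compared with checking it on every character.

```java
public class WhitespaceGapSketch {

    // Hypothetical stand-in for the tokenizer's max backtrace gap; the value is an assumption.
    static final int MAX_BACKTRACE_GAP = 1024;

    /**
     * Simulates a position buffer that is flushed (a forced backtrace) once the
     * number of buffered positions reaches MAX_BACKTRACE_GAP.
     *
     * @param checkGapOnWhitespace if false, whitespace characters never trigger the
     *                             gap check, which models the behavior described above
     * @return the largest number of positions buffered at any one time
     */
    static int peakBufferedPositions(String input, boolean checkGapOnWhitespace) {
        int buffered = 0; // positions held since the last simulated backtrace
        int peak = 0;
        for (int i = 0; i < input.length(); i++) {
            buffered++;
            peak = Math.max(peak, buffered);
            boolean isSpace = Character.isWhitespace(input.charAt(i));
            boolean gapChecked = !isSpace || checkGapOnWhitespace;
            if (gapChecked && buffered >= MAX_BACKTRACE_GAP) {
                buffered = 0; // simulated backtrace releases the buffered positions
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // 5000 spaces followed by a three-character Korean word, written with Unicode escapes.
        String input = " ".repeat(5000) + "\uD55C\uAD6D\uC5B4";
        System.out.println(peakBufferedPositions(input, false)); // 5001
        System.out.println(peakBufferedPositions(input, true));  // 1024
    }
}
```

      In this model, skipping the check on whitespace lets the buffer grow with the length of the whitespace run, while checking it on every character keeps the peak at the gap limit regardless of the input.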

            People

              Assignee: Unassigned
              Reporter: Jim Ferenczi (jimczi)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 0.5h