  1. Lucene - Core
  2. LUCENE-8185

HyphenationCompoundWordTokenFilter returns terms shorter than minSubwordSize

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 6.6.1, 7.2.1
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      To account for languages that use binding characters ("fogemorphemes") when composing words, HyphenationCompoundWordTokenFilter re-checks the dictionary for a candidate with its last character removed when the original candidate is not found. In this case it does not currently re-check the shortened candidate against minSubwordSize, so terms that are one character shorter than minSubwordSize can be returned.
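
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the filter's actual code: the class `SubwordSizeSketch`, the method `decompose`, the `recheckSize` flag, and the hand-picked break offsets are all hypothetical stand-ins for the hyphenation and dictionary lookup the filter performs. It assumes a dictionary containing "ort" and "name", the input "ortsname", and a minSubwordSize of 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SubwordSizeSketch {

    /**
     * Hypothetical, simplified stand-in for the decomposition loop.
     * breaks holds ascending candidate split offsets, as a hyphenator might produce.
     * recheckSize toggles the extra length check that the issue says is missing.
     */
    static List<String> decompose(String word, int[] breaks, Set<String> dictionary,
                                  int minSubwordSize, boolean recheckSize) {
        List<String> subwords = new ArrayList<>();
        for (int i = 0; i < breaks.length; i++) {
            int start = breaks[i];
            for (int j = i + 1; j < breaks.length; j++) {
                int partLength = breaks[j] - start;
                if (partLength < minSubwordSize) {
                    continue; // the original candidate is size-checked
                }
                String candidate = word.substring(start, start + partLength);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                    continue;
                }
                // Fallback for binding characters: retry without the last character.
                int shorterLength = partLength - 1;
                if (recheckSize && shorterLength < minSubwordSize) {
                    continue; // the re-check proposed by this issue
                }
                String shortened = word.substring(start, start + shorterLength);
                if (dictionary.contains(shortened)) {
                    subwords.add(shortened);
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("ort", "name"));
        int[] breaks = {0, 4, 8}; // assumed split points: "orts" | "name"
        // Without the re-check, "ort" (3 characters) is returned despite minSubwordSize = 4.
        System.out.println(decompose("ortsname", breaks, dictionary, 4, false));
        // With the re-check, only "name" is returned.
        System.out.println(decompose("ortsname", breaks, dictionary, 4, true));
    }
}
```

In this sketch the candidate "orts" passes the size check but is not in the dictionary, so the fallback looks up "ort" and returns it unless the shortened length is checked again.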

      Attachments

        1. LUCENE-8185.patch
          6 kB
          Matthias Krueger

            People

              Assignee: Unassigned
              Reporter: Matthias Krueger (mkrio)
              Votes: 0
              Watchers: 2

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m