Lucene - Core / LUCENE-3417

DictionaryCompoundWordTokenFilter does not properly add tokens from the end of a compound word.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3, 4.0-ALPHA
    • Fix Version/s: 3.5, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      Due to an off-by-one error, a subword placed at the end of a compound word will not get a token added to the token stream.

      For example (from the unit test in the attached patch):
      Dictionary:

      {"ab", "cd", "ef"}

      Input: "abcdef"
      Created tokens:

      {"abcdef", "ab", "cd"}

      Expected tokens:

      {"abcdef", "ab", "cd", "ef"}

      Additionally, due to another off-by-one error, the filter could produce tokens shorter than minSubwordSize. For example (again, from the attached patch):

      Dictionary:

      {"abc", "d", "efg"}

      Minimum subword length: 2
      Input: "abcdefg"
      Created tokens:

      {"abcdefg", "abc", "d", "efg"}

      Expected tokens:

      {"abcdefg", "abc", "efg"}
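
One way such an off-by-one can arise is a candidate length that starts one below the configured minimum. The sketch below is an assumption-laden illustration, not the code from the patch: `MinSubwordSketch`, `subwords`, and the `smallestSize` parameter are hypothetical, and the original token is omitted from the result for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MinSubwordSketch {
    // Hypothetical helper, not part of Lucene: collects dictionary subwords
    // whose length lies between smallestSize and maxSub (both inclusive).
    static List<String> subwords(String word, Set<String> dict, int smallestSize, int maxSub) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + smallestSize <= word.length(); start++) {
            for (int size = smallestSize; size <= maxSub && start + size <= word.length(); size++) {
                String candidate = word.substring(start, start + size);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("abc", "d", "efg");
        int minSubwordSize = 2;
        // Starting one below the minimum lets the one-character "d" through.
        System.out.println(subwords("abcdefg", dict, minSubwordSize - 1, 15)); // prints [abc, d, efg]
        // Starting at the minimum itself filters it out.
        System.out.println(subwords("abcdefg", dict, minSubwordSize, 15));     // prints [abc, efg]
    }
}
```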

      Attachments

        1. LUCENE-3417.patch (3 kB, Njal Karevoll)

          People

            Assignee: Robert Muir (rcmuir)
            Reporter: Njal Karevoll (nkvoll)

              Time Tracking

                Estimated: 5m
                Remaining: 5m
                Logged: Not Specified