  1. Lucene - Core
  2. LUCENE-8183

HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.6
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled. 

      Example:

      Dictionary: gesellschaft, schaft
      Hyphenator: de_DR.xml // from Apache OFFO
      onlyLongestMatch: true

       

      text            gesellschaft                           gesellschaft                           schaft
      raw_bytes       [67 65 73 65 6c 6c 73 63 68 61 66 74]  [67 65 73 65 6c 6c 73 63 68 61 66 74]  [73 63 68 61 66 74]
      start           0                                      0                                      0
      end             12                                     12                                     12
      positionLength  1                                      1                                      1
      type            word                                   word                                   word
      position        1                                      1                                      1

      IMHO this output includes 2 unexpected tokens:

      1. the 2nd 'gesellschaft', as it duplicates the original token
      2. the 'schaft', as it is a sub-token of 'gesellschaft', which is itself present in the dictionary and is the longer match
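
      To make the expectation concrete, here is a minimal, self-contained Java sketch of the behaviour I would expect from onlyLongestMatch. It is only a model, not Lucene's implementation: the class LongestMatchSketch and the method expectedSubTokens are hypothetical names, it treats every substring as a candidate (the real filter only considers hyphenation points), and it ignores the min/max subword size settings. It keeps the longest dictionary matches that do not overlap each other and drops a match that would only repeat the original token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the expected onlyLongestMatch behaviour.
// NOT Lucene code: all names here are hypothetical and for illustration only.
class LongestMatchSketch {

    /** Returns the sub-tokens expected in addition to the original token. */
    static List<String> expectedSubTokens(String token, Set<String> dictionary) {
        // Collect every dictionary hit as {start, end} offsets (end exclusive).
        // Assumption: any substring is a candidate, not only hyphenation points.
        List<int[]> candidates = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int end = start + 1; end <= token.length(); end++) {
                if (dictionary.contains(token.substring(start, end))) {
                    candidates.add(new int[] {start, end});
                }
            }
        }

        // Longest match first; ties are broken by the earlier start offset.
        candidates.sort((a, b) -> {
            int byLength = Integer.compare(b[1] - b[0], a[1] - a[0]);
            return byLength != 0 ? byLength : Integer.compare(a[0], b[0]);
        });

        // Greedily keep a match only if it overlaps none of the longer ones.
        List<int[]> kept = new ArrayList<>();
        for (int[] c : candidates) {
            boolean overlaps = false;
            for (int[] k : kept) {
                if (c[0] < k[1] && k[0] < c[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) {
                kept.add(c);
            }
        }

        // Emit in text order, skipping a match that spans the whole token,
        // because that would only duplicate the original token.
        kept.sort((a, b) -> Integer.compare(a[0], b[0]));
        List<String> result = new ArrayList<>();
        for (int[] k : kept) {
            boolean wholeToken = k[0] == 0 && k[1] == token.length();
            if (!wholeToken) {
                result.add(token.substring(k[0], k[1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The example from this issue: no extra tokens are expected.
        System.out.println(expectedSubTokens("gesellschaft", Set.of("gesellschaft", "schaft")));
        // A made-up compound where two non-overlapping parts are expected.
        System.out.println(expectedSubTokens("fussballpumpe", Set.of("fuss", "ball", "fussball", "pumpe")));
    }
}
```

      With the dictionary from the example above, this model returns no sub-tokens for 'gesellschaft', so only the original token would remain in the stream. The second call uses an invented word and dictionary purely to show that non-overlapping parts are still kept.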

       

      Attachments

        1. LUCENE-8183_20180223_rwesten.diff
          3 kB
          Rupert Westenthaler
        2. LUCENE-8183_20180227_rwesten.diff
          25 kB
          Rupert Westenthaler
        3. lucene-8183.zip
          80 kB
          Rupert Westenthaler

            People

              Assignee: uschindler Uwe Schindler
              Reporter: rwesten Rupert Westenthaler
              Votes: 1
              Watchers: 7
