Lucene - Core / LUCENE-7004

Duplicate tokens using WordDelimiterFilter for a specific configuration


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Lucene Fields: New, Patch Available

    Description

      When using the options PRESERVE_ORIGINAL, SPLIT_ON_CASE_CHANGE, and CATENATE_ALL together with the WordDelimiterFilter, we get duplicate tokens on strings containing only case changes.

      When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", "Def".

      With PRESERVE_ORIGINAL, we also keep "abcDef".

      However, when one also uses CATENATE_ALL (or CATENATE_WORDS?), the filter emits another token built from the concatenation of the split words, giving "abcDef" again.

      I'm not 100% certain that token filters must not produce duplicate tokens (same text, same start and end offsets). Can someone confirm this is a bug?
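      To make the interaction concrete, here is a minimal sketch (not the actual Lucene filter; the class and method names are invented for illustration) of the token emission the description walks through. With all three flags active, the catenated token reproduces the preserved original:

      ```java
      import java.util.ArrayList;
      import java.util.List;

      // Toy model of PRESERVE_ORIGINAL | SPLIT_ON_CASE_CHANGE | CATENATE_ALL.
      // This is NOT Lucene's implementation, just the behavior the issue describes.
      public class WdfSketch {

          // Split "abcDef" into ["abc", "Def"] at lower-to-upper case changes.
          static List<String> splitOnCaseChange(String s) {
              List<String> parts = new ArrayList<>();
              int start = 0;
              for (int i = 1; i < s.length(); i++) {
                  if (Character.isLowerCase(s.charAt(i - 1))
                          && Character.isUpperCase(s.charAt(i))) {
                      parts.add(s.substring(start, i));
                      start = i;
                  }
              }
              parts.add(s.substring(start));
              return parts;
          }

          // Emit tokens the way the description explains:
          // the preserved original, each case-change part, and the
          // catenation of all parts (which equals the original again).
          static List<String> emit(String input) {
              List<String> out = new ArrayList<>();
              out.add(input);                    // PRESERVE_ORIGINAL
              List<String> parts = splitOnCaseChange(input);
              out.addAll(parts);                 // SPLIT_ON_CASE_CHANGE
              out.add(String.join("", parts));   // CATENATE_ALL -> duplicate
              return out;
          }

          public static void main(String[] args) {
              // Prints [abcDef, abc, Def, abcDef]: "abcDef" appears twice,
              // with the same text and the same start/end offsets.
              System.out.println(emit("abcDef"));
          }
      }
      ```

      Since the original token and the catenated token span the same offsets, the duplicate carries no extra information, which is why the reporter suspects a bug rather than intended behavior.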

      Attachments

        1. wdf-analysis.png (15 kB, Shawn Heisey)
        2. FIX-LUCENE-7004.PATCH (7 kB, Jean-Baptiste Lespiau)
        3. TEST-LUCENE-7004.PATCH (19 kB, Jean-Baptiste Lespiau)


            People

              Assignee: Unassigned
              Reporter: Jean-Baptiste Lespiau (Gueust)
              Votes: 0
              Watchers: 5
