Lucene - Core / LUCENE-8516

Make WordDelimiterGraphFilter a Tokenizer


    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      Being able to split tokens at arbitrary points in a filter chain, in effect adding a second round of tokenization, can cause any number of problems when trying to keep token streams conforming to the TokenStream contract. The most common offender here is WordDelimiterGraphFilter, which can produce broken offsets in a wide range of situations.

      We should make WDGF a Tokenizer in its own right; this should preserve all the functionality we need while making reasoning about the resulting token stream much simpler.
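A toy Python sketch of the offset problem described above (illustrative only, not Lucene code; the input string and the case-transition split rule are assumptions for the example). Once an earlier stage has rewritten the text, a filter that splits tokens can only compute sub-token offsets from indices inside the token text, and those indices need not map back to the original input:

```python
# Toy illustration: why splitting tokens in a *filter* can break offsets.
# Assume a char filter has already rewritten the input, so the token text
# and the original text no longer line up character-for-character.

original = "AT&T shares"
# a mapping char filter rewrites "&" -> "and" before tokenization
filtered = original.replace("&", "and")          # "ATandT shares"

# The tokenizer consumed the filtered text, but its offsets (correctly)
# point into the original input: token "ATandT" covers original[0:4] == "AT&T".
token_text, start_off, end_off = "ATandT", 0, 4

def naive_subtokens(text, start):
    """Split on case transitions, deriving offsets from indices *within
    the token text* - the only information a downstream filter has."""
    out, i = [], 0
    for j in range(1, len(text) + 1):
        if j == len(text) or text[j].isupper() != text[j - 1].isupper():
            out.append((text[i:j], start + i, start + j))
            i = j
    return out

for tok, s, e in naive_subtokens(token_text, start_off):
    # "and" maps to original[2:5] == "&T ", and "T" gets offsets 5..6,
    # past the parent token's endOffset of 4 - both are broken.
    print(tok, s, e, repr(original[s:e]))
```

A tokenizer, by contrast, sees character positions directly and can record correct offsets as it splits, which is the motivation for moving the word-delimiter logic into a Tokenizer.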

        Attachments

        1. LUCENE-8516.patch (51 kB, Alan Woodward)


            People

            • Assignee: Alan Woodward (romseygeek)
            • Reporter: Alan Woodward (romseygeek)
            • Votes: 0
            • Watchers: 3
