Lucene - Core / LUCENE-9006

Ensure WordDelimiterGraphFilter always emits catenateAll token early

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.4
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      Ideally, the first token WDGF emits is the preserveOriginal token (if configured), and the second is the catenateAll token (if configured). The deprecated WDF does this, but WDGF can sometimes emit another token ahead of catenateAll when there is a candidate sub-token that is not emitted.

      Example: the input "8-other" with only generateWordParts and catenateAll enabled, not generateNumberParts. WDGF internally sees the '8' but moves on without emitting it. Ultimately, the "other" token and the catenated "8other" token end up at the same internal position, which by luck fools the sorter into emitting "other" first.
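
The intended ordering can be sketched with a small toy model. This is not Lucene code and does not reproduce WDGF's internal buffering or sorting; the class name `WdgfOrderSketch` and its methods are hypothetical, and the splitting rule (break on non-alphanumeric characters and on letter/digit transitions) is a simplifying assumption that ignores case changes, protected words, and de-duplication. It only illustrates the order the description asks for: preserveOriginal first, then catenateAll, then the emitted sub-tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the desired token order; hypothetical, not part of Lucene. */
public class WdgfOrderSketch {

    /** Splits on non-alphanumeric characters and on letter/digit transitions. */
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                flush(parts, cur); // a delimiter ends the current part
                continue;
            }
            boolean typeChanged = cur.length() > 0
                    && Character.isDigit(c) != Character.isDigit(cur.charAt(cur.length() - 1));
            if (typeChanged) {
                flush(parts, cur); // letter<->digit transition ends the current part
            }
            cur.append(c);
        }
        flush(parts, cur);
        return parts;
    }

    static void flush(List<String> parts, StringBuilder cur) {
        if (cur.length() > 0) {
            parts.add(cur.toString());
            cur.setLength(0);
        }
    }

    /** Emits tokens in the order the issue describes as ideal. */
    static List<String> tokens(String input, boolean preserveOriginal, boolean catenateAll,
                               boolean generateWordParts, boolean generateNumberParts) {
        List<String> parts = split(input);
        List<String> out = new ArrayList<>();
        if (preserveOriginal) {
            out.add(input);                  // first: the original token
        }
        if (catenateAll) {
            out.add(String.join("", parts)); // second: all parts catenated
        }
        for (String p : parts) {
            boolean numeric = Character.isDigit(p.charAt(0));
            // A part that is seen but not configured for emission is skipped;
            // it must not change where the catenated token appears.
            if (numeric ? generateNumberParts : generateWordParts) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // generateWordParts + catenateAll only, as in the "8-other" example
        System.out.println(tokens("8-other", false, true, true, false)); // prints [8other, other]
    }
}
```

In this model the skipped '8' has no effect on ordering, so "8other" always precedes "other"; the issue is that WDGF's position-based sort did not guarantee the same result.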

              People

              • Assignee: David Smiley (dsmiley)
              • Reporter: David Smiley (dsmiley)
              • Votes: 0
              • Watchers: 5

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m