Lucene - Core / LUCENE-8497

Rethink multi-term analysis handling


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      The current framework for handling term normalisation relies on instanceof checks for MultiTermAwareComponent, followed by casts. MultiTermAwareComponent itself deals in AbstractAnalysisFactory instances, so callers must cast to the correct component type before use, which is ripe for misuse.
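
      The pattern described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the `Old*` classes are not the real Lucene types, and a term is modelled as a plain `String` rather than a token stream. The point is only to show where the unchecked cast sits and how it can fail at runtime.

```java
import java.util.Locale;

// Hypothetical stand-ins for the analysis factory hierarchy (not Lucene classes).
abstract class OldAnalysisFactory {}

// Marker-style interface: returns the component to use for multi-term queries,
// typed only as the common base class.
interface OldMultiTermAware {
    OldAnalysisFactory getMultiTermComponent();
}

abstract class OldTokenFilterFactory extends OldAnalysisFactory {
    abstract String create(String term);
}

abstract class OldCharFilterFactory extends OldAnalysisFactory {
    abstract String create(String text);
}

// A well-behaved factory: its multi-term component is itself.
final class OldLowerCaseFilterFactory extends OldTokenFilterFactory implements OldMultiTermAware {
    @Override String create(String term) { return term.toLowerCase(Locale.ROOT); }
    @Override public OldAnalysisFactory getMultiTermComponent() { return this; }
}

// A badly-behaved factory: nothing stops it returning the wrong kind of component.
final class OldMismatchedFactory extends OldTokenFilterFactory implements OldMultiTermAware {
    @Override String create(String term) { return term; }
    @Override public OldAnalysisFactory getMultiTermComponent() {
        return new OldCharFilterFactory() {
            @Override String create(String text) { return text; }
        };
    }
}

final class OldStyleNormalizer {
    static String normalize(OldTokenFilterFactory factory, String term) {
        if (factory instanceof OldMultiTermAware) {
            OldAnalysisFactory component = ((OldMultiTermAware) factory).getMultiTermComponent();
            // The compiler cannot verify this cast; a mismatch only surfaces at runtime.
            return ((OldTokenFilterFactory) component).create(term);
        }
        // Factories that are not multi-term aware are skipped.
        return term;
    }
}
```

      In this sketch, passing `OldMismatchedFactory` to `OldStyleNormalizer.normalize` compiles cleanly but throws a `ClassCastException`, which is the kind of misuse the description is concerned about.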

      We should re-organise all this to be type-safe and usable without casts.  One possibility is to add `normalize` methods to CharFilterFactory and TokenFilterFactory that mirror their existing `create` methods.  The default implementation would return the input unchanged, while filters that should apply at normalization time can delegate to `create`.
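
      A minimal sketch of the proposed shape, under the same simplifying assumptions: the `Sketch*` classes are hypothetical, terms are plain strings instead of token streams, and the stemming filter is invented purely to show a filter that should not run at normalization time. It is not the patch attached to this issue.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a token filter factory with the proposed method pair.
abstract class SketchTokenFilterFactory {
    // Applied during full analysis.
    abstract String create(String term);

    // Applied when normalizing a multi-term query; by default returns the input unchanged.
    String normalize(String term) {
        return term;
    }
}

// Lowercasing should also apply at normalization time, so normalize delegates to create.
final class SketchLowerCaseFilterFactory extends SketchTokenFilterFactory {
    @Override String create(String term) { return term.toLowerCase(Locale.ROOT); }
    @Override String normalize(String term) { return create(term); }
}

// A toy stemmer that strips a trailing "s". It keeps the default normalize,
// so it is skipped when normalizing.
final class SketchPluralStemFilterFactory extends SketchTokenFilterFactory {
    @Override String create(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }
}

final class SketchAnalyzer {
    // Full analysis: every filter in the chain is applied.
    static String analyze(List<SketchTokenFilterFactory> chain, String term) {
        String out = term;
        for (SketchTokenFilterFactory factory : chain) {
            out = factory.create(out);
        }
        return out;
    }

    // Normalization: no instanceof checks or casts, each factory decides for itself.
    static String normalize(List<SketchTokenFilterFactory> chain, String term) {
        String out = term;
        for (SketchTokenFilterFactory factory : chain) {
            out = factory.normalize(out);
        }
        return out;
    }
}
```

      With a chain of the lowercase factory followed by the toy stemmer, `analyze` applies both filters while `normalize` only lowercases, and the caller never needs to know which factories opted in.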

      Related to this, we should deprecate and remove LowerCaseTokenizer, which combines tokenization and normalization in a way that will break this API.

        Attachments

        1. LUCENE-8497.patch
          74 kB
          Alan Woodward
        2. LUCENE-8497.patch
          77 kB
          Alan Woodward
        3. LUCENE-8497.patch
          77 kB
          Alan Woodward
        4. LUCENE-8497.patch
          81 kB
          Alan Woodward

            People

            • Assignee: Alan Woodward (romseygeek)
            • Reporter: Alan Woodward (romseygeek)

                Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 1h 10m