Lucene - Core / LUCENE-7960

NGram filters -- preserve the original token when it is outside the min/max size range

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.

      This is probably 100% what was intended, but I've seen it cause a lot of problems for users. I am not suggesting that the default behavior be changed. That would be far too disruptive to the existing user base.

      I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.
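
      To make the reported behavior concrete, here is a minimal plain-Java sketch. It is not Lucene's filter code: the class NGramSketch and its ngrams method are hypothetical, the flag name keepShortTerms is simply the one suggested above, and the order in which grams are emitted is an assumption. It only covers terms shorter than the minimum gram size.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-alone model of n-gram generation,
// NOT the Lucene NGramTokenFilter implementation.
final class NGramSketch {

    // Returns every substring of `term` whose length lies in [minGram, maxGram].
    // If the term is shorter than minGram, no gram can be produced; the
    // hypothetical keepShortTerms flag decides whether the term itself survives.
    static List<String> ngrams(String term, int minGram, int maxGram, boolean keepShortTerms) {
        List<String> out = new ArrayList<>();
        if (term.length() < minGram) {
            if (keepShortTerms) {
                out.add(term); // proposed behavior: pass the short term through unchanged
            }
            return out;        // default behavior: the term vanishes from the stream
        }
        for (int start = 0; start < term.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
                out.add(term.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3, 4, false)); // [luc, luce, uce, ucen, cen, cene, ene]
        System.out.println(ngrams("ab", 3, 4, false));     // []
        System.out.println(ngrams("ab", 3, 4, true));      // [ab]
    }
}
```

      With the flag off, a two-letter term such as "ab" produces nothing when the minimum gram size is 3, so a query for "ab" can never match. With the flag on, the term is indexed as-is.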

      Attachments

        1. LUCENE-7960.patch (45 kB, Shawn Heisey)
        2. LUCENE-7960.patch (41 kB, Shawn Heisey)
        3. LUCENE-7960.patch (48 kB, Shawn Heisey)
        4. LUCENE-7960.patch (46 kB, Ingomar Wesp)

            People

              Assignee: Robert Muir (rcmuir)
              Reporter: Shawn Heisey (elyograg)
              Votes: 2
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h