Lucene - Core / LUCENE-7960

NGram filters -- preserve the original token when it is outside the min/max size range

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.

      This is almost certainly the intended behavior, but I've seen it cause a lot of problems for users. I am not suggesting that the default behavior be changed; that would be far too disruptive to the existing user base.

      I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.
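
To make the request concrete, here is a minimal Python sketch of the behavior described above. It is a toy model, not Lucene's implementation: the function `ngram_filter` and its `keep_short_terms` parameter are hypothetical names chosen to mirror the proposed `keepShortTerms` option, and the gram ordering is an assumption made only for illustration. It covers terms shorter than the minimum gram size, which is the case the description discusses.

```python
from typing import Iterable, List


def ngram_filter(tokens: Iterable[str], min_gram: int, max_gram: int,
                 keep_short_terms: bool = False) -> List[str]:
    """Toy model of an n-gram token filter (hypothetical, not Lucene code)."""
    out: List[str] = []
    for token in tokens:
        if len(token) < min_gram:
            # No gram of the minimum size fits, so the term produces nothing
            # unless the hypothetical option asks to keep it unchanged.
            if keep_short_terms:
                out.append(token)
            continue
        # Emit every substring whose length lies in [min_gram, max_gram].
        for start in range(len(token)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(token):
                    out.append(token[start:start + size])
    return out


tokens = ["to", "lucene"]

# Default: the short term "to" disappears from the output entirely.
print(ngram_filter(tokens, 3, 4))
# ['luc', 'luce', 'uce', 'ucen', 'cen', 'cene', 'ene']

# With the option enabled, "to" is passed through as-is.
print(ngram_filter(tokens, 3, 4, keep_short_terms=True))
# ['to', 'luc', 'luce', 'uce', 'ucen', 'cen', 'cene', 'ene']
```

Because the flag defaults to false in this sketch, existing callers would see no change, which matches the compatibility concern raised in the description.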

        Attachments

        1. LUCENE-7960.patch
          45 kB
          Shawn Heisey
        2. LUCENE-7960.patch
          41 kB
          Shawn Heisey
        3. LUCENE-7960.patch
          48 kB
          Shawn Heisey
        4. LUCENE-7960.patch
          46 kB
          Ingomar Wesp

              People

              • Assignee: Robert Muir (rcmuir)
              • Reporter: Shawn Heisey (elyograg)
              • Votes: 2
              • Watchers: 4

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 1h