  1. Lucene - Core
  2. LUCENE-3979

NGramTokenizer strips whitespace, with no option to keep leading and trailing whitespace

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.9.2, 3.0
    • Fix Version/s: None
    • Component/s: modules/analysis
    • n/a

    • New

    Description

      org.apache.lucene.analysis.ngram.NGramTokenizer trims leading and trailing whitespace from its input, so searches for literal strings such as " test" and "test " become equivalent to a search for "test". Preserving that whitespace is sometimes desired, particularly where ngrams are used.

      This could be fixed either by removing .trim() from the line shown below or by adding a flag that controls trimming (keeping trim=true as the default so that existing code using this tokenizer is not broken).

      111: inStr = new String(chars).trim(); // remove any trailing empty strings
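
      The flag-based option could look roughly like the sketch below. This is not Lucene's code: the class NGramTrimSketch, the method ngrams, and the trim parameter are hypothetical names used only to illustrate how a default-true flag would keep current behaviour while letting callers opt in to whitespace-preserving grams.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the tokenizer's gram generation; not Lucene's actual code. */
public class NGramTrimSketch {

    /**
     * Returns every n-gram of the input with length minGram..maxGram, shortest first.
     * When trim is true, leading and trailing whitespace is dropped first,
     * mirroring the .trim() call quoted in the description.
     */
    public static List<String> ngrams(String input, int minGram, int maxGram, boolean trim) {
        String inStr = trim ? input.trim() : input; // assumed flag; trim=true matches today's behaviour
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= inStr.length(); start++) {
                grams.add(inStr.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(" test", 2, 2, true));  // prints [te, es, st]
        System.out.println(ngrams(" test", 2, 2, false)); // prints [ t, te, es, st]
    }
}
```

      With trim=false the leading-space gram " t" survives, so " test" and "test" no longer produce identical gram sets.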

          People

            Assignee: Unassigned
            Reporter: David Mason (damason)

              Time Tracking

                Original Estimate: 5m
                Remaining Estimate: 5m
                Time Spent: Not Specified