[LUCENE-3979] NGramTokenizer strips whitespace, with no option to keep leading and trailing whitespace - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.9.2, 3.0
Fix Version/s: None
Component/s: modules/analysis
Labels:
- tokenizer
- whitespace
Environment: n/a
Lucene Fields: New

Description

org.apache.lucene.analysis.ngram.NGramTokenizer trims leading and trailing whitespace from its input, so searches for literal strings like " test" and "test " are equivalent to a search for "test". Whitespace-sensitive matching is sometimes desired, particularly where n-grams are used.

This could be fixed by either removing .trim() from the line shown below, or by providing a flag to specifically set trimming behaviour (keeping trim=true as the default so that existing code using this analyzer is not broken).

111: inStr = new String(chars).trim(); // remove any trailing empty strings
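The effect of the proposed flag can be illustrated with a minimal, standalone sketch. It does not use Lucene: the class `NGramTrimSketch`, the method `ngrams`, and the `trim` parameter are all hypothetical names introduced for illustration, and the sketch operates on a plain String rather than on the tokenizer's character buffer. It only shows how trimming changes the n-grams produced for an input with leading whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's NGramTokenizer.
public class NGramTrimSketch {

    /**
     * Returns every n-gram of the input with length between minGram and
     * maxGram (inclusive), shorter grams first.
     *
     * @param trim assumed flag: when true, leading and trailing whitespace
     *             is stripped before n-grams are generated (the behaviour
     *             described in this issue); when false, it is preserved.
     */
    static List<String> ngrams(String input, int minGram, int maxGram, boolean trim) {
        String s = trim ? input.trim() : input;
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                grams.add(s.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // With trimming, " test" yields the same grams as "test".
        System.out.println(ngrams(" test", 2, 2, true));   // prints [te, es, st]
        // Without trimming, the leading space survives in the first gram.
        System.out.println(ngrams(" test", 2, 2, false));  // prints [ t, te, es, st]
    }
}
```

Defaulting such a flag to true would keep existing behaviour, as suggested above, while callers that need whitespace-sensitive matching could opt out.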

People

Assignee: Unassigned

Reporter: David Mason

Votes: 0

Watchers: 0

Dates

Created: 13/Apr/12 03:43

Updated: 28/Aug/22 13:14

Time Tracking

Estimated, Remaining, Logged: Not Specified