Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Now that we fixed NGramTokenizer and NGramTokenFilter to not produce corrupt token streams, the only way to have "true" offsets for n-grams is to use the tokenizer (the filter emits the offsets of the original token).

      Yet, our NGramTokenizer has a few flaws, in particular:

      • it doesn't have the ability to pre-tokenize the input stream, for example on whitespaces,
      • it doesn't play nice with surrogate pairs.

      Since we already broke backward compatibility for it in 4.4, I'd like to also fix these issues before we release.
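The surrogate-pair flaw can be shown with a small standalone demo (not Lucene code; the class and method names are made up for illustration). A char-based bigram loop cuts the UTF-16 encoding of U+1D11E in half, while advancing by code points keeps every gram well-formed:

```java
import java.util.ArrayList;
import java.util.List;

// "a𝄞b" is 3 code points but 4 Java chars, because U+1D11E is encoded
// as the surrogate pair \uD834\uDD1E in UTF-16.
public class SurrogateDemo {
    // Naive char-based bigrams: may cut a surrogate pair in half.
    static List<String> charBigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Code-point-based bigrams: advance by Character.charCount(cp).
    static List<String> codePointBigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < s.length();
                start += Character.charCount(s.codePointAt(start))) {
            int mid = start + Character.charCount(s.codePointAt(start));
            if (mid >= s.length()) break;
            int end = mid + Character.charCount(s.codePointAt(mid));
            grams.add(s.substring(start, end));
        }
        return grams;
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // "a𝄞b"
        System.out.println(charBigrams(s));      // 3 grams, two contain lone surrogates
        System.out.println(codePointBigrams(s)); // 2 grams, both well-formed
    }
}
```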

      1. LUCENE-5042.patch
        66 kB
        Adrien Grand
      2. LUCENE-5042.patch
        16 kB
        Adrien Grand

        Activity

        Adrien Grand added a comment -

        Patch:

        • Computes n-grams based on unicode code points instead of java chars
        • Adds the ability to split the input stream on some chars like CharTokenizer
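The code-point conversion the patch performs could be sketched as follows; this is a hypothetical illustration, not the patch's actual code, and the toCodePoints name is assumed. It converts a char[] region into an int[] of Unicode code points using the bounded Character.codePointAt(char[], int, int):

```java
import java.util.Arrays;

public class CharacterUtilsSketch {
    // Convert src[off..off+len) to an array of Unicode code points.
    static int[] toCodePoints(char[] src, int off, int len) {
        int[] dest = new int[len]; // upper bound; the result may be shorter
        int count = 0;
        for (int i = off; i < off + len; ) {
            // Bounded variant: never reads past off + len.
            int cp = Character.codePointAt(src, i, off + len);
            dest[count++] = cp;
            i += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return Arrays.copyOf(dest, count);
    }
}
```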
        Simon Willnauer added a comment -

        hey adrien, this looks very cool! I have a couple of minor comments:

        • can we factor out the toCodepoints calculation into a method, for instance in CharacterUtils? I think we use this elsewhere in a similar way, and you might want to reuse it in the future as well.
        • can we have a comment on NGramTokenizer that every method should be final except for isTokenChar?
        • if you can think of a hard limit for the while(true) loop in NGramTokenizer, can we add an assert that makes sure we always make progress, i.e. never walk backwards or consume nothing? Not sure if it is possible.
        • can you use more parentheses for readability, like in:
        if (gramSize > maxGram || bufferStart + gramSize > bufferEnd) 
        // vs.
        
        if (gramSize > maxGram || (bufferStart + gramSize) > bufferEnd) 
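The progress assertion suggested above can be sketched in a toy gram emitter (hypothetical code, not NGramTokenizer itself): inside the unbounded while(true) loop, remember the position at the top of the iteration and assert it strictly advanced before looping again.

```java
public class ProgressCheckedGrams {
    private int pos = 0; // next code point to start a gram at

    /** Emits the next bigram over input, skipping space code points. */
    String nextBigram(int[] input) {
        while (true) {
            int before = pos; // position at the top of this iteration
            if (pos + 2 > input.length) {
                return null; // exhausted
            }
            if (input[pos] == ' ' || input[pos + 1] == ' ') {
                pos++; // skip grams that would cover a separator
                // The assert Simon asks for: we must never loop without
                // moving forward, or the loop would spin forever.
                assert pos > before : "loop made no progress";
                continue;
            }
            String gram = new String(input, pos, 2); // 2 code points
            pos++;
            return gram;
        }
    }
}
```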
        
        Adrien Grand added a comment -

        Thanks for the review Simon, here is a new patch that should address your concerns. Additionally:

        • it also fixes the other (edge) n-gram tokenizers and filters
        • I factored out some methods into CharacterUtils
        • I ran into a bug because Character.codePointAt(char[], int) doesn't know where the char[] ends, which can be a problem when working with buffers that are not completely filled. So I made this API forbidden and fixed other places that relied on it: codePointAt(char[], int, int) looks safer to me.
        • I changed the CharacterUtils.fill API so that it reads fully (which it didn't do, although the documentation stated it does).
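The buffer bug described above is easy to reproduce in isolation. Without a limit, Character.codePointAt(char[], int) has no notion of where the valid data ends, so a stale low surrogate left over from a previous fill gets paired with a trailing high surrogate:

```java
public class CodePointAtDemo {
    public static void main(String[] args) {
        char[] buf = new char[8];
        buf[0] = '\uD834'; // high surrogate: last char of the *valid* region
        buf[1] = '\uDD1E'; // stale low surrogate from a previous buffer fill
        int valid = 1;     // only buf[0..valid) holds fresh data

        // Unbounded variant: pairs the stale char -> bogus U+1D11E.
        System.out.println(Character.codePointAt(buf, 0) == 0x1D11E);

        // Bounded variant: respects the limit and returns the lone
        // high surrogate, so the caller knows to refill the buffer.
        System.out.println(Character.codePointAt(buf, 0, valid) == 0xD834);
    }
}
```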
        Simon Willnauer added a comment -

        Wow! This looks very cool! I wonder if we should rename the CharacterUtils classes to UTF32CharUtils and UTF16CharUtils rather than Java5 and Java4; I think that makes more sense in terms of how we use them today?

        Otherwise I am +1 to commit! good stuff Adrien!

        Simon Willnauer added a comment -

        I opened LUCENE-5054 for this

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492185

        LUCENE-5042: Fix the n-gram tokenizers and filters.

        This commit fixes n-gram tokenizers and filters so that they handle
        supplementary characters correctly and adds the ability to pre-tokenize the
        stream in tokenizers.
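The pre-tokenization the commit adds can be sketched standalone (this is not Lucene's actual API; the class and method names below are assumptions): an overridable isTokenChar(int) decides which code points belong to tokens, and n-grams are generated inside each token only, so grams never straddle a whitespace boundary.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceNGrams {
    // Subclasses override this to change the pre-tokenization rule,
    // mirroring the "every method final except isTokenChar" idea.
    protected boolean isTokenChar(int codePoint) {
        return !Character.isWhitespace(codePoint); // split on whitespace
    }

    List<String> bigrams(String text) {
        List<String> grams = new ArrayList<>();
        StringBuilder token = new StringBuilder();
        for (int i = 0; i <= text.length(); ) {
            // Pretend there is a separator past the end to flush the last token.
            int cp = i < text.length() ? text.codePointAt(i) : ' ';
            if (i < text.length() && isTokenChar(cp)) {
                token.appendCodePoint(cp);
            } else {
                emitBigrams(token, grams); // gram boundaries stop at separators
                token.setLength(0);
            }
            i += Character.charCount(cp);
        }
        return grams;
    }

    private void emitBigrams(StringBuilder token, List<String> grams) {
        int[] cps = token.codePoints().toArray(); // gram over code points
        for (int j = 0; j + 2 <= cps.length; j++) {
            grams.add(new String(cps, j, 2));
        }
    }
}
```

With this rule, an input like "ab cd" yields only grams within each word, never a gram containing the space.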

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492231

        LUCENE-5042: Fix the n-gram tokenizers and filters.

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492257

        LUCENE-5042: Reset the CharBuffer in TokenStream.reset().

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492259

        LUCENE-5042: Reset the CharBuffer in TokenStream.reset().

        Commit Tag Bot added a comment -

        [trunk commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1492420

        LUCENE-5042: Maven configuration: add chars.txt to forbiddenapis config

        Commit Tag Bot added a comment -

        [branch_4x commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1492422

        LUCENE-5042: Maven configuration: add chars.txt to forbiddenapis config (merged trunk r1492420)

        Commit Tag Bot added a comment -

        [trunk commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492543

        LUCENE-5042: Refuse to convert if the length is negative.

        Commit Tag Bot added a comment -

        [branch_4x commit] jpountz
        http://svn.apache.org/viewvc?view=revision&revision=1492549

        LUCENE-5042: Refuse to convert if the length is negative.

        Steve Rowe added a comment -

        Bulk close resolved 4.4 issues


          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Development