Lucene - Core / LUCENE-8509

NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 8.0

    Description

      Discovered by an elasticsearch user and described here: https://github.com/elastic/elasticsearch/issues/33710

      The ngram tokenizer produces the tokens "a b" and " bb" (note the space at the beginning of the second token). The WDGF takes the first token and splits it in two, adjusting the offsets of the second part, so we get "a"[0,1] and "b"[2,3]. The trim filter removes the leading space from the second token but leaves its offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space has already been stripped, WDGF finds nothing to split on and emits the token as-is. The start offsets of the resulting token stream are therefore [0, 2, 1], which go backwards, and the IndexWriter rejects the document.
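      The offset arithmetic above can be reproduced with a minimal sketch. Note this does not use Lucene's APIs: the `Token` record and the `analyze()` method are illustrative stand-ins that model what each stage of the chain does to text and offsets in this example.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetBug {
    // Hypothetical stand-in for a Lucene token: term text plus start/end character offsets.
    record Token(String text, int start, int end) {}

    public static List<Token> analyze() {
        // NGramTokenizer output from the issue description: "a b"[0,3] and " bb"[1,4]
        List<Token> ngrams = List.of(new Token("a b", 0, 3), new Token(" bb", 1, 4));

        // TrimFilter: strips leading/trailing whitespace but leaves offsets unchanged
        List<Token> trimmed = new ArrayList<>();
        for (Token t : ngrams) {
            trimmed.add(new Token(t.text().strip(), t.start(), t.end()));
        }

        // WDGF (modeled): split on internal delimiters, computing each part's offsets
        // from its position in the term text; a token with no internal delimiter is
        // emitted as-is, offsets untouched.
        List<Token> out = new ArrayList<>();
        for (Token t : trimmed) {
            String[] parts = t.text().split(" ");
            if (parts.length == 1) {
                out.add(t); // nothing to split: pass through unchanged
                continue;
            }
            int pos = 0;
            for (String part : parts) {
                int idx = t.text().indexOf(part, pos);
                out.add(new Token(part, t.start() + idx, t.start() + idx + part.length()));
                pos = idx + part.length();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints a[0,1], b[2,3], bb[1,4]: start offsets 0, 2, 1 go backwards,
        // which is what the IndexWriter rejects.
        for (Token t : analyze()) {
            System.out.println(t.text() + "[" + t.start() + "," + t.end() + "]");
        }
    }
}
```

      The trimmed-but-unadjusted offsets on "bb" are the crux: each filter behaves sensibly in isolation, but trimming before WDGF destroys the information WDGF would have used to move the start offset past the space.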

      Attachments

        1. LUCENE-8509.patch (16 kB, Alan Woodward)
        2. LUCENE-8509.patch (9 kB, Alan Woodward)


          People

            Assignee: Alan Woodward (romseygeek)
            Reporter: Alan Woodward (romseygeek)
            Votes: 0
            Watchers: 6
