[LUCENE-8717] Handle stop words that appear at articulation points - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None
Lucene Fields: New

Description

Our TokenFilters currently cannot handle a multi-term synonym that starts with a stopword. For example, given a synonym file containing the mapping "the walking dead => twd" and a standard English stopword filter, QueryBuilder will produce incorrect queries.

The tricky part is that our standard way of dealing with stopwords, which is to remove them entirely from the token stream and apply a larger position increment to the next token, doesn't work when the removed token has a position length greater than 1. There are various tricks for increasing the position length of the previous token instead, but these fail if the stopword is the first token in the token stream, or if the side path contains multiple stopwords.
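To make the failure mode concrete, here is a minimal sketch in plain Java. It does not use Lucene: `Token`, `removeStopwords`, and `arcs` are hypothetical names that only model position increment and position length, and the "it" / "information technology" graph is an invented input rather than the output of any real filter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopwordGapSketch {

    /** Hypothetical stand-in for a token's term, position increment and position length. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;

        Token(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Drops stopwords and folds their position increments into the next kept token. */
    static List<Token> removeStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopwords.contains(t.term)) {
                // Only the increment is carried forward; the dropped token's posLen is lost.
                skipped += t.posInc;
            } else {
                out.add(new Token(t.term, t.posInc + skipped, t.posLen));
                skipped = 0;
            }
        }
        return out;
    }

    /** Renders each token as "term[start-end]" using absolute positions. */
    static List<String> arcs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(t.term + "[" + pos + "-" + (pos + t.posLen) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "it");

        // Linear stream: the gap left by "the" survives as a larger increment.
        List<Token> linear = List.of(
            new Token("the", 1, 1), new Token("walking", 1, 1), new Token("dead", 1, 1));
        System.out.println(arcs(removeStopwords(linear, stop)));
        // prints [walking[1-2], dead[2-3]]

        // Invented graph: the stopword "it" spans two positions next to a two-token side path.
        List<Token> graph = List.of(
            new Token("it", 1, 2), new Token("information", 0, 1),
            new Token("technology", 1, 1), new Token("jobs", 1, 1));
        System.out.println(arcs(graph));
        // prints [it[0-2], information[0-1], technology[1-2], jobs[2-3]]
        System.out.println(arcs(removeStopwords(graph, stop)));
        // prints [information[0-1], technology[1-2], jobs[2-3]]
    }
}
```

In the linear case the positions of the surviving tokens are unchanged. In the graph case the arc from 0 to 2 disappears entirely, so a consumer can no longer tell that there was an alternative path that skipped the side path.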

Instead, I'd like to propose adding a new TermDeletedAttribute, set only on tokens that should be removed from the stream but that carry necessary information about the structure of the token graph. These tokens can then be removed by GraphTokenStreamFiniteStrings at query time, and by FlattenGraphFilter at index time.
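The idea can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: the `deleted` flag stands in for the proposed TermDeletedAttribute, `markStopwords` and `paths` are hypothetical helpers, and the path enumeration is a simplification that is not taken from GraphTokenStreamFiniteStrings or from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TermDeletedSketch {

    /** Hypothetical token model; `deleted` plays the role of the proposed attribute. */
    static final class Token {
        final String term;
        final int posInc;
        final int posLen;
        final boolean deleted;

        Token(String term, int posInc, int posLen, boolean deleted) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.deleted = deleted;
        }
    }

    /** Keeps every token in the stream but flags stopwords as deleted. */
    static List<Token> markStopwords(List<Token> in, Set<String> stopwords) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(new Token(t.term, t.posInc, t.posLen, stopwords.contains(t.term)));
        }
        return out;
    }

    /** Enumerates every path through the graph, omitting the text of deleted tokens. */
    static List<String> paths(List<Token> tokens) {
        int[] start = new int[tokens.size()];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            pos += t.posInc;
            start[i] = pos;
            end = Math.max(end, pos + t.posLen);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, start, 0, end, new ArrayList<>(), out);
        return out;
    }

    private static void walk(List<Token> tokens, int[] start, int at, int end,
                             List<String> terms, List<String> out) {
        if (at == end) {
            out.add(String.join(" ", terms));
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (start[i] != at) {
                continue;
            }
            Token t = tokens.get(i);
            // A deleted token still contributes its arc, just not its term.
            if (!t.deleted) {
                terms.add(t.term);
            }
            walk(tokens, start, at + t.posLen, end, terms, out);
            if (!t.deleted) {
                terms.remove(terms.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        // Invented graph: the stopword "it" spans two positions next to a two-token side path.
        List<Token> graph = List.of(
            new Token("it", 1, 2, false), new Token("information", 0, 1, false),
            new Token("technology", 1, 1, false), new Token("jobs", 1, 1, false));
        System.out.println(paths(markStopwords(graph, Set.of("it"))));
        // prints [jobs, information technology jobs]
    }
}
```

Because the flagged token stays in the stream, both paths remain visible to the consumer, which drops only the stopword's text. This sketch discards the positional gap when rendering a path; a real consumer would presumably need to preserve it.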

Attachments

- LUCENE-8717.patch, 11/Mar/19 13:14, 43 kB, Alan Woodward
- LUCENE-8717.patch, 06/Mar/19 10:41, 33 kB, Alan Woodward

People

Assignee: Alan Woodward
Reporter: Alan Woodward
Votes: 0
Watchers: 6

Dates

Created: 06/Mar/19 10:40
Updated: 28/Aug/22 15:42