Description
Assigning to myself so I don't lose track of it.
LUCENE-7705 introduced the ability to configure the maximum token length for these tokenizers rather than hard-coding it to 255. It has always been the case that when the hard-coded limit was exceeded, multiple tokens were emitted. However, the tests for LUCENE-7705 exposed a problem.
Suppose the max length is 3 and the doc contains "letter". Two tokens are emitted and indexed: "let" and "ter".
Now suppose the search is for "lett". If the default operator is AND, or if phrase queries are constructed, the query fails, since the tokens emitted for "lett" are "let" and "t". Only if the operator is OR is the document found, and even then the results aren't correct: searching for "lett" would match a document indexed with "bett", since "bett" is indexed as "bet" and "t" and the query matches on the bare "t".
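Here's a minimal standalone way to see the token splitting (this assumes the maxTokenLen constructor that LUCENE-7705 added to LetterTokenizer; the exact signature may differ):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class MaxTokenLenDemo {
  // Print every token a LetterTokenizer with maxTokenLen=3 emits.
  static void dump(String text) throws Exception {
    Tokenizer tok =
        new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(text + " -> " + term);
    }
    tok.end();
    tok.close();
  }

  public static void main(String[] args) throws Exception {
    dump("letter"); // emits "let" and "ter" (the indexed doc)
    dump("lett");   // emits "let" and "t" (the query text)
  }
}
{code}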
Proposal:
When maxTokenLen is exceeded, the remainder of the token should be ignored rather than emitted as a separate token.
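A simplified sketch of the idea; tokenize() below is a hypothetical stand-in, not the actual CharTokenizer code, which also has to manage its I/O buffer and offsets:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class TruncateSketch {
  // Hypothetical stand-in for a CharTokenizer-style loop: once a token
  // reaches maxTokenLen, swallow the rest of the token instead of
  // starting a new one (current behavior emits the overflow as a
  // separate token, e.g. "ter" or "t").
  static List<String> tokenize(String input, int maxTokenLen) {
    List<String> tokens = new ArrayList<>();
    StringBuilder buf = new StringBuilder();
    for (char c : input.toCharArray()) {
      if (Character.isLetter(c)) {        // stands in for isTokenChar()
        if (buf.length() < maxTokenLen) {
          buf.append(c);                  // keep chars up to the limit
        }                                 // else: ignore the remainder
      } else if (buf.length() > 0) {
        tokens.add(buf.toString());       // non-token char ends the token
        buf.setLength(0);
      }
    }
    if (buf.length() > 0) {
      tokens.add(buf.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("letter", 3)); // [let], not [let, ter]
    System.out.println(tokenize("lett", 3));   // [let], not [let, t]
  }
}
{code}

With this behavior both the indexed doc ("letter" -> "let") and the query ("lett" -> "let") produce the same token, so AND and phrase queries match again, and "lett" no longer matches "bett" via the bare "t".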
rcmuir, steve_rowe, tomasflobbe: comments? Again, this behavior was not introduced by LUCENE-7705; it's just that it would be very hard to notice with the default 255-character limit.
I'm not quite sure why master generates a parsed query of:
field:let field:t
and 6x generates
field:"let t"
so the tests succeeded on master but not on 6x.
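For reference, here's a minimal way to print the parsed query (the field name "field" is arbitrary, and the same LetterTokenizer constructor as above is assumed):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.util.AttributeFactory;

public class ParsedQueryDemo {
  public static void main(String[] args) throws Exception {
    // Analyzer that splits tokens at 3 chars, as in the example above.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new LetterTokenizer(
            AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3));
      }
    };
    QueryParser qp = new QueryParser("field", analyzer);
    // Prints field:let field:t on master; 6x yields field:"let t" instead.
    System.out.println(qp.parse("lett"));
  }
}
{code}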
Issue Links
- is related to LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length (Resolved)