Description
Assigning to myself so I don't lose track of it.
LUCENE-7705 introduced the ability to configure the maximum token length for these tokenizers rather than hard-coding it to 255. It has always been the case that when the hard-coded limit was exceeded, multiple tokens were emitted. However, the tests for LUCENE-7705 exposed a problem.
Suppose the max length is 3 and the doc contains "letter". Two tokens are emitted and indexed: "let" and "ter".
Now suppose the search is for "lett". If the default operator is AND, or if phrase queries are constructed, the query fails, since the tokens emitted for "lett" are "let" and "t". Only if the operator is OR is the document found, and even then the results aren't correct: searching for "lett" would match a document indexed with "bett", since "bett" is indexed as "bet" and "t" and the query matches on the bare "t".
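Here's a minimal standalone way to see the token splitting (this assumes the maxTokenLen constructor that LUCENE-7705 added to LetterTokenizer; the exact signature may differ):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class MaxTokenLenDemo {
  // Print every token a LetterTokenizer with maxTokenLen=3 emits.
  static void dump(String text) throws Exception {
    Tokenizer tok =
        new LetterTokenizer(AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3);
    tok.setReader(new StringReader(text));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(text + " -> " + term);
    }
    tok.end();
    tok.close();
  }

  public static void main(String[] args) throws Exception {
    dump("letter"); // emits "let" and "ter" (the indexed doc)
    dump("lett");   // emits "let" and "t" (the query text)
  }
}
{code}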
Proposal:
When maxTokenLen is exceeded, the remainder of the token should be ignored rather than emitted as a separate token.
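A simplified sketch of the idea; tokenize() below is a hypothetical stand-in, not the actual CharTokenizer code, which also has to manage its I/O buffer and offsets:

{code:java}
import java.util.ArrayList;
import java.util.List;

public class TruncateSketch {
  // Hypothetical stand-in for a CharTokenizer-style loop: once a token
  // reaches maxTokenLen, swallow the rest of the token instead of
  // starting a new one (current behavior emits the overflow as a
  // separate token, e.g. "ter" or "t").
  static List<String> tokenize(String input, int maxTokenLen) {
    List<String> tokens = new ArrayList<>();
    StringBuilder buf = new StringBuilder();
    for (char c : input.toCharArray()) {
      if (Character.isLetter(c)) {        // stands in for isTokenChar()
        if (buf.length() < maxTokenLen) {
          buf.append(c);                  // keep chars up to the limit
        }                                 // else: ignore the remainder
      } else if (buf.length() > 0) {
        tokens.add(buf.toString());       // non-token char ends the token
        buf.setLength(0);
      }
    }
    if (buf.length() > 0) {
      tokens.add(buf.toString());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("letter", 3)); // [let], not [let, ter]
    System.out.println(tokenize("lett", 3));   // [let], not [let, t]
  }
}
{code}

With this behavior both the indexed doc ("letter" -> "let") and the query ("lett" -> "let") produce the same token, so AND and phrase queries match again, and "lett" no longer matches "bett" via the bare "t".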
rcmuir, steve_rowe, tomasflobbe: comments? Again, this behavior was not introduced by LUCENE-7705; it's just that it would be very hard to notice with the default 255-character limit.
I'm not quite sure why master generates a parsed query of:
field:let field:t
and 6x generates
field:"let t"
so the tests succeeded on master but not on 6x.
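For reference, here's a minimal way to print the parsed query (the field name "field" is arbitrary, and the same LetterTokenizer constructor as above is assumed):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.util.AttributeFactory;

public class ParsedQueryDemo {
  public static void main(String[] args) throws Exception {
    // Analyzer that splits tokens at 3 chars, as in the example above.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new LetterTokenizer(
            AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 3));
      }
    };
    QueryParser qp = new QueryParser("field", analyzer);
    // Prints field:let field:t on master; 6x yields field:"let t" instead.
    System.out.println(qp.parse("lett"));
  }
}
{code}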
Issue Links
- is related to LUCENE-7705: Allow CharTokenizer-derived tokenizers and KeywordTokenizer to configure the max token length (Resolved)