Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Implemented
- Lucene Fields: New, Patch Available
Description
When using the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE|CONCATENATE_ALL together with the WordDelimiterFilter, we get duplicate tokens on strings containing only case changes.
With SPLIT_ON_CASE_CHANGE, "abcDef" is split into "abc" and "Def".
With PRESERVE_ORIGINAL, we also keep the original "abcDef".
However, with CONCATENATE_ALL (or CATENATE_WORDS?), another token is added, built from the concatenation of the split words, which gives "abcDef" again.
I'm not 100% certain that token filters must not produce duplicate tokens (same word, same start and end positions). Can someone confirm this is a bug?
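The interaction described above can be illustrated with a minimal, self-contained sketch. This is not Lucene code; it is a hypothetical simplified model of the three flags (the method and class names are invented for illustration), showing how PRESERVE_ORIGINAL and CONCATENATE_ALL each reconstruct the full input when the only delimiters are case changes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of WordDelimiterFilter's flag interaction (not Lucene code).
public class CaseSplitDemo {
    static List<String> tokens(String input, boolean preserveOriginal,
                               boolean splitOnCaseChange, boolean catenateAll) {
        // SPLIT_ON_CASE_CHANGE: break the input at lower->upper transitions.
        List<String> parts = new ArrayList<>();
        if (splitOnCaseChange) {
            StringBuilder cur = new StringBuilder();
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                if (cur.length() > 0 && Character.isUpperCase(c)
                        && Character.isLowerCase(input.charAt(i - 1))) {
                    parts.add(cur.toString());
                    cur.setLength(0);
                }
                cur.append(c);
            }
            if (cur.length() > 0) parts.add(cur.toString());
        } else {
            parts.add(input);
        }

        List<String> out = new ArrayList<>();
        // PRESERVE_ORIGINAL: emit the unmodified input first.
        if (preserveOriginal) out.add(input);
        out.addAll(parts);
        // CONCATENATE_ALL: emit the concatenation of all split parts.
        // With only case changes as delimiters, this rebuilds the original
        // string, duplicating the PRESERVE_ORIGINAL token.
        if (catenateAll && parts.size() > 1) out.add(String.join("", parts));
        return out;
    }

    public static void main(String[] args) {
        // prints [abcDef, abc, Def, abcDef] - the original appears twice
        System.out.println(tokens("abcDef", true, true, true));
    }
}
```

Under this model, "abcDef" yields the original token, the two case-split parts, and then the catenated token, which is byte-identical to the original.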
Attachments
Issue Links
- requires: LUCENE-7003 Adding a class to help debug a TokenFilter (Resolved)