Lucene - Core / LUCENE-7004

Duplicate tokens using WordDelimiterFilter for a specific configuration


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Implemented
    • Lucene Fields: New, Patch Available

    Description

      When using the options PRESERVE_ORIGINAL, SPLIT_ON_CASE_CHANGE, and CATENATE_ALL together with the WordDelimiterFilter, we get duplicate tokens on strings containing only case changes.

      When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", "Def".

      With PRESERVE_ORIGINAL, we also keep "abcDef".

      However, when one also uses CATENATE_ALL (or CATENATE_WORDS?), the filter emits another token built from the concatenation of the split words, giving "abcDef" again.

      I'm not 100% certain that token filters must not produce duplicate tokens (same text, same start and end offsets). Can someone confirm this is a bug?
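      To make the interaction concrete, here is a minimal sketch (not the actual Lucene filter; the class and method names are invented for illustration) of the token emission the description walks through. With all three flags active, the catenated token reproduces the preserved original:

      ```java
      import java.util.ArrayList;
      import java.util.List;

      // Toy model of PRESERVE_ORIGINAL | SPLIT_ON_CASE_CHANGE | CATENATE_ALL.
      // This is NOT Lucene's implementation, just the behavior the issue describes.
      public class WdfSketch {

          // Split "abcDef" into ["abc", "Def"] at lower-to-upper case changes.
          static List<String> splitOnCaseChange(String s) {
              List<String> parts = new ArrayList<>();
              int start = 0;
              for (int i = 1; i < s.length(); i++) {
                  if (Character.isLowerCase(s.charAt(i - 1))
                          && Character.isUpperCase(s.charAt(i))) {
                      parts.add(s.substring(start, i));
                      start = i;
                  }
              }
              parts.add(s.substring(start));
              return parts;
          }

          // Emit tokens the way the description explains:
          // the preserved original, each case-change part, and the
          // catenation of all parts (which equals the original again).
          static List<String> emit(String input) {
              List<String> out = new ArrayList<>();
              out.add(input);                    // PRESERVE_ORIGINAL
              List<String> parts = splitOnCaseChange(input);
              out.addAll(parts);                 // SPLIT_ON_CASE_CHANGE
              out.add(String.join("", parts));   // CATENATE_ALL -> duplicate
              return out;
          }

          public static void main(String[] args) {
              // Prints [abcDef, abc, Def, abcDef]: "abcDef" appears twice,
              // with the same text and the same start/end offsets.
              System.out.println(emit("abcDef"));
          }
      }
      ```

      Since the original token and the catenated token span the same offsets, the duplicate carries no extra information, which is why the reporter suspects a bug rather than intended behavior.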

      Attachments

        1. wdf-analysis.png (15 kB, Shawn Heisey)
        2. FIX-LUCENE-7004.PATCH (7 kB, Jean-Baptiste Lespiau)
        3. TEST-LUCENE-7004.PATCH (19 kB, Jean-Baptiste Lespiau)


            People

              Assignee: Unassigned
              Reporter: Jean-Baptiste Lespiau (Gueust)
              Votes: 0
              Watchers: 5
