[LUCENE-9581] Clarify discardCompoundToken behavior in the JapaneseTokenizer - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.8
Component/s: None
Labels: None

Lucene Fields: New

Description

At first sight, the discardCompoundToken option added in LUCENE-9123 seems redundant with the NORMAL mode of the Japanese tokenizer. When set to true, its current behavior is to disable the decomposition of compounds, which is exactly what the NORMAL mode does.

So I wonder whether the right semantics for the option would be to keep only the decomposition of the compound, or whether the option is needed at all. If the goal is to make the output compatible with a graph token filter, the current workaround of setting the mode to NORMAL should be enough.

That is consistent with the mode that should be used to preserve positions in the index, since we don't handle position length on the indexing side.

Am I missing something regarding the new option? Is there a compelling case where it differs from the NORMAL mode?
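
To make the question concrete, here is a minimal sketch of the token sets under discussion. It is a toy model and does not call Lucene: the class `CompoundSemanticsSketch`, the `Semantics` labels, and the `emit` method are all hypothetical names introduced for illustration, and the romanized strings stand in for a compound such as 関西国際空港 and its parts 関西, 国際, 空港. Token order and position length are deliberately not modeled. The `DISCARD_AS_REPORTED` case follows the behavior described above, and `DISCARD_AS_PROPOSED` follows the alternative semantics suggested above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: it does not call Lucene and ignores token order and position length. */
public class CompoundSemanticsSketch {

    /** Hypothetical labels for the behaviors discussed in this issue. */
    public enum Semantics {
        NORMAL_MODE,         // no decomposition: the compound is the only token
        SEARCH_MODE,         // compound plus its decomposition (graph output)
        DISCARD_AS_REPORTED, // option set to true, as described in this issue: decomposition disabled
        DISCARD_AS_PROPOSED  // option set to true, as suggested in this issue: decomposition only
    }

    /** Returns the set of tokens kept for one compound under the given semantics. */
    public static List<String> emit(String compound, List<String> parts, Semantics semantics) {
        List<String> out = new ArrayList<>();
        switch (semantics) {
            case NORMAL_MODE:
            case DISCARD_AS_REPORTED:
                // Identical output: this overlap is the redundancy the issue points at.
                out.add(compound);
                break;
            case SEARCH_MODE:
                out.add(compound);
                out.addAll(parts);
                break;
            case DISCARD_AS_PROPOSED:
                // Only the decomposition survives, so no token spans several positions.
                out.addAll(parts);
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        // Romanized stand-ins for the Japanese compound and its parts.
        List<String> parts = List.of("Kansai", "Kokusai", "Kuukou");
        for (Semantics s : Semantics.values()) {
            System.out.println(s + " -> " + emit("KansaiKokusaiKuukou", parts, s));
        }
    }
}
```

Under this model the option only differs from the NORMAL mode if it keeps the decomposition and drops the compound, which is the alternative raised in the description.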

Attachments

- LUCENE-9581.patch, 10/Nov/20 09:44, 8 kB, Jim Ferenczi
- LUCENE-9581.patch, 22/Oct/20 13:54, 14 kB, Jim Ferenczi
- LUCENE-9581.patch, 21/Oct/20 02:30, 2 kB, Kazuaki Hiraga

People

Assignee: Unassigned

Reporter: Jim Ferenczi

Votes: 0

Watchers: 6

Dates

Created: 20/Oct/20 10:31

Updated: 28/Aug/22 16:09

Resolved: 23/Nov/20 08:13