Lucene - Core / LUCENE-7181

JapaneseTokenizer: Validate segmentation of User Dictionary entries on creation

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

Description

    From the conversation on the dev list:

    The user dictionary in the JapaneseTokenizer allows users to customize how a stream is broken into tokens, using rules of the form:

    AABBBCC -> AA BBB CC
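
    For concreteness, here is a minimal sketch of how such a rule is exercised, assuming the Lucene 6-era kuromoji API in which UserDictionary still exposes a public Reader constructor (newer releases use UserDictionary.open(Reader) instead). The entry splits the compound 関西国際空港 the same way the AABBBCC -> AA BBB CC rule above does:

        import java.io.StringReader;
        import org.apache.lucene.analysis.ja.JapaneseTokenizer;
        import org.apache.lucene.analysis.ja.dict.UserDictionary;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        public class UserDictDemo {
          public static void main(String[] args) throws Exception {
            // One rule per line: surface,segmentation,readings,part-of-speech.
            String rules =
                "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞\n";
            UserDictionary dict = new UserDictionary(new StringReader(rules));

            JapaneseTokenizer tok =
                new JapaneseTokenizer(dict, true, JapaneseTokenizer.Mode.NORMAL);
            tok.setReader(new StringReader("関西国際空港に行った"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
              System.out.println(term.toString()); // 関西 / 国際 / 空港 / に / 行っ / た
            }
            tok.end();
            tok.close();
          }
        }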

    It does not, however, allow users to change any of the token characters, for example:

    (1) AABBBCC -> DD BBB CC (this silently tokenizes to "AA", "BBB", "CC"; only the split positions are honored)
    It also doesn't let a character be part of more than one token, for example:

    (2) AABBBCC -> AAB BBB BCC (this throws an ArrayIndexOutOfBoundsException)

    ...or make the output tokens longer than the input text:

    (3) AA -> AAA (also an ArrayIndexOutOfBoundsException)

    Currently there is no validation for these cases: case 1 does not fail but produces unexpected tokens, while cases 2 and 3 fail only later, when the input text is analyzed. We should add validation at UserDictionary creation time; a sketch of one possible check follows.
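
    The sketch below is illustrative, not taken from the attached patch; validateEntry and its message are hypothetical names. All three cases make the segmentation's concatenated characters differ from the surface form, so a single equality test rejects them up front:

        // Hypothetical helper; names and message are illustrative only.
        // surface is the raw entry text, segmentation its space-separated split.
        static void validateEntry(String surface, String segmentation) {
          String concatenated = segmentation.replace(" ", "");
          // Case 1 changes characters, case 2 duplicates the overlapped ones,
          // and case 3 emits more characters than the input contains; in every
          // case the concatenation no longer equals the surface form.
          if (!concatenated.equals(surface)) {
            throw new IllegalArgumentException(
                "Invalid user dictionary entry: segmentation \"" + segmentation
                    + "\" does not match surface form \"" + surface + "\"");
          }
        }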

Attachments

    1. LUCENE-7181.patch (3 kB, Tomas Eduardo Fernandez Lobbe)

Activity

People

    Assignee: Christian Moen
    Reporter: Tomas Eduardo Fernandez Lobbe
    Votes: 0
    Watchers: 1

Dates

    Created:
    Updated: