Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Lucene Fields: New
Description
From the conversation on the dev list:
The user dictionary in the JapaneseTokenizer lets users customize how a stream is broken into tokens by supplying segmentation rules such as:
AABBBCC -> AA BBB CC
It does not allow users to change any of the token characters, as in:
(1) AABBBCC -> DD BBB CC (this just tokenizes to "AA", "BBB", "CC"; only the segment positions seem to matter)
It also doesn't let a character be part of more than one token, as in:
(2) AABBBCC -> AAB BBB BCC (this throws an ArrayIndexOutOfBoundsException)
...or make the output tokens longer than the input text:
(3) AA -> AAA (also an ArrayIndexOutOfBoundsException)
Currently there is no validation for these cases. Case 1 does not fail but produces unexpected tokens. Cases 2 and 3 fail only later, when the input text is analyzed. We should add validation at UserDictionary creation so that such entries are rejected up front.
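A rough sketch of the kind of check that could run at dictionary creation, assuming the segmentation is a whitespace-separated list of segments as in the examples above. The class and method names (UserDictionaryEntryValidator, validate, isValid) are hypothetical and not part of the existing UserDictionary code. The idea is simply that the segments, concatenated in order, must reproduce the surface text exactly, which rejects all three cases in one comparison.

```java
/**
 * Hypothetical helper, not existing Lucene API: checks that a user dictionary
 * segmentation is a pure split of its surface text.
 */
final class UserDictionaryEntryValidator {

    private UserDictionaryEntryValidator() {}

    /**
     * Throws IllegalArgumentException unless the whitespace-separated segments,
     * concatenated in order, are identical to the surface text.
     */
    static void validate(String surface, String segmentation) {
        // Assumption: segments are separated by one or more whitespace characters.
        String[] segments = segmentation.trim().split("\\s+");
        String joined = String.join("", segments);

        if (joined.length() != surface.length()) {
            // Covers case 2 (overlapping segments) and case 3 (output longer than input).
            throw new IllegalArgumentException(
                "Illegal user dictionary entry '" + surface + "': segmentation '"
                    + segmentation + "' covers " + joined.length()
                    + " characters, but the surface has " + surface.length());
        }
        if (!joined.equals(surface)) {
            // Covers case 1 (segments that change the token characters).
            throw new IllegalArgumentException(
                "Illegal user dictionary entry '" + surface + "': segmentation '"
                    + segmentation + "' does not match the surface characters");
        }
    }

    /** Convenience wrapper that reports the result as a boolean instead of throwing. */
    static boolean isValid(String surface, String segmentation) {
        try {
            validate(surface, segmentation);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With this sketch, "AABBBCC -> AA BBB CC" passes, while cases 1, 2 and 3 are rejected with an IllegalArgumentException when the entry is loaded rather than failing (or silently misbehaving) at analysis time.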