Details
Wish
Status: Resolved
Major
Resolution: Won't Fix
3.0.2
None
None
New, Patch Available
Description
While I understand some of the reasons for its design, the original LowerCaseTokenizer should have been named LowerCaseLetterTokenizer.
I feel that LowerCaseTokenizer makes too many assumptions about what to tokenize, and I have therefore patched it. The default behavior remains exactly as it has always been, to avoid breaking any existing implementations that use it.
I have changed LowerCaseTokenizer to extend CharTokenizer (rather than LetterTokenizer). LetterTokenizer's functionality was merged into the default behavior of LowerCaseTokenizer.
Getter/setter methods have been added to the LowerCaseTokenizer class, allowing you to turn splitting on whitespace, numbers, and special (non-alphanumeric) characters on or off.
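To illustrate the idea, here is a minimal, standalone sketch of how such switches might drive the per-character decision. It is not the attached patch and does not use Lucene: the class name ConfigurableLowerCaseTokenizer, the three flag names, and the tokenize(String) helper are all hypothetical. Only the isTokenChar/normalize split is modeled on CharTokenizer's hooks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the patched class and not a Lucene API. */
class ConfigurableLowerCaseTokenizer {
    // Assumed flags. All default to true, which reproduces the existing
    // behavior: tokens consist of letters only.
    private boolean splitOnWhitespace = true;
    private boolean splitOnDigits = true;
    private boolean splitOnSpecial = true;

    boolean isSplitOnWhitespace() { return splitOnWhitespace; }
    void setSplitOnWhitespace(boolean v) { splitOnWhitespace = v; }

    boolean isSplitOnDigits() { return splitOnDigits; }
    void setSplitOnDigits(boolean v) { splitOnDigits = v; }

    boolean isSplitOnSpecial() { return splitOnSpecial; }
    void setSplitOnSpecial(boolean v) { splitOnSpecial = v; }

    /** Returns true if c belongs inside a token under the current flags. */
    boolean isTokenChar(char c) {
        if (Character.isLetter(c)) return true;
        if (Character.isWhitespace(c)) return !splitOnWhitespace;
        if (Character.isDigit(c)) return !splitOnDigits;
        return !splitOnSpecial; // anything that is not a letter, digit, or whitespace
    }

    /** Lowercases every character that is kept. */
    char normalize(char c) {
        return Character.toLowerCase(c);
    }

    /** Splits text into lowercased tokens at every non-token character. */
    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(normalize(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With the defaults, "Wi-Fi 802.11N Router" yields wi, fi, n, router. After setSplitOnDigits(false) it yields wi, fi, 802, 11n, router, and after also calling setSplitOnSpecial(false) it yields wi-fi, 802.11n, router.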