Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
- call the Boolean Param "toLowercase"
- set default to false (so behavior does not change)
Q: Should conversion to lowercase happen before or after regex matching?
- Before: This is simpler.
- After: This gives the user full control, since the regex can still treat uppercase and lowercase characters differently.
--> I'd vote for conversion before matching. If a user needs full control, they can convert to lowercase manually.
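
To make the before/after question concrete, here is a minimal sketch in plain Python using the standard `re` module. It is not the Spark implementation: `regex_tokenize` and its `lowercase_before` flag are hypothetical names made up for this illustration, and the sketch assumes the regex matches tokens rather than the gaps between them.

```python
import re


def regex_tokenize(text, pattern, to_lowercase=False, lowercase_before=True):
    """Hypothetical stand-in for a regex tokenizer: returns all matches of `pattern`.

    `to_lowercase` mirrors the proposed Boolean Param (default false).
    `lowercase_before` is an illustration-only switch between the two orderings.
    """
    if to_lowercase and lowercase_before:
        # Option "Before": the regex only ever sees lowercase input.
        text = text.lower()
    tokens = re.findall(pattern, text)
    if to_lowercase and not lowercase_before:
        # Option "After": the regex sees the original case, tokens are lowercased last.
        tokens = [t.lower() for t in tokens]
    return tokens


text = "Spark ML has a RegexTokenizer"
pattern = r"[A-Z]\w*"  # case-sensitive: words that start with an uppercase letter

default = regex_tokenize(text, pattern)
before = regex_tokenize(text, pattern, to_lowercase=True, lowercase_before=True)
after = regex_tokenize(text, pattern, to_lowercase=True, lowercase_before=False)

print(default)  # ['Spark', 'ML', 'RegexTokenizer']
print(before)   # []
print(after)    # ['spark', 'ml', 'regextokenizer']
```

With a case-sensitive pattern the two orderings give different results, which is the trade-off above. A user who needs the "after" behavior under the "before" design can leave the option off and lowercase the tokens in a separate step.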