Thank you for taking the time to review this. You're right that the regex is better, but the 4 chars are probably OK, since this mimics what MockTokenizer will split on.
Working on this made me wonder whether WordBreakSpellChecker itself could be made more useful for non-Western languages if it were configurable to break/combine on characters other than the space. I have very little linguistic background, so I'm not sure whether there is a solid use case for this or how it would work. My guess is that it would be too complicated for now, if it is useful at all. But if anyone has thoughts in this direction, I wouldn't mind hearing them.