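WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from the Javadoc of that method, listing one of the conditions under which a character counts as whitespace: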
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
I think WhitespaceTokenizer should split on non-breaking spaces too. I am aware it's easy to work around, but why leave this trap in by default?
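
To make the trap concrete, here is a minimal sketch using only the JDK. It shows how Character.isWhitespace and Character.isSpaceChar classify the three non-breaking spaces named in the excerpt. The class NbspWhitespaceDemo and the helper isWhitespaceOrNbsp are hypothetical names made up for this illustration; they are not part of Lucene or the JDK.

```java
public class NbspWhitespaceDemo {

    // Hypothetical helper: true for anything Character.isWhitespace accepts,
    // plus the Unicode space characters it deliberately excludes
    // (the non-breaking spaces U+00A0, U+2007 and U+202F).
    static boolean isWhitespaceOrNbsp(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        // Ordinary space, tab, and the three non-breaking spaces from the Javadoc.
        int[] samples = {' ', '\t', 0x00A0, 0x2007, 0x202F};
        for (int c : samples) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b combined=%b%n",
                    c,
                    Character.isWhitespace(c),
                    Character.isSpaceChar(c),
                    isWhitespaceOrNbsp(c));
        }
    }
}
```

For the three non-breaking spaces, isWhitespace is false while isSpaceChar is true, so a tokenizer that relies on isWhitespace alone keeps them inside tokens. One possible workaround, assuming the tokenizer in use lets you override its per-character test (for example an isTokenChar method on a CharTokenizer subclass, whose exact signature depends on the Lucene version), is to treat a character as part of a token only when a predicate like isWhitespaceOrNbsp returns false.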