Details
- Improvement
- Status: Closed
- Minor
- Resolution: Fixed
- None
- None
- New
Description
WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from the Character.isWhitespace Javadoc:
It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')
Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?
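To make the distinction concrete, here is a small JDK-only check of the three non-breaking spaces named in the excerpt. The class and helper names are hypothetical and exist only for illustration; nothing here is Lucene code.

```java
// Illustration only: class and method names are made up for this example.
class NbspWhitespaceCheck {
    // The three non-breaking spaces listed in the Javadoc excerpt.
    static final int[] NON_BREAKING = {0x00A0, 0x2007, 0x202F};

    // What WhitespaceTokenizer is described as relying on.
    static boolean javaWhitespace(int codePoint) {
        return Character.isWhitespace(codePoint);
    }

    // Unicode space separators (SPACE_SEPARATOR, LINE_SEPARATOR, PARAGRAPH_SEPARATOR).
    static boolean unicodeSpace(int codePoint) {
        return Character.isSpaceChar(codePoint);
    }

    public static void main(String[] args) {
        for (int cp : NON_BREAKING) {
            System.out.printf("U+%04X isWhitespace=%b isSpaceChar=%b%n",
                    cp, javaWhitespace(cp), unicodeSpace(cp));
        }
    }
}
```

For each of the three code points, Character.isWhitespace reports false while Character.isSpaceChar reports true, which is exactly the gap described above.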
I think WhitespaceTokenizer should also tokenize on non-breaking spaces. I am aware it's easy to work around, but why leave this trap in by default?
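As a rough sketch of the kind of workaround meant here, the separator test can be widened to accept anything that either Character.isWhitespace or Character.isSpaceChar recognizes. This is a plain-JDK splitter with hypothetical names (NbspAwareSplitter, isSeparator, split), not Lucene's tokenizer and not necessarily how the issue was fixed; in Lucene the same predicate would presumably be plugged into a custom tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a standalone splitter, not a Lucene Tokenizer.
class NbspAwareSplitter {
    // Assumed separator rule: Java whitespace OR any Unicode space separator,
    // so non-breaking spaces also end a token.
    static boolean isSeparator(int codePoint) {
        return Character.isWhitespace(codePoint) || Character.isSpaceChar(codePoint);
    }

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advance by one full code point
            if (isSeparator(cp)) {
                // Close the current token, if any, and skip the separator.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this rule, the string "foo", U+00A0, "bar" splits into two tokens, whereas a test based on Character.isWhitespace alone would keep it as one.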
Attachments
Issue Links
- duplicates: LUCENE-5096 WhitespaceTokenizer supports Java whitespace, should also support Unicode whitespace (Resolved)
- relates to: LUCENE-6879 Allow to define custom CharTokenizer using Java 8 Lambdas/Method references (Resolved)