Lucene - Core

LUCENE-6874: WhitespaceTokenizer should tokenize on NBSP

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      WhitespaceTokenizer uses Character.isWhitespace to decide what is whitespace. Here's a pertinent excerpt from its Javadoc:

      It is a Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR) but is not also a non-breaking space ('\u00A0', '\u2007', '\u202F')

      Perhaps Character.isWhitespace should have been called isLineBreakableWhitespace?

      I think WhitespaceTokenizer should tokenize on this. I am aware it's easy to work around but why leave this trap in by default?
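      For illustration, a throwaway snippet showing the behavior in question (NBSP is U+00A0):

        // Character.isWhitespace() deliberately excludes the non-breaking spaces,
        // so WhitespaceTokenizer keeps "foo\u00A0bar" as a single token.
        System.out.println(Character.isWhitespace(' '));       // true
        System.out.println(Character.isWhitespace('\u00A0'));  // false (NBSP)
        System.out.println(Character.isWhitespace('\u2007'));  // false (figure space)
        System.out.println(Character.isWhitespace('\u202F'));  // false (narrow NBSP)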

      1. icu-datasucker.patch
        2 kB
        Uwe Schindler
      2. LUCENE_6874_jflex.patch
        49 kB
        David Smiley
      3. LUCENE-6874.patch
        8 kB
        Uwe Schindler
      4. LUCENE-6874-chartokenizer.patch
        27 kB
        Uwe Schindler
      5. LUCENE-6874-chartokenizer.patch
        27 kB
        Uwe Schindler
      6. LUCENE-6874-chartokenizer.patch
        23 kB
        Uwe Schindler
      7. LUCENE-6874-jflex.patch
        56 kB
        Steve Rowe
      8. unicode-ws-tokenizer.patch
        13 kB
        Uwe Schindler
      9. unicode-ws-tokenizer.patch
        13 kB
        Uwe Schindler
      10. unicode-ws-tokenizer.patch
        10 kB
        Uwe Schindler

        Issue Links

          Activity

          Dawid Weiss added a comment -

          Depends what you consider a trap.

          A non-breakable whitespace could be a legitimate way to prevent two tokens from being separated if they need to be tokenized together. An example that comes to my mind is the special "zero-width" space or the hyphenation marker... which even on its own poses a problem [1]...

           Ultimately it should probably be a question of whether we want to tokenize on "whitespace as in formatted text" or "whitespace as in logical codepoint units", and it doesn't apply only to WhitespaceTokenizer but to any tokenizer in general.

          I think WhitespaceTokenizer should tokenize on this.

           Seems like the majority of people would want it to be tokenized, I agree. But if you change this then there is no way to go back to the previous behavior. Currently it's relatively easy to wrap your input in a reader that replaces those problematic codepoints on the fly before they're fed to the tokenizer.

          [1] https://www.cs.tut.fi/~jkorpela/shy.html
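           (A minimal, purely illustrative sketch of that reader-wrapping approach; the helper name and the "input" Reader are assumptions, not part of any patch here:)

             import java.io.FilterReader;
             import java.io.IOException;
             import java.io.Reader;

             /** Wraps a Reader so the non-breaking space variants come out as plain spaces. */
             static Reader replaceNbsp(Reader input) {
               return new FilterReader(input) {
                 @Override public int read() throws IOException {
                   int c = super.read();
                   return (c == '\u00A0' || c == '\u2007' || c == '\u202F') ? ' ' : c;
                 }
                 @Override public int read(char[] cbuf, int off, int len) throws IOException {
                   int n = super.read(cbuf, off, len);
                   for (int i = off; i < off + n; i++) {
                     if (cbuf[i] == '\u00A0' || cbuf[i] == '\u2007' || cbuf[i] == '\u202F') {
                       cbuf[i] = ' ';
                     }
                   }
                   return n;
                 }
               };
             }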

          Adrien Grand added a comment - edited

          So maybe we should solve this problem by adding some documentation?

          Dawid Weiss added a comment -

           Any improvement to the docs that clarifies what the software does would be great.

          Uwe Schindler added a comment -

          My personal opinion on this:

          • The thing is called WhitespaceTokenizer, so it should do what the name says (split on isWhitespace).
          • If we want something else, maybe provide a separate CharTokenizer implementation that also splits on NBSP

           In general, the whitespace tokenizer is not used for "classical" fulltext. For this type of text one would better use StandardTokenizer, ICU's tokenizers, or the language-specific ones for Chinese or Japanese. People using WhitespaceTokenizer are more those who have very special types of fields, like a list of whitespace-separated tokens used for faceting, or stuff like a list of product numbers. These types of tokens were always easy to handle with WhitespaceTokenizer. If you wanted to keep your facet tokens together, you were able to use NBSP! So a change here would be a break for those apps.

          So I would just update documentation to explain what this thing does (splitting on whitespace and not on spaces in general).

          Jack Krupansky added a comment - edited

          +1 for using the Unicode definition of white space rather than the (odd) Java definition. From a Solr user perspective, the fact that Java is used for implementation under the hood should be irrelevant. That said, the Javadoc for WhitespaceTokenizer#isTokenChar does explicitly refer to isWhitespace already.

          The term "non-breaking white space" explicitly refers to line breaking and has no mention of tokens in either Unicode or traditional casual usage.

          From a Solr user perspective, there is like zero value to having NBSP from HTML web pages being treated as if it were not traditional white space.

          From a Solr user perspective, the primary use of whitespace tokenizer is to avoid the fact that standard tokenizer breaks on various special characters such as occur in product numbers.

          One of the ongoing problems in the Solr community is the sheer amount of time spent explaining nuances and gotchas, even if they do happen to be documented somewhere in the fine print - no sane user reads the fine print anyway. No Solr user actually uses WhitespaceTokenizer directly - they reference WhitespaceTokenizerFactory, and then having to drop down to Lucene and Java for doc is way too much to ask a typical Solr user. Our collective goal should be to minimize nuances and gotchas (IMHO.)

          In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

          Ugh... there are plenty of other places in doc for other tokenizers and filters that refer to "whitespace" and need to address this same issue, either to treat NBSP as white space or doc the nuance/gotcha much more thoroughly and effectively.

          OTOH... an alternative view... having so many un/poorly-documented nuances and gotchas is money in the pockets of consultants and a great argument in favor of Solr users maximizing the employment of Solr consultants.

          Robert Muir added a comment -

           I don't think we should make yet another definition of whitespace for Java; there are already effectively 3 (Java isWhitespace(), Java isSpaceChar(), and Unicode whitespace). I think it would be better to just expose "Unicode whitespace" for situations like this.

           Java isSpaceChar is impractical, let's not even go there: it does not include controls such as tabs, so isSpaceChar('\t') == false and so on.

          unicode whitespace is probably more useful and already well-defined:

          http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html#isUWhiteSpace%28int%29
          http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:White_Space=Yes:]

          Uwe Schindler added a comment -

          In short, the benefits to Solr users for NBSP being tokenized as white space seem to outweigh any minor use cases for treating it as non-white space. A compatibility mode can be provided if those minor use cases are considered truly worthwhile.

           As said before: if we want to change this we need a new Tokenizer with a new name and a new factory. Please don't add new matchVersion constants for that because this is a huge break. The Tokenizer does what it should and what is documented: this is not a bug.

           And still this holds: users should prefer StandardTokenizer. The wide usage of WhitespaceTokenizer is caused by tons of example configs from earlier Solr days that use WhitespaceTokenizer together with the broken WordDestroyerFilter. This is indeed only useful for product numbers, not fulltext.

          David Smiley added a comment -

          So maybe we should solve this problem by adding some documentation?

           If the vast majority (like 90%+) of users that currently use WhitespaceTokenizer would want to tokenize on it, then I don't think documentation is sufficient at all. Documenting something most people would want to change is very, very easy to overlook. That's what I call a trap; not that there might be some uses for the current behavior. Lucene should do what most users want it to do by default. As Jack said, the users of the search platform don't care what Java's definition of Character.isWhitespace is.

          I propose WhitespaceTokenizerFactory have a flag for this, and that it default to consider NBSP a space based on Lucene's Version.

           I get Uwe's point that there are other Tokenizers. But I disagree that WhitespaceTokenizer shouldn't be used for "classical full text". For example, StandardTokenizer tokenizes on hyphens and thus foils some of the benefit of WordDelimiterFilter. Maybe ICUTokenizer is an answer; I haven't checked its interaction with WDF. But why can't we just have a tokenizer that simply tokenizes on all whitespace?

          I'll have to see the links Rob just posted; I haven't read them yet.

          Uwe Schindler added a comment -

          unicode whitespace is probably more useful and already well-defined

           I would solve this issue by adding an ICUWhitespaceTokenizer using this definition to the ICU module. Problem solved (it is in fact a small patch with a Tokenizer extending CharTokenizer, plus the factory).

          Robert Muir added a comment -

          You can add a CharTokenizer to ICU analysis module that just looks like this:

            protected boolean isTokenChar(int c) {
              return !UCharacter.isUWhiteSpace(c);
            }
          

          If you are not happy with it needing ICU library, the definition of this property in ICU is "Space characters+TAB+CR+LF-ZWSP-ZWNBSP" (http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#WHITE_SPACE) so it shouldn't be hard to implement with just the jdk, but I do not know about the efficiency of that.

          Either way, I think it should just be a different tokenizer.
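           (For reference, a fuller sketch of such a tokenizer, assuming Lucene 5.x's CharTokenizer base class in oal.analysis.util and ICU4J's UCharacter; the class name follows the one discussed in this thread:)

             import org.apache.lucene.analysis.util.CharTokenizer;
             import com.ibm.icu.lang.UCharacter;

             /** Whitespace tokenizer that splits on the Unicode White_Space property via ICU4J. */
             public final class ICUWhitespaceTokenizer extends CharTokenizer {
               /** A token character is anything that is not Unicode whitespace. */
               @Override
               protected boolean isTokenChar(int c) {
                 return !UCharacter.isUWhiteSpace(c);
               }
             }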

          Uwe Schindler added a comment -

          Robert Muir: I am already preparing a patch

          Uwe Schindler added a comment - edited

          Patch that adds the ICU variant of the tokenizer (it's just copypaste...). This is the only way to do this right: It conforms to Unicode and also uses Unicode's fast reference implementation.

          I also added a test.

          David Smiley added a comment -

          Uwe,
           I beg to differ on WDF, but I think we can put that behind us. It's great to see a solution come together on using ICU's rules.

           For something as trivial as detecting if the character is in a set, do we really need to depend on ICU? It would be so nice to not need it, even if its internal implementation seems fast. Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists? Anyone wanting atypical tokenization could easily subclass CharTokenizer or consider a MappingCharFilter.

          Robert Muir added a comment -

           I don't think we need to deprecate WhitespaceTokenizer; I think it's intuitive for a Java developer, and Lucene is a Java API.

           If you really must avoid ICU: as I already mentioned, you can probably implement it yourself, but I just don't think it will be as fast. You will probably have to precompute and cache for Latin-1 and all kinds of stuff to make it competitive, and it will be messier and so on.

          Uwe Schindler added a comment -

          Then we could consider deprecating WhitespaceTokenizer since, after all, why would one use it when ICUWhitespaceTokenizer exists?

           Because the non-breaking space is useful for stuff (as explained above) where you want to keep tokens together (the Unicode standard speaks about line wrapping, but in any case, like soft hyphen vs. hyphen, it's just a matter of what you want to do: the NBSP just tells the tokenizer, or line-breaker, or whatever you call it, to keep tokens together). The problem is people who misuse &nbsp; in their HTML (e.g. tables). But for stuff I have implemented very often, I used WhitespaceTokenizer to split tokens and placed a non-breaking space to keep tokens together.

           So there is no need to deprecate WhitespaceTokenizer. It does what it should do. ICUWhitespaceTokenizer uses the same naming and does the same, just with different rules.

          ... or consider a MappingCharFilter

           This thing is slow as hell. If you want it faster, use e.g. PatternTokenizer.

          It would be so nice to not need it, even if it's internal implementation seems fast

           The problem is that you need additional 4-way branching: you have to check in isTokenChar() that the character is not isWhitespace() and also exclude all those 3 chars we listed in the description: '\u00A0', '\u2007', '\u202F'.
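           (A minimal illustrative sketch of that branching in plain JDK code, assuming a CharTokenizer subclass; note it would still miss U+0085, as pointed out later in this thread:)

             @Override
             protected boolean isTokenChar(int c) {
               // not Java whitespace, and not one of the non-breaking spaces that Character.isWhitespace() excludes
               return !(Character.isWhitespace(c)
                   || c == '\u00A0' || c == '\u2007' || c == '\u202F');
             }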

           I agree with Robert: we should not change the default WhitespaceTokenizer and also not deprecate it. We should add a new one, which I did in the supplied patch. If we want it in core, let's call it something different and implement isTokenChar in a fast way without the 3 additional branches.

          I beg to differ on WDF

           This is coming from the fact that Solr is often misused because users just give up on thinking about tokenization. WDF only makes sense in product catalogues, but it is definitely broken for fulltext. The product catalogues are of course some of our customers, but before I suggest to them that they should use WhitespaceTokenizer with WordDestroyerFilter, I would analyze their root problem (why is their tokenization broken). This is why I am against the broken example configs in Solr we had in the past. Because WST and WDF should really only be used as a last resort.

          Steve Rowe added a comment -

          A JFlex version would be fast and simple and not require ICU to keep up with Unicode changes. Not sure about needing to cache Latin-1 and other stuff to be competitive. I'll give it a go later today.

          Uwe Schindler added a comment -

          For those people that tend to think that StandardTokenizer is too aggressive, I generally suggest to my users to use ClassicTokenizer if they know that they only have European languages in their index. This is one reason why I was always against removing/deprecating ClassicTokenizer! It does a really good job for plain european text and also keeps most product numbers together (because it has quite good heuristics). It only breaks product numbers with slashes (e.g., "PS-10/20").

          Uwe Schindler added a comment - edited

           One thing to make it fully flexible in Lucene trunk (Java 8 only). I know this would not help Solr users that want to define the Tokenizer in a config file, but for real Lucene users the Java 8 way would be the following static method on CharTokenizer:

          public static CharTokenizer fromPredicate(java.util.function.IntPredicate predicate)
          

          This would allow to define a new CharTokenizer with a single line statement using any predicate:

           // long variant with lambda:
           Tokenizer tok = CharTokenizer.fromPredicate(c -> !UCharacter.isUWhiteSpace(c));

           // method reference (cast to IntPredicate so negate() can be applied):
           Tokenizer tok = CharTokenizer.fromPredicate(((IntPredicate) UCharacter::isUWhiteSpace).negate());

           // method reference to a custom function:
           private boolean myTestFunction(int c) {
             return (crazy condition);
           }

           Tokenizer tok = CharTokenizer.fromPredicate(this::myTestFunction);
          

           I think we should do this in a separate issue in Lucene trunk for Java 8. This is really what Java 8 lambdas are made for. And it's fast as hell, because it's compiled to bytecode so there is no call overhead.

          Uwe Schindler added a comment -

          I opened LUCENE-6879 for the idea (which is related to this issue as it provides a simple and general way suitable for Java 8).

          Steve Rowe added a comment -

           Patch adding a JFlex-based UnicodeWhitespaceTokenizer(Factory), along with some performance testing bits, some of which aren't committable (hard-coded paths). Also includes the SPI entry missing from Uwe's patch for ICUWhitespaceTokenizerFactory in lucene/icu/src/resources/META-INF/services/o.a.l.analysis.util.TokenizerFactory, as well as a couple of bugfixes for lucene/benchmark (which I'll commit under a separate JIRA). The patch also includes all of Uwe's patch.

          I did three performance comparisons on my Macbook Pro with Oracle Java 1.8.0_20 of the Character.isWhitespace()-based WhitespaceTokenizer, Uwe's ICUWhitespaceTokenizer, and the JFlex UnicodeWhitespaceTokenizer:

          1. Using the wstok.alg in the patch, I ran lucene/benchmark over 20k (English news) Reuters docs: dropping the lowest throughput of 5 rounds and averaging the other 4:

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          1.515M        N/A
           ICUWhitespaceTokenizer       1.447M        -5.5%
           UnicodeWhitespaceTokenizer   1.514M        -0.1%

           2. I concatenated all ~20k Reuters docs into one file, loaded it into memory and then ran each tokenizer over it 11 times, discarding info from the first and averaging the other 10 (this is testReuters() in the Test* files in the patch):

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          14.47M        N/A
           ICUWhitespaceTokenizer       9.26M         -36%
           UnicodeWhitespaceTokenizer   11.60M        -20%

          3. I used a fixed random seed and generated 10k random Unicode strings of at most 10k chars using TestUtil.randomUnicodeString(). Note that this is non-realistic data for tokenization, not least because the average whitespace density is very low compared to natural language. In running this test I noticed that WhitespaceTokenizer was returning many more tokens than the other two, and I tracked it down to differences in the definition of whitespace:

           • Character.isWhitespace() returns true for the following while Unicode 6.3.0 (Lucene's current Unicode version) does not: U+001C, U+001D, U+001E, U+001F, U+180E. (U+180E was removed from Unicode's whitespace definition in Unicode 6.3.0. Java 8 uses Unicode 6.2.0.)
          • Unicode 6.3.0 says the following are whitespace while Character.isWhitespace() does not: U+0085, U+00A0, U+2007, U+202F. The last 3 are documented, but U+0085 NEXT LINE (NEL) isn't documented anywhere I can see; it was added to Unicode's whitespace definition in Unicode 3.0 (released 2001).

          So in order to be able to directly compare the performance of three tokenizers over this data, I replaced all non-consensus whitespace characters with a space before running the test.

           Tokenizer                    Avg tok/sec   Throughput compared to WhitespaceTokenizer
           WhitespaceTokenizer          897k          N/A
           ICUWhitespaceTokenizer       880k          -2%
           UnicodeWhitespaceTokenizer   1,605k        +79%

          One other thing I noticed for this test when I compared ICUWhitespaceTokenizer's output with that of UnicodeWhitespaceTokenizer's: they don't always find the same break points. This is because although both forcibly break at the max token length (255 chars, fixed for CharTokenizer and the default for Lucene's JFlex scanners), CharTokenizer allows tokens to exceed its max token char length of 255 by one char when a surrogate pair would otherwise be broken, while Lucene's JFlex scanners break at 254 chars in this case.


          Conclusion: for throughput over realistic ASCII data, the original WhitespaceTokenizer performs best, followed by the JFlex-based tokenizer in this patch (UnicodeWhitespaceTokenizer), followed by the ICU-based ICUWhitespaceTokenizer in Uwe's patch.

          Uwe Schindler added a comment -

           Thanks Steve! I just noticed that your patch contains hardcoded filenames of your local system; I think those are just leftovers from your testing with Reuters. Otherwise I am not happy about the size of the generated files, but that's how JFlex works...

           Sorry for forgetting to add the analyzer factory, I was just too fast in copy-pasting code yesterday. Thanks for adding it.

          Steve Rowe added a comment -

          I just noticed that your patch contains hardcoded filenames of your local system, I think those are just leftovers from your testing with reuters.

          Yup, patch needs cleanup before it can be committed. I figured the decision about what to do hasn't been made yet, so I'll wait on doing that work until then.

          Otherwise I am not happy about the size of the generated files, but thats how jflex works...

          I don't think the generated Java source makes much difference - the thing people will deal with is the JFlex source, and it's fairly compact. I looked at the .class file sizes on my system, and I see 13k for the JFlex version and 2k for the ICU version.

          David Smiley added a comment -

          Nice thorough job Steve!

          I propose that we consolidate the TokenizerFactories here into one – the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select which whitespace definition the user wants: "java" or "unicode". What do you think?

          Steve Rowe added a comment -

          I propose that we consolidate the TokenizerFactories here into one – the existing WhitespaceTokenizerFactory. I think this is more user friendly. An attribute could select which whitespace definition the user wants: "java" or "unicode". What do you think?

          Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common.

           I'm okay with adding a param to WhitespaceTokenizerFactory (not sure what to name it though: "authority"/"style"/"definition"?). Since the default wouldn't change ("java" would be the default I assume), I don't think we need to introduce luceneMatchVersion.

          Jack Krupansky added a comment -

          Certainly Solr can update its example schemas to use whatever alternative tokenizer or option is decided on so that Solr users, many of whom are not Java developers, will no longer fall into this NBSP trap, but... that still feels like a less than desirable resolution.

          Uwe Schindler, could you elaborate more specifically on the existing use case that you are trying to preserve? I mean, like in terms of a real-world example. Where do some of your NBSPs actually live in the wild?

          It seems to me that the vast majority of normal users would not be negatively impacted by having "white space" be defined using the Unicode model. I never objected to using the Java model, but that's because I had overlooked this nuance of NBSP. My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization.

          Steve Rowe added a comment - edited

          My concern for Solr users is that NBSP occurs somewhat commonly in HTML web pages - as a formatting technique more than an attempt at influencing tokenization.

           FYI, &nbsp; is converted to U+0020 by HTMLStripCharFilter.

          Jack Krupansky added a comment -

           Tika is the other (main?) approach to ingesting text from HTML web pages. I haven't checked exactly what it does on &nbsp;.

          Maybe David Smiley could elaborate on which use case he was encountering that inspired this Jira issue.

          Jack Krupansky added a comment -

          Because WST and WDF should really only be used as a last resort.

           Absolutely agreed. From a Solr user perspective we really need a much simpler model for semi-standard tokens out of the box, without the user having to scratch their head and resort to WST in the first (last) place. LOL - maybe if we could eliminate this need to resort to WST, we wouldn't have to fret as much about WST.

          I generally suggest to my users to use ClassicTokenizer

           Personally, I've always refrained from recommending CT since I thought ST was supposed to replace it and that the email and URL support was considered an excess not worth keeping. I've considered CT as if it were deprecated (which it is not.) And, I never see anybody else recommending it on the user list. And, the fact that it can't handle slashes in product numbers is a deal killer. I'm not sure that I would argue in favor of resurrecting CT as a first-class recommendation, especially since it can't handle non-European languages, but...

          That said, I do think it is worth separately (from this Jira) considering a fresh, new tokenizer that starts with the goodness of ST and adds in an approximation of the reasons that people resort to WST. Whether that can be an option on ST or has to be a separate tokenizer would need to be debated. I'd prefer an option on ST, either to simply allow embedded special characters or to specify a list or regex of special character to be allowed or excluded.

          People would still need to combine NewT with WDF, but at least the tokenization would be more explicit.

           Personally I would prefer to see an option for whether to retain or strip external punctuation vs. embedded special characters. Trailing periods and commas and colons and enclosing parentheses are just the kinds of things we had to resort to WDF for when using WST to retain embedded special characters.

          And if people really want to be ambitious, a totally new tokenizer that subsumed the good parts of WDF would make a lot of lives of Solr users much easier.

          David Smiley added a comment -

          Jack: My use-case since you asked: I've got a document store of content in XML that provides various markup around mostly text. These documents occasionally have an NBSP. I process it outside of Solr to produce the text I want indexed/stored – it's not XML any more. An NBSP entity, if found, is converted to the NBSP character naturally as part of Java's XML libraries (no explicit decision on my part).

          Implicitly then, you're nixing ICUWhitespaceTokenizer, since it can't be in analyzers-common.

          Right; ah well.

          RE what to name the attribute: I suggest "definition" or even better: "rule" (or "ruleset")

           I do think the first sentence of the javadocs for these whitespace tokenizers should point out which definition of whitespace is chosen. And that they reference each other, so that anyone stumbling on one will know of the other.

           RE WDF: I prefer WhitespaceTokenizer with WDF for not just product-id data but also full-text. Full-text might contain product-ids, or have things like "wi-fi" and many other words, like say "thread-safe" or "co-worker", that are sometimes hyphenated, sometimes not; some of these might be space-separated; etc. WDF is very flexible, but if you use a Tokenizer like Standard* or Classic* then the hyphen will be pre-tokenized before WDF can do its thing, neutering part of its benefit. I wish WDF kept payloads and other attributes; but it's not the only offender here, and likewise for the bigger issue of positionLength. Otherwise I'm a WDF fan. Nonetheless I like some of Jack's ideas on a better tokenizer that subsumes WDF.

           BTW, FWIW, if I had to write a WhitespaceTokenizer from scratch, I'd implement it as a bitset for characters < 65k (this is 8KB of memory). For the remainder I'd use an array that is scanned; but it appears there are none beyond 65k as I look at a table of these chars from a quick Google search. Then a configurable definition loader could fill named whitespace rules, and it might be configurable to add or remove certain codes. But no need to bother; Steve's impl is fine.
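           (A rough, purely illustrative sketch of that idea: precompute the Unicode White_Space code points for the BMP into a java.util.BitSet; the class name is hypothetical and the seed list below is the White_Space set as of Unicode 6.3, which would normally be generated from the Unicode data files:)

             import java.util.BitSet;

             final class WhitespaceBitSet {
               private static final BitSet WS = new BitSet(0x10000);
               static {
                 // Unicode 6.3 White_Space code points; all of them are in the BMP
                 for (int c : new int[] {0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x20, 0x85, 0xA0, 0x1680,
                     0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
                     0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000}) {
                   WS.set(c);
                 }
               }
               static boolean isWhitespace(int c) {
                 return c < 0x10000 && WS.get(c);  // nothing above the BMP is whitespace
               }
             }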

          Yonik Seeley added a comment -

          I'd implement it as a bitset for characters < 65k

           A single word used as a bitmask can be useful for quickly ruling out whitespace and checking for common whitespace. Example:
          https://github.com/yonik/noggit/blob/master/src/main/java/org/noggit/JSONParser.java#L241

          David Smiley added a comment -

          Here's an updated patch working from Steve's:

          • The existing WhitespaceTokenizerFactory can be configured via the "rule" parameter to use the "unicode" rule or the "java" (default) rule. The maxTokenLength parameter is now here too. If you use the "java" rule then maxTokenLength, if specified, is only permitted to be 255. The UnicodeWhitespaceTokenizerFactory was removed since it's now combined.
            • added a simple testFactory test for this factory
          • Tweaked the javadocs as I mentioned.
          • Removed some of the test methods Steve added that were actually not tests but performance measurements (that also wrote to stderr).
          • Resolved various pre-commit issues (ASL header, svn props)

          If I hear no more feedback then I plan to commit Tuesday night (EST)
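           (A hypothetical usage sketch of the proposed "rule" parameter described in the list above, assuming the factory and argument names from this patch:)

             import java.util.HashMap;
             import java.util.Map;
             import org.apache.lucene.analysis.Tokenizer;
             import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;

             Map<String, String> args = new HashMap<>();
             args.put("rule", "unicode");  // or "java" (the default) for Character.isWhitespace() behavior
             Tokenizer tok = new WhitespaceTokenizerFactory(args).create();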

          Robert Muir added a comment -

          The shared factory is confusing: this is supposed to be a simple thing. Now we have some parameters that depend on other parameters and so on. Please, make two factories. We have to be able to maintain this stuff.

          Why is the ICUWhitespace being added?

          David Smiley added a comment -

          Sorry, I really disagree with you on this. I don't think this WhitespaceTokenizerFactory is hard to maintain at all. It's true that it's harder only because it was a trivial factory before but so what? Most importantly, I think it's a better user experience – nobody should care what the specific Java Tokenizer implementation class will be coming out of the factory – it's a tokenizer on whitespace using whatever definition/rule of whitespace they configured. That could hypothetically be implemented using one Java Tokenizer implementing class or multiple but that's an implementation detail.

          Why is the ICUWhitespace being added?

          I'll remove that in a new patch; I wasn't sure what to do but it's redundant so no need for it.

          Uwe Schindler added a comment -

          Yeah remove it! LUCENE-6879 is enough to quickly define your own WhitespaceTokenizer with ICU, if you want.

          Adrien Grand added a comment -

          I tend to like Uwe's idea. I have often wondered what the actual use-cases of WhitespaceTokenizer were, but did not suggest removing it since the cost of maintenance was very low given its simplicity. However, now that some controversy is arising, and given how simple it is to create character-based tokenizers in trunk (Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);), maybe we should just remove this tokenizer and let users define it themselves with the more flexible CharTokenizer.fromSeparatorCharPredicate?
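
          Spelled out a bit (a minimal sketch of the trunk API referenced above; the sample input is only illustrative):

             import java.io.StringReader;
             import org.apache.lucene.analysis.Tokenizer;
             import org.apache.lucene.analysis.util.CharTokenizer;

             // Same behavior as today's WhitespaceTokenizer: U+00A0 is NOT a separator here,
             // because Character.isWhitespace() excludes non-breaking spaces.
             Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(Character::isWhitespace);
             tok.setReader(new StringReader("foo\u00A0bar baz"));  // tokens: "foo\u00A0bar", "baz"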

          David Smiley added a comment -

          Just for clarification, Adrien, are you suggesting that WhitespaceTokenizerFactory stays, but WhitespaceTokenizer gets removed (because it's easy to define one)? I'm +1 to that.

          Uwe Schindler added a comment -

          I would be fine to remove WhitespaceTokenizer in Lucene trunk. In that case I would also like to move the abstract CharTokenizer class out of oal.analysis.util to oal.analysis.core. This is not a big deal, but the util package is not the right place for a first-class citizen.

          I also have another idea about this issue: I would prefer not to have the large Java code generated by JFlex involved. Wouldn't it be possible to save the isWhitespace data of Unicode in a compressible Lucene bitset and save it to disk as a resource file? We could then load the bitset (like deleted documents) from a resource file and wrap a simple CharTokenizer.fromSeparatorCharPredicate(c -> compressedBitset.get(c)) on top? The bitset could be generated from Unicode data on "ant regenerate".
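
          A minimal sketch of that idea, assuming a hypothetical loadWhitespaceBitSet() helper that deserializes the generated data from a resource file (none of these names come from an actual patch):

             // Hypothetical: reads the pre-generated Unicode whitespace data from a resource file.
             java.util.BitSet whitespace = loadWhitespaceBitSet();
             // BitSet.get(int) matches the IntPredicate expected by CharTokenizer.
             Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(whitespace::get);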

          David Smiley added a comment -

          Uwe,
          Why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process.

          On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch with the "rule" attribute to pick?

          Uwe Schindler added a comment -

          Why persist the bitset and deal with the build issues around that when instead it could be done in-memory in a static initializer? It's so cheap to build that I question the effort in pre-building it as part of the build process.

          Where would the information for the bitset come from? The Unicode data as exposed by the java.lang.Character class is wrong! My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping the Unicode data file would be a large overhead.

          On 5.x, Uwe, Adrien, how do you feel about the WhitespaceTokenizerFactory I have in the patch with the "rule" attribute to pick?

          I am not happy, but could live with it.

          Steve Rowe added a comment -

          My idea was to use a Unicode data file and extract all Whitespace characters in a build tool. Shipping the Unicode data file would be a large overhead.

          The JFlex project has a similar requirement, but for many more properties than just Whitespace. JFlex includes a Maven plugin used by the build that parses Unicode data files via (you guessed it) JFlex scanners - here's the JFlex spec for the parser for binary property data files, including PropList.txt, which holds the Whitespace property definition: https://github.com/jflex-de/jflex/blob/master/jflex-unicode-maven-plugin/src/main/jflex/BinaryPropertiesFileScanner.flex

          Note: Unicode property names can have aliases, and "loose" matching is the recommended way to refer to them (see http://unicode.org/reports/tr18/#Categories ): match case-insensitively, and ignore whitespace, dashes, and underscores. PropList.txt gives the Whitespace property name as White_Space, and PropertyAliases.txt lists WSpace and space as aliases.
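
          For illustration, "loose" matching boils down to normalizing both names before comparing them (an illustrative helper only, not JFlex or Lucene code; aliases such as WSpace still need PropertyAliases.txt):

             static String loosen(String propertyName) {
               // case-insensitive; whitespace, dashes, and underscores are ignored
               return propertyName.toLowerCase(java.util.Locale.ROOT).replaceAll("[\\s_\\-]", "");
             }
             // loosen("White_Space").equals(loosen("whitespace"))   -> true
             // loosen("White_Space").equals(loosen("white-space"))  -> true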

          Uwe Schindler added a comment - edited

          Cool!

          So my idea would be to write a small tool in the analysis/commons/src/tools module that uses the current JFlex JAR file on the classpath and extracts the values for the bitset from it by accessing this class: https://github.com/jflex-de/jflex/blob/master/jflex/src/main/java/jflex/unicode/data/Unicode_6_3.java

          In my opinion this would be more elegant than creating a huge JFlex UnicodeWhitespaceTokenizer in a separate submodule, if we could just use a code generator that produces a BitSet to be used in a CharTokenizer subclass. CharTokenizer is thoroughly tested, so just feeding it with a bitset would be my preference.

          Would this work?

          EDIT: I would let a groovy script do this, no java code needed. Just run <groovy classpath="jflex.jar">....</groovy> and write the result to the UnicodeCharTokenizer file.

          Steve Rowe added a comment -

          Would this work?

          Yes, but I think ICU4J is more authoritative, and already effectively pins the Unicode version used in Lucene, so I'd recommend going with it instead of JFlex to extract characters with property X.
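
          A rough sketch of what "extract characters with property X" could look like with ICU4J (only the ICU4J calls are the point here; the output format is illustrative):

             import com.ibm.icu.lang.UCharacter;
             import com.ibm.icu.lang.UProperty;

             public class DumpWhitespace {
               public static void main(String[] args) {
                 System.out.println("Unicode version: " + UCharacter.getUnicodeVersion());
                 for (int cp = 0; cp <= 0x10FFFF; cp++) {
                   if (UCharacter.hasBinaryProperty(cp, UProperty.WHITE_SPACE)) {
                     System.out.printf("U+%04X%n", cp);  // every code point with the White_Space property
                   }
                 }
               }
             }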

          Uwe Schindler added a comment -

          My idea was to create the whitespace chars as int[] array for every unicode version and allow WhitespaceTokenizer to specify the version. As we have the data files already, the same groovy code against jflex data (or ICU4J) could be used to build LetterTokenizer, too?

          This would also allow backwards compatibility. By default, WhitespaceTokenizer would use the latest Unicode version, unless you specify a version (as an enum constant).

          Uwe Schindler added a comment -

          ...hacking Groovy script using ICU4J as specified in ivy-versions.properties...

          Steve Rowe added a comment -

          My idea was to create the whitespace chars as int[] array for every unicode version

          Each ICU4J release only has data for the (single) Unicode version it was built against, so it won't work for this purpose.

          Uwe Schindler added a comment -

          Here is my ICU datasucker. The lines with System.println should now just write the class file

          Call with: ant unicode-tokenizers

          Uwe Schindler added a comment -

          New simplified version, now printing the Unicode version (7.0 at the moment, with ICU4J as specified in the versions file).

          Uwe Schindler added a comment -

          Sorry, my fault: it must be UCharacter.isUWhiteSpace(); the result is then:

             [groovy] Unicode version: 7.0.0.0
             [groovy] Whitespace: 9, 10, 11, 12, 13, 32, 133, 160, 5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287, 12288
          
          Steve Rowe added a comment -

          Uwe, you're using UCharacter.isWhitespace(), but that's the same as the problematic Java Character.isWhitespace() – note the exclusion of U+00A0 in your output Whitespace char list. http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UCharacter.html says what you want is isUWhiteSpace(c) or hasBinaryProperty(c, UProperty.WHITE_SPACE)
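
          Concretely, for U+00A0 (NO-BREAK SPACE), the difference is:

             Character.isWhitespace('\u00A0');                     // false - Java excludes non-breaking spaces
             com.ibm.icu.lang.UCharacter.isUWhiteSpace('\u00A0');  // true  - Unicode White_Space property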

          Uwe Schindler added a comment -

          Sorry, I updated my post; I noticed this a minute ago.

          Uwe Schindler added a comment -

          Now the right patch.

          Uwe Schindler added a comment - edited

          Here is my patch with the UnicodeWhitespaceTokenizer, small & elegant.

          I have not added tests; I think we can take those from Steve Rowe. I did not check performance either. It is time for bed now.

          To regenerate, use ant regenerate or alternatively ant unicode-data.

          Uwe Schindler added a comment -

          Minor changes

          David Smiley added a comment - edited

          +1 I like it, Uwe; nice job. Automating the generation of the Unicode code points is good, and may come in handy if we want other whitespace rules/definitions.

          Uwe Schindler added a comment -

          New patch:

          • Fix formatting (use hex) and line breaks in generator
          • Move the bounds check to the returned Bitset impl
          • Move UnicodeData to oal.analysis.util package

          No tests yet; I have to copy them from Steve Rowe's patch later.

          Uwe Schindler added a comment -

          Cleanup.

          In trunk you can now create the UnicodeWhitespaceTokenizer with the following code:

          CharTokenizer.fromSeparatorCharPredicate(UnicodeProps.WHITESPACE::get);
          

          So the extra class is theoretically obsolete, but it is provided for convenience (and is still required for 5.x).

          Uwe Schindler added a comment -

          LUCENE-6874-chartokenizer.patch is now the merged patch:

          • Contains the WhitespaceTokenizerFactory from David Smiley, but without maxTokenLength (it was specific to the JFlex implementation)
            • Adds the tests by Steve Rowe
          • The ICUWhitespaceTokenizer is gone (as Robert Muir requested)
          • Modifies the ALG file to exclude this tokenizer. I have not yet tested the ALG file, and I have not done a performance test yet either.

          I like the new solution more than the JFlex-based tokenizer, because it is smaller and maybe also faster (not yet tested). Using JFlex for this is not really a good idea, because the tokenization algorithm is very simple. CharTokenizer, which is thoroughly tested, is the better match for it.

          Uwe Schindler added a comment -

          Adds more tests (random strings) and adds UnicodeWhitespaceAnalyzer (for consistency & to test).

          Uwe Schindler added a comment - edited

          Here is the output of the reuters test:

          ------------> Report Sum By (any) Name and Round (28 about 33 out of 34)
          Operation   round   runCnt   recsPerRun   rec/s   elapsedSec   avgUsedMem   avgTotalMem
          AnalyzerFactory(name:WhitespaceTokenizer,WhitespaceTokenizer(rule:java))   0   1   0   0.00   0.00   9,569,344   124,256,256
          AnalyzerFactory(name:UnicodeWhitespaceTokenizer,WhitespaceTokenizer(rule:unicode))   0   1   0   0.00   0.00   9,569,344   124,256,256
          Rounds_5   0   1   24493540   360,841.19   67.88   16,566,472   124,256,256
          NewAnalyzer(WhitespaceTokenizer)   0   1   0   0.00   0.00   9,569,344   124,256,256
          [Character.isWhitespace()] WhitespaceTokenizer   0   1   2449354   331,038.53   7.40   22,121,256   124,256,256
          Seq_20000   0   2   2449354   344,131.22   14.23   22,121,256   118,489,088
          NewAnalyzer(UnicodeWhitespaceTokenizer)   0   1   0   0.00   0.00   22,121,256   112,721,920
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   0   1   2449354   358,302.22   6.84   22,121,256   112,721,920
          NewAnalyzer(WhitespaceTokenizer)   1   1   0   0.00   0.00   12,138,024   112,721,920
          [Character.isWhitespace()] WhitespaceTokenizer   1   1   2449354   366,724.66   6.68   22,374,536   112,721,920
          Seq_20000   1   2   2449354   365,139.25   13.42   27,477,352   117,702,656
          NewAnalyzer(UnicodeWhitespaceTokenizer)   1   1   0   0.00   0.00   22,374,536   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   1   1   2449354   363,567.47   6.74   32,580,168   122,683,392
          NewAnalyzer(WhitespaceTokenizer)   2   1   0   0.00   0.00   32,580,168   122,683,392
          [Character.isWhitespace()] WhitespaceTokenizer   2   1   2449354   365,793.59   6.70   33,461,280   122,683,392
          Seq_20000   2   2   2449354   365,112.03   13.42   33,461,280   117,178,368
          NewAnalyzer(UnicodeWhitespaceTokenizer)   2   1   0   0.00   0.00   33,461,280   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   2   1   2449354   364,432.97   6.72   33,461,280   111,673,344
          NewAnalyzer(WhitespaceTokenizer)   3   1   0   0.00   0.00   10,836,464   111,673,344
          [Character.isWhitespace()] WhitespaceTokenizer   3   1   2449354   367,660.47   6.66   12,451,400   111,673,344
          Seq_20000   3   2   2449354   365,820.94   13.39   13,235,672   111,673,344
          NewAnalyzer(UnicodeWhitespaceTokenizer)   3   1   0   0.00   0.00   12,451,400   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   3   1   2449354   363,999.69   6.73   14,019,944   111,673,344
          NewAnalyzer(WhitespaceTokenizer)   4   1   0   0.00   0.00   14,019,944   111,673,344
          [Character.isWhitespace()] WhitespaceTokenizer   4   1   2449354   367,329.62   6.67   15,061,368   111,673,344
          Seq_20000   4   2   2449354   365,057.59   13.42   15,813,920   111,673,344
          NewAnalyzer(UnicodeWhitespaceTokenizer)   4   1   0   0.00   0.00   15,061,368   111,673,344
          [UnicodeProps.WHITESPACE.get()] UnicodeWhitespaceTokenizer   4   1   2449354   362,813.50   6.75   16,566,472   111,673,344
          

          As you can see, both tokenizers run at almost the same speed. I was a little afraid that SparseFixedBitSet could slow things down, but it behaves very nicely and has a small memory footprint.

          Uwe Schindler added a comment -

          Cleanup of the benchmark. To me it looks OK. Comments from others?

          David Smiley added a comment -

          +1 Patch is good Uwe.

          Uwe Schindler added a comment -

          If nobody objects, I will commit this tomorrow.

          ASF subversion and git services added a comment -

          Commit 1714354 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1714354 ]

          LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace

          ASF subversion and git services added a comment -

          Commit 1714355 from Uwe Schindler in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1714355 ]

          Merged revision(s) 1714354 from lucene/dev/trunk:
          LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common that uses Unicode character properties extracted from ICU4J to tokenize text on whitespace

          Uwe Schindler added a comment -

          Thanks for the fruitful discussions! I hope Steve Rowe is not unhappy that we did not use JFlex for this simple case and instead used Unicode data with the already existing CharTokenizer.

          Steve Rowe added a comment -

          Thanks for the fruitful discussions! I hope Steve Rowe is not unhappy that we did not use JFlex for this simple case and instead used Unicode data with the already existing CharTokenizer.

          No worries, +1 to your work Uwe. You've also laid the groundwork for future simple Unicode property-based char tokenizers, which is nice. Thanks!


            People

            • Assignee:
              Uwe Schindler
            • Reporter:
              David Smiley
            • Votes:
              0
            • Watchers:
              7
