[LUCENE-5447] StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 4.6.1
Fix Version/s: 4.7, 6.0
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

StandardTokenizer should split each of the following sequences into two tokens, but currently keeps each one intact and outputs it as a single token:

"A::B"           (':' is in \p{Word_Break = MidLetter})
"1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
"A.:B"
"A:.B"
"1,,2"           (',' is in \p{Word_Break = MidNum})
"1,.2"
"1.,2"

Unfortunately, these cases are not covered by the word break test data released with Unicode (e.g. for Unicode 6.3: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt), which is incorporated into versioned Lucene tests such as WordBreakTestUnicode_6_3_0.
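
The expected behavior for the sequences listed in this description can be illustrated with a minimal, self-contained sketch. This is not Lucene's StandardTokenizer or its generated scanner; the class `WordBreakSketch` and its methods are hypothetical names, and the property assignments are restricted to the three ASCII characters named above (':' as MidLetter, ',' as MidNum, '.' as MidNumLet). It assumes the simplified reading that a mid character joins two tokens only when it stands alone between two letters or two digits, so any run of two or more mid characters forces a break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; ASCII-only and far simpler than the real word break rules.
class WordBreakSketch {
    // Property assignments taken from the examples in this issue.
    static boolean isMidLetter(char c) { return c == ':'; }
    static boolean isMidNum(char c)    { return c == ','; }
    static boolean isMidNumLet(char c) { return c == '.'; }
    static boolean isLetter(char c)    { return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z'); }
    static boolean isDigit(char c)     { return c >= '0' && c <= '9'; }

    // True when a single mid character may stay inside a token:
    // letter (MidLetter | MidNumLet) letter, or digit (MidNum | MidNumLet) digit.
    static boolean joins(char before, char mid, char after) {
        boolean betweenLetters = isLetter(before) && isLetter(after)
                && (isMidLetter(mid) || isMidNumLet(mid));
        boolean betweenDigits = isDigit(before) && isDigit(after)
                && (isMidNum(mid) || isMidNumLet(mid));
        return betweenLetters || betweenDigits;
    }

    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLetter(c) || isDigit(c)) {
                current.append(c);
            } else if (current.length() > 0 && i + 1 < s.length()
                    && joins(s.charAt(i - 1), c, s.charAt(i + 1))) {
                // Exactly one mid character flanked by matching neighbors: keep it in the token.
                current.append(c);
            } else {
                // Anything else, including the first of two consecutive mid characters, ends the token.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {"A:B", "A::B", "1..2", "A..B", "A.:B", "A:.B", "1,,2", "1,.2", "1.,2"};
        for (String input : inputs) {
            System.out.println(input + " -> " + tokenize(input));
        }
    }
}
```

Under these assumptions, "A:B" stays a single token while each of the sequences listed above splits into two, which is the behavior this issue asks StandardTokenizer to adopt.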

Attachments

- LUCENE-5447.patch, 19/Feb/14 01:20, 977 kB, Steven Rowe
- LUCENE-5447.patch, 19/Feb/14 00:43, 974 kB, Steven Rowe
- LUCENE-5447-take2.patch, 19/Feb/14 17:03, 43 kB, Steven Rowe
- LUCENE-5447-test.patch, 14/Feb/14 20:58, 2 kB, Steven Rowe

People

Assignee: Steven Rowe

Reporter: Steven Rowe

Votes: 0

Watchers: 3

Dates

Created: 14/Feb/14 20:55

Updated: 28/Aug/22 14:00

Resolved: 19/Feb/14 17:59