- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 4.6.1
- Component/s: modules/analysis
- Labels: None
- Lucene Fields: New, Patch Available

StandardTokenizer should split all of the following sequences into two tokens each, but they are all instead kept intact and output as single tokens:
- "A::B" (':' is in \p{Word_Break = MidLetter})
- "1..2", "A..B" ('.' is in \p{Word_Break = MidNumLet})
- "A.:B"
- "A:.B"
- "1,,2" (',' is in \p{Word_Break = MidNum})
- "1,.2"
- "1.,2"

Unfortunately, the word break test data released with Unicode (e.g. for Unicode 6.3: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt), which is incorporated into versioned Lucene tests (e.g. WordBreakTestUnicode_6_3_0), doesn't cover these cases.
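
To illustrate why each sequence above should yield two tokens, here is a minimal Java sketch of the relevant UAX #29 rules (WB5 to WB12) as I understand them: a single MidLetter/MidNumLet character may join two letters, and a single MidNum/MidNumLet character may join two digits, so two consecutive "mid" characters always force a break. This is not Lucene's StandardTokenizer (which is generated from a JFlex grammar); the class `WordBreakSketch` and its methods are hypothetical, and the character classes are reduced to ':', '.', and ',' for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration of UAX #29 rules WB5-WB12.
// Assumptions: only ':' (MidLetter), '.' (MidNumLet) and ',' (MidNum)
// are treated as "mid" characters; surrogate pairs are ignored.
class WordBreakSketch {

    // True if `mid` may join `prev` and `next` without a word break.
    static boolean bridges(char prev, char mid, char next) {
        boolean letters = Character.isLetter(prev) && Character.isLetter(next);
        boolean digits = Character.isDigit(prev) && Character.isDigit(next);
        // WB6/WB7: letter x (MidLetter | MidNumLet) x letter
        if (letters && (mid == ':' || mid == '.')) {
            return true;
        }
        // WB11/WB12: digit x (MidNum | MidNumLet) x digit
        return digits && (mid == ',' || mid == '.');
    }

    // Splits text into alphanumeric tokens; everything else is dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int n = text.length();
        int i = 0;
        while (i < n) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip characters that cannot start a token
                continue;
            }
            int start = i++;
            while (i < n) {
                char c = text.charAt(i);
                if (Character.isLetterOrDigit(c)) {
                    i++; // WB5, WB8, WB9, WB10: no break inside alphanumeric runs
                } else if (i + 1 < n && bridges(text.charAt(i - 1), c, text.charAt(i + 1))) {
                    i += 2; // consume the single mid character and the character after it
                } else {
                    break; // any other context ends the token
                }
            }
            tokens.add(text.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {"A::B", "1..2", "A..B", "A.:B", "A:.B", "1,,2", "1,.2", "1.,2"};
        for (String input : inputs) {
            System.out.println(input + " -> " + tokenize(input));
        }
    }
}
```

Under these simplified rules, every sequence from the list splits into two tokens, while inputs with a single mid character between compatible neighbours, such as "A:B" or "1,2", stay intact.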