Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 4.6.1
- None
- New, Patch Available
Description
StandardTokenizer should split all of the following sequences into two tokens each, but they are all instead kept intact and output as single tokens:
"A::B" (':' is in \p{Word_Break = MidLetter})
"1..2", "A..B" ('.' is in \p{Word_Break = MidNumLet})
"A.:B"
"A:.B"
"1,,2" (',' is in \p{Word_Break = MidNum})
"1,.2"
"1.,2"
Unfortunately, the word break test data released with Unicode (e.g. for Unicode 6.3: http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt ), which is incorporated into versioned Lucene tests (e.g. WordBreakTestUnicode_6_3_0), doesn't cover these cases.
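
To make the expected behavior concrete, here is a minimal, self-contained sketch of the relevant UAX #29 word break rules (WB5 through WB12) applied to the sequences listed in this description. It is not StandardTokenizer's implementation and does not use any Lucene API: the class name WordBreakSketch and all of its methods are hypothetical, it only understands ASCII letters, ASCII digits, and the three punctuation characters named above, and it assumes the Word_Break property values stated in this issue. Segments without a letter or digit are dropped, which approximates how the tokenizer discards punctuation-only spans.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
final class WordBreakSketch {
    enum Cls { ALETTER, NUMERIC, MID_LETTER, MID_NUM_LET, MID_NUM, OTHER }

    // Simplified classification: ASCII only, property values as stated in the issue.
    static Cls classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return Cls.ALETTER;
        if (c >= '0' && c <= '9') return Cls.NUMERIC;
        switch (c) {
            case ':': return Cls.MID_LETTER;
            case '.': return Cls.MID_NUM_LET;
            case ',': return Cls.MID_NUM;
            default:  return Cls.OTHER;
        }
    }

    // Class of the character at index i, or OTHER when i is out of range.
    static Cls at(String s, int i) {
        return (i >= 0 && i < s.length()) ? classify(s.charAt(i)) : Cls.OTHER;
    }

    static boolean alnum(Cls c)        { return c == Cls.ALETTER || c == Cls.NUMERIC; }
    static boolean joinsLetters(Cls c) { return c == Cls.MID_LETTER || c == Cls.MID_NUM_LET; }
    static boolean joinsDigits(Cls c)  { return c == Cls.MID_NUM || c == Cls.MID_NUM_LET; }

    // True when the rules forbid a break between s[i-1] and s[i].
    static boolean noBreakBefore(String s, int i) {
        Cls prevPrev = at(s, i - 2), prev = at(s, i - 1), cur = at(s, i), next = at(s, i + 1);
        if (alnum(prev) && alnum(cur)) return true;                                        // WB5, WB8, WB9, WB10
        if (prev == Cls.ALETTER && joinsLetters(cur) && next == Cls.ALETTER) return true;     // WB6
        if (prevPrev == Cls.ALETTER && joinsLetters(prev) && cur == Cls.ALETTER) return true; // WB7
        if (prevPrev == Cls.NUMERIC && joinsDigits(prev) && cur == Cls.NUMERIC) return true;  // WB11
        if (prev == Cls.NUMERIC && joinsDigits(cur) && next == Cls.NUMERIC) return true;      // WB12
        return false; // otherwise a break is allowed here
    }

    // Splits at every allowed break and keeps only segments containing a letter or digit.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || !noBreakBefore(s, i)) {
                String segment = s.substring(start, i);
                if (segment.chars().anyMatch(Character::isLetterOrDigit)) tokens.add(segment);
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = { "A::B", "1..2", "A..B", "A.:B", "A:.B", "1,,2", "1,.2", "1.,2" };
        for (String in : inputs) {
            System.out.println(in + " -> " + tokenize(in)); // each prints two tokens, e.g. A::B -> [A, B]
        }
        System.out.println("A.B -> " + tokenize("A.B"));    // single separator: A.B -> [A.B]
    }
}
```

A single MidLetter, MidNumLet, or MidNum character between two letters or two digits keeps the sequence together (WB6/WB7, WB11/WB12), but two such characters in a row satisfy neither pair of rules, so a break is allowed on both sides of each separator. Cases of this shape are what the versioned tests would need in addition to the data from WordBreakTest.txt.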