Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 6.2.1
- Fix Version/s: None
- Environment: Mac OS X 10.10
- Lucene Fields: New
Description
Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.
e.g.
碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠.
A similar case occurs when numbers are appended to the text.
e.g.
生活报8月4号 --> 生活|报|8|月|4|号
生活报 --> 生活报
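For context, the punctuation in the failing case is presumably a fullwidth CJK mark such as the comma U+FF0C rather than the ASCII comma, so the analyzer sees a different codepoint class at the end of the sentence. A quick stdlib check (the class name PunctCheck is illustrative only):

```java
public class PunctCheck {
    public static void main(String[] args) {
        char c = '，'; // fullwidth CJK comma as typed by a Chinese IME
        // Print the codepoint and its Unicode block
        System.out.println("U+" + Integer.toHexString(c).toUpperCase()); // U+FF0C
        System.out.println(Character.UnicodeBlock.of(c)); // HALFWIDTH_AND_FULLWIDTH_FORMS
    }
}
```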
Test Sample:
public static void main(String[] args) throws IOException {
    Analyzer analyzer = new SmartChineseAnalyzer(); // analyzer construction omitted in the original report; assumed here
    printTokens(analyzer, "生活报8月4号");
    printTokens(analyzer, "生活报");
    printTokens(analyzer, "碧绿的眼珠，");
    printTokens(analyzer, "碧绿的眼珠");
}

private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
    System.out.println("sentence:" + sentence);
    TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
    tokens.reset();
    CharTermAttribute termAttr = tokens.getAttribute(CharTermAttribute.class);
    while (tokens.incrementToken()) {
        System.out.println(termAttr.toString()); // print each emitted token
    }
    tokens.end();
    tokens.close();
}
Output:
Sample1=======
sentence:生活报8月4号
生活
报
8
月
4
号
sentence:生活报
生活报
Sample2=======
sentence:碧绿的眼珠,
碧绿
的
眼
珠
sentence:碧绿的眼珠
碧绿
的
眼珠