[LUCENE-7509] [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 6.2.1
Fix Version/s: None
Component/s: modules/analysis
Labels:
- chinese
- tokenization
Environment:

Mac OS X 10.10

Lucene Fields:

New

Description

Some chinese text is not tokenized correctly with Chinese punctuation marks appended.

e.g.
碧绿的眼珠 is tokenized as 碧绿|的|眼珠. Which is correct.

But
碧绿的眼珠，（with a Chinese punctuation appended )is tokenized as 碧绿|的|眼|珠，

The similar case happens when text with numbers appended.

e.g.
生活报8月4号 -->生活|报|8|月|4|号
生活报-->生活报

Test Sample:
public static void main(String[] args) throws IOException

{ Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */ System.out.println("Sample1======="); String sentence = "生活报8月4号"; printTokens(analyzer, sentence); sentence = "生活报"; printTokens(analyzer, sentence); System.out.println("Sample2======="); sentence = "碧绿的眼珠，"; printTokens(analyzer, sentence); sentence = "碧绿的眼珠"; printTokens(analyzer, sentence); analyzer.close(); }

private static void printTokens(Analyzer analyzer, String sentence) throws IOException{
System.out.println("sentence:" + sentence);
TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
tokens.reset();
CharTermAttribute termAttr = (CharTermAttribute) tokens.getAttribute(CharTermAttribute.class);
while (tokens.incrementToken())

{ System.out.println(termAttr.toString()); }

tokens.close();
}

Output:
Sample1=======
sentence:生活报8月4号
生活
报
8
月
4
号
sentence:生活报
生活报
Sample2=======
sentence:碧绿的眼珠，
碧绿
的
眼
珠
sentence:碧绿的眼珠
碧绿
的
眼珠

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: peina

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 20/Oct/16 03:09

Updated:: 28/Aug/22 15:05