Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7509

[smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 6.2.1
    • None
    • modules/analysis
    • Mac OS X 10.10

    • New

    Description

      Some chinese text is not tokenized correctly with Chinese punctuation marks appended.

      e.g.
      碧绿的眼珠 is tokenized as 碧绿|的|眼珠. Which is correct.

      But
      碧绿的眼珠,(with a Chinese punctuation appended )is tokenized as 碧绿|的|眼|珠,

      The similar case happens when text with numbers appended.

      e.g.
      生活报8月4号 -->生活|报|8|月|4|号
      生活报-->生活报

      Test Sample:
      public static void main(String[] args) throws IOException

      { Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */ System.out.println("Sample1======="); String sentence = "生活报8月4号"; printTokens(analyzer, sentence); sentence = "生活报"; printTokens(analyzer, sentence); System.out.println("Sample2======="); sentence = "碧绿的眼珠,"; printTokens(analyzer, sentence); sentence = "碧绿的眼珠"; printTokens(analyzer, sentence); analyzer.close(); }

      private static void printTokens(Analyzer analyzer, String sentence) throws IOException{
      System.out.println("sentence:" + sentence);
      TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
      tokens.reset();
      CharTermAttribute termAttr = (CharTermAttribute) tokens.getAttribute(CharTermAttribute.class);
      while (tokens.incrementToken())

      { System.out.println(termAttr.toString()); }

      tokens.close();
      }

      Output:
      Sample1=======
      sentence:生活报8月4号
      生活

      8

      4

      sentence:生活报
      生活报
      Sample2=======
      sentence:碧绿的眼珠,
      碧绿



      sentence:碧绿的眼珠
      碧绿

      眼珠

      Attachments

        Activity

          People

            Unassigned Unassigned
            peina peina
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: