  Lucene - Core / LUCENE-7509

[smartcn] Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.2.1
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Environment: Mac OS X 10.10
    • Lucene Fields: New

      Description

      Some Chinese text is tokenized incorrectly when a Chinese punctuation mark is appended to it.

      e.g.
      碧绿的眼珠 is correctly tokenized as 碧绿|的|眼珠.

      But 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠, which splits 眼珠 into two tokens.

      A similar problem occurs when numbers are appended to the text.

      e.g.
      生活报8月4号 --> 生活|报|8|月|4|号
      生活报 --> 生活报

      Test Sample:
      import java.io.IOException;

      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      // The class name is arbitrary; the original sample showed only the two methods.
      public class SmartcnTokenizeSample {

        public static void main(String[] args) throws IOException {
          // The default constructor loads the built-in stopword list.
          Analyzer analyzer = new SmartChineseAnalyzer();
          System.out.println("Sample1=======");
          printTokens(analyzer, "生活报8月4号");
          printTokens(analyzer, "生活报");
          System.out.println("Sample2=======");
          printTokens(analyzer, "碧绿的眼珠,");
          printTokens(analyzer, "碧绿的眼珠");
          analyzer.close();
        }

        // Prints every token the analyzer emits for the sentence, one per line.
        private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
          System.out.println("sentence:" + sentence);
          try (TokenStream tokens = analyzer.tokenStream("dummyfield", sentence)) {
            CharTermAttribute termAttr = tokens.addAttribute(CharTermAttribute.class);
            tokens.reset();
            while (tokens.incrementToken()) {
              System.out.println(termAttr.toString());
            }
            tokens.end();
          }
        }
      }

      Output:
      Sample1=======
      sentence:生活报8月4号
      生活
      报
      8
      月
      4
      号
      sentence:生活报
      生活报
      Sample2=======
      sentence:碧绿的眼珠,
      碧绿
      的
      眼
      珠
      sentence:碧绿的眼珠
      碧绿
      的
      眼珠
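
      The expectation behind this report is that appending something to a text should not change how the original part is tokenized. The following is a minimal sketch of that check in plain Java, with no Lucene dependency. The class TokenPrefixCheck and its methods parse and isPrefixStable are hypothetical helpers introduced here for illustration and are not part of Lucene. The token lists are copied from the description, with the trailing punctuation left out because the analyzer does not print it as a token.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Lucene) for the invariant this report expects. */
final class TokenPrefixCheck {

    /** Splits the pipe-delimited notation used in this report, e.g. "碧绿|的|眼珠". */
    static List<String> parse(String piped) {
        return Arrays.asList(piped.split("\\|"));
    }

    /**
     * Returns true if the tokens of the bare text appear unchanged at the start
     * of the tokens produced after something was appended to that text.
     */
    static boolean isPrefixStable(List<String> bare, List<String> appended) {
        if (appended.size() < bare.size()) {
            return false;
        }
        return appended.subList(0, bare.size()).equals(bare);
    }

    public static void main(String[] args) {
        // Reported behavior: 眼珠 is split once punctuation is appended.
        System.out.println(isPrefixStable(parse("碧绿|的|眼珠"), parse("碧绿|的|眼|珠")));
        // Reported behavior: 生活报 is split once numbers are appended.
        System.out.println(isPrefixStable(parse("生活报"), parse("生活|报|8|月|4|号")));
    }
}
```

      Both calls in main return false for the token lists reported above, which is the symptom described. Feeding the helper the live output of printTokens instead of hard-coded lists would turn it into a simple regression check.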

            People

            • Assignee: Unassigned
            • Reporter: peina peina