Lucene - Core / LUCENE-7509

[smartcn] Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.2.1
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Environment:

      Mac OS X 10.10

    • Lucene Fields:
      New

      Description

      Some Chinese text is not tokenized correctly when Chinese punctuation marks are appended.

      e.g.
      碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.

      But
      碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠,

      A similar case happens when numbers are appended to the text.

      e.g.
      生活报8月4号 -->生活|报|8|月|4|号
      生活报-->生活报

      Test Sample:

      public static void main(String[] args) throws IOException {
          Analyzer analyzer = new SmartChineseAnalyzer(); // will load stopwords
          System.out.println("Sample1=======");
          String sentence = "生活报8月4号";
          printTokens(analyzer, sentence);
          sentence = "生活报";
          printTokens(analyzer, sentence);
          System.out.println("Sample2=======");
          sentence = "碧绿的眼珠,";
          printTokens(analyzer, sentence);
          sentence = "碧绿的眼珠";
          printTokens(analyzer, sentence);
          analyzer.close();
      }

      private static void printTokens(Analyzer analyzer, String sentence) throws IOException {
          System.out.println("sentence:" + sentence);
          TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
          tokens.reset();
          CharTermAttribute termAttr = tokens.getAttribute(CharTermAttribute.class);
          while (tokens.incrementToken()) {
              System.out.println(termAttr.toString());
          }
          tokens.end(); // consume the stream fully before closing
          tokens.close();
      }

      Output:
      Sample1=======
      sentence:生活报8月4号
      生活
      报
      8
      月
      4
      号
      sentence:生活报
      生活报
      Sample2=======
      sentence:碧绿的眼珠,
      碧绿
      眼
      珠
      sentence:碧绿的眼珠
      碧绿
      眼珠
        Activity

        mikemccand Michael McCandless added a comment -

        Hi peina, could you please turn your test fragments into a test that fails? See e.g. https://wiki.apache.org/lucene-java/HowToContribute

        Do you know how to fix this? Is there a Unicode API we should be using to more generally check for punctuation, so that Chinese punctuation is included?
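One plausible answer to the Unicode-API question is the JDK's own general-category classification: java.lang.Character puts fullwidth and ideographic punctuation in the same punctuation categories as ASCII punctuation. The following is a hedged sketch (illustrative only, not smartcn's actual character classification):

```java
// Sketch: a Unicode-general punctuation check via Character.getType,
// which covers Chinese punctuation (U+FF0C, U+3001, U+3002, ...) as well
// as ASCII. Not code from smartcn.
public class PunctuationCheck {
    static boolean isPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPunctuation(','));   // true  (ASCII comma)
        System.out.println(isPunctuation('，'));  // true  (fullwidth comma, U+FF0C)
        System.out.println(isPunctuation('、'));  // true  (ideographic comma, U+3001)
        System.out.println(isPunctuation('。'));  // true  (ideographic full stop, U+3002)
        System.out.println(isPunctuation('报'));  // false (a letter, category Lo)
    }
}
```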

        GushGG Chang KaiShin added a comment - - edited

        This is not a bug. The underlying Viterbi algorithm that segments Chinese sentences is based on the probability of occurrence of the Chinese characters. Take the sentence "生活报8月4号" as an example. "报" here has two meanings: placed at the end of a phrase it means a newspaper, but used in conjunction with other characters it means to report something. So the algorithm segments "报" as an independent word meaning "to report". By contrast, "生活报" on its own is assumed to have a higher chance of meaning a daily newspaper. You need to add some words to the dictionary for the algorithm to learn from, so that you get the result you want.

        The same reasoning applies to the case "碧绿的眼珠,". It was segmented into 碧绿|的|眼|珠, and the punctuation "," is a stopword, so the result is 碧绿|的|眼|珠. I suggest putting the word "眼珠" into the dictionary; that should solve the problem.
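The dictionary effect described in this comment can be illustrated with a toy unigram segmenter: a sketch with made-up probabilities, not smartcn's actual model (which is a hidden-Markov-model-based segmenter with a compiled dictionary). It picks the split that maximizes the product of word probabilities, so adding "眼珠" with a reasonable probability makes the joined word beat the two single-character split:

```java
import java.util.*;

// Toy dictionary-based segmenter, for illustration only (not smartcn code).
// All probabilities below are hypothetical.
public class ToySegmenter {
    static final Map<String, Double> LOGP = new HashMap<>();
    static final double UNKNOWN = Math.log(1e-6); // penalty for unknown single chars
    static {
        LOGP.put("碧绿", Math.log(0.01));
        LOGP.put("的", Math.log(0.05));
        LOGP.put("眼", Math.log(0.002));
        LOGP.put("珠", Math.log(0.002));
    }

    // Dynamic program: best[i] = best log-probability of segmenting s[0..i).
    static List<String> segment(String s) {
        int n = s.length();
        double[] best = new double[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = Math.max(0, i - 4); j < i; j++) { // words up to 4 chars
                String w = s.substring(j, i);
                Double lp = LOGP.get(w);
                if (lp == null && w.length() == 1) lp = UNKNOWN;
                if (lp == null || best[j] == Double.NEGATIVE_INFINITY) continue;
                if (best[j] + lp > best[i]) { best[i] = best[j] + lp; back[i] = j; }
            }
        }
        LinkedList<String> out = new LinkedList<>();
        for (int i = n; i > 0; i = back[i]) out.addFirst(s.substring(back[i], i));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segment("碧绿的眼珠"));   // [碧绿, 的, 眼, 珠]
        LOGP.put("眼珠", Math.log(0.01));            // "add 眼珠 to the dictionary"
        System.out.println(segment("碧绿的眼珠"));   // [碧绿, 的, 眼珠]
    }
}
```

With 眼|珠 costing 2·log(0.002) but 眼珠 costing only log(0.01), the dictionary entry wins; the same mechanism explains why entries like 生活报 keep multi-character words together.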

        peina peina added a comment -

        Thanks. Makes sense to me.

        peina peina added a comment -

        BTW, is there any chance that https://issues.apache.org/jira/browse/LUCENE-7508 will be fixed?


          People

          • Assignee: Unassigned
          • Reporter: peina
          • Votes: 0
          • Watchers: 3