Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6736

SmartChineseAnalyzer chops English tokens in a chinese-english mixed sentence.

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 5.1
    • None
    • modules/analysis
    • linux Java 1.7

    • New

    Description

      I am new with Lucene Analyzer. The following code has predefined the sentence in "testStr":
      String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";

      The printed tokenized result is:

      女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

      As you can see some long English tokens such as Japanese, position and congratulations are cut short in the tokenization process. I hope I didn't use it wrong.

      Test code:

      private static void testChineseTokenizer() {
      String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
      Analyzer analyzer = new SmartChineseAnalyzer();
      List<String> result = new ArrayList<String>();
      StringReader sr = new StringReader(testStr);

      try {
      TokenStream stream = analyzer.tokenStream(null,sr);
      CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken())

      { String token = cattr.toString(); result.add(token); }

      stream.end();
      stream.close();
      sr.close();
      analyzer.close();
      stream = null;
      for (String tok: result)

      { System.out.print(" " + tok); }

      System.out.println();
      }
      catch(IOException e)

      { // not thrown b/c we're using a string reader... }

      }

      Attachments

        Activity

          People

            Unassigned Unassigned
            waynexin Wayne Xin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: