Lucene - Core
LUCENE-9100

JapaneseTokenizer produces inconsistent tokens


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

      Description

      We use JapaneseTokenizer in production and are seeing some inconsistent behavior. With the text
      "マギアリス【単版話】 4話 (Unlimited Comics)" I get different results if I insert a space before the `【` character. Here is a small code snippet demonstrating the case (note that we use our own dictionary and connection costs):

        // imports needed: org.apache.lucene.analysis.Analyzer, TokenStream, Tokenizer, LowerCaseFilter,
        // org.apache.lucene.analysis.ja.JapaneseTokenizer,
        // org.apache.lucene.analysis.tokenattributes.CharTermAttribute,
        // java.io.IOException, java.io.StringReader
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // Built-in dictionary variant (does not reproduce the issue; see below):
                // Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), null, true, JapaneseTokenizer.Mode.SEARCH);
                Tokenizer tokenizer = new JapaneseTokenizer(newAttributeFactory(), dictionaries.systemDictionary, dictionaries.unknownDictionary, dictionaries.connectionCosts, null, true, JapaneseTokenizer.Mode.SEARCH);
                return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
            }
        };
        String text1 = "マギアリス【単版話】 4話 (Unlimited Comics)";
        String text2 = "マギアリス 【単版話】 4話 (Unlimited Comics)"; // space inserted before 【
        for (String text : new String[] {text1, text2}) {
            try (TokenStream tokens = analyzer.tokenStream("field", new StringReader(text))) {
                CharTermAttribute chars = tokens.addAttribute(CharTermAttribute.class);
                tokens.reset();
                while (tokens.incrementToken()) {
                    System.out.println(chars.toString());
                }
                tokens.end();
            } catch (IOException e) {
                // should never happen with a StringReader
                throw new RuntimeException(e);
            }
        }

      Output is:

      //text1
      マギ
      アリス
      単
      版
      話
      4
      話
      unlimited
      comics
      
      //text2
      マギア
      リス
      単
      版
      話
      4
      話
      unlimited
      comics

      It looks like the tokenizer doesn't treat the 【 character (which is of the Character.START_PUNCTUATION type) as an indicator that there should be a token break, and somehow that punctuation character causes the difference in the output.
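
      For reference, the character-class claim is easy to verify with plain JDK calls; this is just a sanity check, separate from the repro above:

        // 【 is U+3010 (LEFT BLACK LENTICULAR BRACKET); the JDK puts it in the
        // "open punctuation" general category, i.e. Character.START_PUNCTUATION.
        System.out.println(Character.getType('【') == Character.START_PUNCTUATION); // true
        System.out.println(Integer.toHexString('【')); // 3010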

      If I use the JapaneseTokenizer with the built-in dictionary (the commented-out constructor in the snippet above), the problem doesn't manifest, because マギアリス isn't tokenized into multiple tokens and is output as-is.
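
      For comparison, here is a minimal sketch of that built-in-dictionary variant (assuming the system dictionary bundled with the lucene-analyzers-kuromoji module); per the observation above, it keeps マギアリス as a single token whether or not the space is present:

        Analyzer builtIn = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                // null user dictionary, discardPunctuation = true, SEARCH mode;
                // the bundled system dictionary is used implicitly.
                Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
                return new TokenStreamComponents(tokenizer, new LowerCaseFilter(tokenizer));
            }
        };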


    People

    • Assignee: Unassigned
    • Reporter: Elbek Kamoliddinov (elbek.dev@gmail.com)
    • Votes: 0
    • Watchers: 4
