  1. Lucene - Core
  2. LUCENE-8325

smartcn analyzer can't handle SURROGATE char

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • Component/s: None
    • Labels:
    • Lucene Fields:
      New

      Description

      This issue is from https://github.com/elastic/elasticsearch/issues/30739

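      The smartcn analyzer can't handle surrogate characters, that is, code points outside the Basic Multilingual Plane that Java stores as a pair of UTF-16 code units. Example: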

       

       

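      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      Analyzer ca = new SmartChineseAnalyzer();
      String sentence = "\uD862\uDE0F"; // 𨨏 (U+28A0F), one code point stored as a surrogate pair
      TokenStream tokenStream = ca.tokenStream("", sentence);
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
          String term = charTermAttribute.toString();
          System.out.println(term); // one line per emitted term
      }
      tokenStream.end();   // finish the stream before closing it
      tokenStream.close();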
      

       

      The above code snippet outputs two terms, one for each half of the surrogate pair:

       

      ? 
      ? 
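
      The two "?" lines are consistent with the pair being split into two unpaired surrogates, since a lone surrogate cannot be encoded and Java substitutes "?" when it writes one out. The following is a minimal sketch of that effect using only the JDK; the class name LoneSurrogateDemo and the helper roundTrip are made up for illustration and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateDemo {
    // Hypothetical helper: encode one UTF-16 code unit to UTF-8 and decode it
    // back, roughly what happens when a term is printed to a UTF-8 console.
    static String roundTrip(char unit) {
        byte[] bytes = String.valueOf(unit).getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F";
        // Two UTF-16 code units, but only one Unicode code point.
        System.out.println(sentence.length());
        System.out.println(sentence.codePointCount(0, sentence.length()));
        // Handling each code unit on its own leaves two unpaired surrogates.
        for (char unit : sentence.toCharArray()) {
            System.out.println(Character.isSurrogate(unit) + " " + roundTrip(unit));
        }
    }
}
```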
      

       

      I have created a patch to try to fix this; please help review it. Since smartcn only supports GBK characters, the patch simply handles a surrogate pair as a single character.
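
      To illustrate what treating a surrogate pair as a single character can look like, here is a minimal sketch that walks a string by code point instead of by UTF-16 code unit. It is not the attached patch and does not use any smartcn internals; the class CodePointSplitter and the method splitByCodePoint are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointSplitter {
    // Hypothetical helper, not taken from the attached patch: returns one
    // string per Unicode code point, so a surrogate pair stays together.
    static List<String> splitByCodePoint(String text) {
        List<String> units = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            // charCount is 2 for a supplementary code point, otherwise 1.
            int width = Character.charCount(codePoint);
            units.add(text.substring(i, i + width));
            i += width;
        }
        return units;
    }

    public static void main(String[] args) {
        // Prints the UTF-16 length and the hex code point of each unit.
        for (String unit : splitByCodePoint("中\uD862\uDE0F文")) {
            System.out.println(unit.length() + " " + Integer.toHexString(unit.codePointAt(0)));
        }
    }
}
```

      With the sample input, the middle unit has length 2 and code point 28a0f, while the two BMP characters around it each have length 1.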

            People

            • Assignee:
              Unassigned
            • Reporter:
              chengpohi
            • Votes:
              0
            • Watchers:
              3
