  1. Lucene - Core
  2. LUCENE-8325

smartcn analyzer can't handle SURROGATE char

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.4, 8.0
    • None
    • New

    Description

      This issue is from https://github.com/elastic/elasticsearch/issues/30739

      The smartcn analyzer can't handle surrogate characters. Example:

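      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
      import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

      Analyzer ca = new SmartChineseAnalyzer();
      String sentence = "\uD862\uDE0F"; // 𨨏 (U+28A0F), encoded as a surrogate pair
      TokenStream tokenStream = ca.tokenStream("", sentence);
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
          String term = charTermAttribute.toString();
          System.out.println(term); // prints each token on its own line
      }
      tokenStream.end();
      tokenStream.close();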

      The above code snippet outputs:

      ?
      ?

      I have created a patch to try to fix this; please help review it. Since smartcn only supports GBK characters, the patch simply handles a surrogate pair as a single character.
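
      To illustrate why the pair shows up as two "?" tokens, here is a minimal sketch using only the JDK. It is not the attached patch and does not use Lucene; the class and method names (SurrogateSplitDemo, splitByChar, splitByCodePoint) are hypothetical. It only shows the general difference between walking a string by UTF-16 code unit, which splits a supplementary character into two unpaired halves, and walking it by code point, which keeps the pair together as a single character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of Lucene or of the attached patch.
final class SurrogateSplitDemo {

    // Splits by UTF-16 code unit: a supplementary character becomes two halves.
    static List<String> splitByChar(String text) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            units.add(String.valueOf(text.charAt(i)));
        }
        return units;
    }

    // Splits by Unicode code point: a surrogate pair stays together.
    static List<String> splitByCodePoint(String text) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int width = Character.charCount(codePoint); // 2 for a surrogate pair, else 1
            units.add(text.substring(i, i + width));
            i += width;
        }
        return units;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // same input as in the report
        System.out.println(splitByChar(sentence).size());      // 2
        System.out.println(splitByCodePoint(sentence).size()); // 1
    }
}
```

      Each lone half produced by the first method is not a valid character on its own, which is consistent with the two "?" lines above.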

          People

            Assignee: Unassigned
            Reporter: chengpohi