Lucene - Core / LUCENE-8532

nori analyzer issue with trailing space

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 7.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      We can reproduce the issue from Elasticsearch.

      When we run the following command:

      GET _analyze
      {
        "analyzer": "nori",
        "text": "공단시"
      }

      It returns the following as expected:

      {
        "tokens": [
          {
            "token": "공단",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
          },
          {
            "token": "시",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 1
          }
        ]
      }

      But if we run it with "공단시 " (note the trailing space):

      GET _analyze
      {
        "analyzer": "nori",
        "text": "공단시 "
      }

      It returns:

      {
        "tokens": [
          {
            "token": "공단",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
          },
          {
            "token": "씨",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 1
          }
        ]
      }

      The second token should be "시", not "씨". The offsets, type, and position are the same in both responses; only the token text differs.
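
The difference between the two responses quoted above can also be checked mechanically, which may be useful as a regression check. The following is a minimal Python sketch and is not part of this issue or the attached patch: `token` and `diff_tokens` are hypothetical helper names, and the token lists are copied from the responses in this report rather than fetched from a live cluster.

```python
# Token lists copied from the two _analyze responses in this report.
# token() and diff_tokens() are hypothetical helpers for illustration;
# they are not part of Lucene or Elasticsearch.

def token(text, start, end, position):
    """Build a token dict shaped like an entry in the _analyze response."""
    return {
        "token": text,
        "start_offset": start,
        "end_offset": end,
        "type": "word",
        "position": position,
    }


# Response for "공단시" (no trailing space): the expected output.
WITHOUT_TRAILING_SPACE = [token("공단", 0, 2, 0), token("시", 2, 3, 1)]
# Response for "공단시 " (trailing space): the reported output.
WITH_TRAILING_SPACE = [token("공단", 0, 2, 0), token("씨", 2, 3, 1)]


def diff_tokens(expected, actual):
    """Return (position, field, expected_value, actual_value) per mismatch."""
    diffs = []
    if len(expected) != len(actual):
        diffs.append((None, "length", len(expected), len(actual)))
    # Compare tokens pairwise, field by field.
    for exp, act in zip(expected, actual):
        for field in exp:
            if exp[field] != act.get(field):
                diffs.append((exp["position"], field, exp[field], act.get(field)))
    return diffs


if __name__ == "__main__":
    for position, field, want, got in diff_tokens(
        WITHOUT_TRAILING_SPACE, WITH_TRAILING_SPACE
    ):
        print(f"position {position}: {field} expected {want!r}, got {got!r}")
```

With the data from this report, the only mismatch is the `token` field at position 1. To run the same check against a real cluster, the two lists would be replaced by the `tokens` arrays returned by the `_analyze` requests shown above.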

      Attachments

        1. LUCENE-8532.patch (9 kB, Jim Ferenczi)


          People

            Assignee: Unassigned
            Reporter: Kiju Kim (kiju98)
            Votes: 0
            Watchers: 3
