[LUCENE-8532] nori analyzer issue with trailing space - ASF JIRA

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 7.4
Fix Version/s: None
Component/s: modules/analysis
Labels:
None
Environment:

Elasticsearch version: 6.4.2, Build: default/tar/04711c2/2018-09-26T13:34:09.098244Z, JVM: 1.8.0_131

Plugins installed: [analysis-nori]

JVM version:
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version: Darwin Kijuui-MacBook-Pro.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64

We can reproduce it from Elasticsearch.

When we run the following command:

GET _analyze
{
  "analyzer": "nori",
  "text": "공단시"
}

It returns the following as expected:

{
  "tokens": [
    { "token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "시", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }
  ]
}

But if we run the same request with "공단시 " (note the trailing space):

GET _analyze

{ "analyzer": "nori", "text": "공단시 " }

It returns the following:

{
  "tokens": [
    { "token": "공단", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "씨", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1 }
  ]
}

The second token should be "시" instead of "씨".
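
The two requests above can also be compared from a script. The following is a minimal sketch rather than part of the original report: it assumes an Elasticsearch node with the analysis-nori plugin listening at http://localhost:9200, and the helper names `analyze`, `token_mismatches`, and `report` are hypothetical, introduced only for illustration.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node with the analysis-nori plugin installed.
ES_URL = "http://localhost:9200"


def analyze(text, analyzer="nori", base_url=ES_URL):
    """Send text to the _analyze API and return the token strings in order."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


def token_mismatches(expected, actual):
    """Return (position, expected_token, actual_token) for each position that differs."""
    mismatches = []
    for position in range(max(len(expected), len(actual))):
        want = expected[position] if position < len(expected) else None
        got = actual[position] if position < len(actual) else None
        if want != got:
            mismatches.append((position, want, got))
    return mismatches


def report(base_url=ES_URL):
    """Analyze the text with and without the trailing space and print any differences."""
    without_space = analyze("공단시", base_url=base_url)
    with_space = analyze("공단시 ", base_url=base_url)
    for position, want, got in token_mismatches(without_space, with_space):
        print(f"position {position}: expected {want!r}, got {got!r}")
```

Calling `report()` against an affected node should print one line for position 1 if the behavior described above is present, and nothing if both inputs produce the same tokens.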