[LUCENE-9088] JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: modules/analysis
Labels: None

Lucene Fields: New

Description

According to the JapaneseNumberFilter javadocs, the filter gives the normalized number token the attribute values of the last token used to compose it, which can be wrong. Although this behavior is documented, it leads to a number of incompatibilities with other Japanese token filters.

For example, for an input text of "2008 2009", taking the PartOfSpeechAttribute of the last token used leads to the following output (some attributes left out...):

```

{ "token" : "2008", "start_offset" : 0, "end_offset" : 4, "type" : "word", [...] "partOfSpeech" : "記号-空白", "partOfSpeech (en)" : "symbol-space" [...] }

{ "token" : " ", "start_offset" : 4, "end_offset" : 5, "type" : "word", [...] "partOfSpeech" : "記号-空白", "partOfSpeech (en)" : "symbol-space", [...] }

{ "token" : "2009", "start_offset" : 5, "end_offset" : 9, "type" : "word", [...] "partOfSpeech" : "名詞-数", "partOfSpeech (en)" : "noun-numeric" }

```

As a result, a subsequent `kuromoji_part_of_speech` filter, for example, will eliminate the "2008" token because it is erroneously tagged as "symbol-space".

Even without fixing the other token attributes, the POS attribute should IMHO be set to "noun-numeric", since that's what the filter is supposed to detect.
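To make the downstream effect concrete, here is a minimal, self-contained sketch. It is a toy model only and does not use Lucene: `Token`, `filterByPos`, `reportedTokens` and `suggestedTokens` are hypothetical names made up for this illustration, the English tag names stand in for the Japanese ones, and the stop-tag set containing just "symbol-space" is an assumption about how a part-of-speech stop filter might be configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PosTaggingSketch {

    /** Hypothetical stand-in for a token and its part-of-speech tag; not a Lucene class. */
    static final class Token {
        final String term;
        final String partOfSpeech;

        Token(String term, String partOfSpeech) {
            this.term = term;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Toy part-of-speech stop filter: keeps the terms whose tag is not in stopTags. */
    static List<String> filterByPos(List<Token> tokens, Set<String> stopTags) {
        List<String> kept = new ArrayList<>();
        for (Token token : tokens) {
            if (!stopTags.contains(token.partOfSpeech)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    /** Tokens and tags as reported in this issue for the input "2008 2009". */
    static List<Token> reportedTokens() {
        return List.of(
            new Token("2008", "symbol-space"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
    }

    /** Same tokens, with the suggested tagging: number tokens carry "noun-numeric". */
    static List<Token> suggestedTokens() {
        return List.of(
            new Token("2008", "noun-numeric"),
            new Token(" ", "symbol-space"),
            new Token("2009", "noun-numeric"));
    }

    public static void main(String[] args) {
        // Assumption: the downstream filter is configured to drop "symbol-space" tokens.
        Set<String> stopTags = Set.of("symbol-space");
        System.out.println(filterByPos(reportedTokens(), stopTags));  // prints [2009]
        System.out.println(filterByPos(suggestedTokens(), stopTags)); // prints [2008, 2009]
    }
}
```

With the reported tags, the "2008" token is dropped together with the whitespace token; with the suggested tagging, both numbers survive the filter.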

Issue Links

links to GitHub Pull Request #1073

People

Assignee: Unassigned

Reporter: Christoph Büscher

Votes: 0

Watchers: 2

Dates

Created: 10/Dec/19 12:44

Updated: 28/Aug/22 15:54

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m