[LUCENE-9413] Add a char filter corresponding to CJKWidthFilter - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0, 8.8
Component/s: None
Labels: None
Lucene Fields: New

Description

This issue relates to two Elasticsearch issues (https://github.com/elastic/elasticsearch/issues/58384 and https://github.com/elastic/elasticsearch/issues/58385). A char filter corresponding to CJKWidthFilter might be useful for the default Japanese analyzer.

Although I don't think it is a bug that FULL and HALF width characters are not normalized before tokenization, the behaviour sometimes confuses beginners and users who have limited knowledge of Japanese analysis (and Unicode).

If we have a FULL and HALF width character normalization char filter in analyzers-common, we can include it in JapaneseAnalyzer. Currently, JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, so some FULL width numbers or Latin letters are split apart by the tokenizer before they can be normalized.
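To illustrate the kind of normalization meant here, the following is a minimal sketch of folding FULL width ASCII variants (U+FF01 to U+FF5E) into their basic Latin equivalents before any tokenizer sees the text. It is not the Lucene implementation: the class and method names are hypothetical, it ignores HALF width katakana entirely, and a real char filter would also have to keep offset corrections in sync.

```java
// Hypothetical sketch, not Lucene code: folds FULL width ASCII variants
// into basic Latin so that a later tokenizer sees plain ASCII.
final class WidthFoldSketch {

    // FULL width ASCII variants occupy U+FF01..U+FF5E and sit at a fixed
    // distance of 0xFEE0 from their basic Latin counterparts U+0021..U+007E.
    private static final int FULL_WIDTH_OFFSET = 0xFEE0;

    static String foldFullWidthAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                // One-to-one mapping, so the text length does not change.
                out.append((char) (c - FULL_WIDTH_OFFSET));
            } else {
                // Everything else (kanji, kana, plain ASCII) passes through.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For example, `WidthFoldSketch.foldFullWidthAscii("ＬＵＣＥＮＥ９")` returns `"LUCENE9"`. Applying a step like this before tokenization is the point of the proposal: the tokenizer then receives one run of ASCII letters and digits instead of FULL width characters it might separate.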

Issue Links

relates to

LUCENE-9853 Use CJKWidthCharFilter as the default character normalizer for JapaneseAnalyzer instead of CJKWidthFilter

Reopened

links to

GitHub Pull Request #2081

People

Assignee: Tomoko Uchida

Reporter: Tomoko Uchida

Votes: 0

Watchers: 4

Dates

Created: 20/Jun/20 05:49

Updated: 22/Nov/24 09:49

Resolved: 17/Nov/20 09:57

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 0.5h