Description
In connection with two Elasticsearch issues (https://github.com/elastic/elasticsearch/issues/58384 and https://github.com/elastic/elasticsearch/issues/58385), it might be useful to add a full-width/half-width character normalization char filter for the default Japanese analyzer.
Although I don't think it is a bug that full-width and half-width characters are not normalized before tokenization, the behaviour sometimes confuses beginners or users who have limited knowledge of Japanese analysis (and Unicode).
If we had a full-width/half-width character normalization char filter in analyzers-common, we could include it in JapaneseAnalyzer. Currently, JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, so some full-width numbers and Latin letters are split apart by the tokenizer before they can be normalized.
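To illustrate the idea (this is only a sketch, not the proposed implementation): the snippet below wires width normalization in as a char filter so that it runs before JapaneseTokenizer. It uses MappingCharFilter with a few hand-written full-width to half-width mappings as a stand-in for the proposed filter; the class name WidthCharFilterSketch and the mappings are made up for this example.

{code:java}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WidthCharFilterSketch {

  public static void main(String[] args) throws IOException {
    // Hand-written full-width -> half-width mappings. A real width-normalizing
    // char filter would fold the whole Halfwidth and Fullwidth Forms block
    // instead of listing characters by hand.
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("１", "1");
    builder.add("２", "2");
    builder.add("３", "3");
    NormalizeCharMap widthMap = builder.build();

    Analyzer analyzer = new Analyzer() {
      @Override
      protected Reader initReader(String fieldName, Reader reader) {
        // Normalize widths BEFORE tokenization, so the tokenizer sees half-width input.
        return new MappingCharFilter(widthMap, reader);
      }

      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer =
            new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
        return new TokenStreamComponents(tokenizer);
      }
    };

    // Without width normalization before tokenization, a full-width run like this
    // can be split by the tokenizer and only normalized token-by-token afterwards
    // (the situation described above). With the char filter, the tokenizer sees "123".
    try (TokenStream ts = analyzer.tokenStream("f", "１２３")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term);
      }
      ts.end();
    }
  }
}
{code}

The proposed filter would do the same kind of folding as CJKWidthFilter, only at the character level rather than the token level, so nothing would have to be listed by hand.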
Issue Links
- relates to LUCENE-9853 Use CJKWidthCharFilter as the default character normalizer for JapaneseAnalyzer instead of CJKWidthFilter (Reopened)