Lucene - Core / LUCENE-9413

Add a char filter corresponding to CJKWidthFilter

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0, 8.8
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      This is related to two Elasticsearch issues (https://github.com/elastic/elasticsearch/issues/58384 and https://github.com/elastic/elasticsearch/issues/58385). A char filter corresponding to CJKWidthFilter might be useful for the default Japanese analyzer.

      Although I don't think it is a bug that full-width and half-width characters are not normalized before tokenization, the behaviour sometimes confuses beginners and users who have limited knowledge of Japanese analysis (and Unicode).

      If analyzers-common had a char filter that normalizes full-width and half-width characters, we could include it in JapaneseAnalyzer. Currently, JapaneseAnalyzer contains CJKWidthFilter, but it is applied after tokenization, so some full-width numbers or Latin letters are split apart by the tokenizer.
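
      To illustrate the kind of normalization discussed above, here is a minimal sketch in plain Java. It is not Lucene's implementation: the class name WidthFoldSketch and the method foldFullWidthAscii are hypothetical, and it only folds full-width ASCII variants (U+FF01 to U+FF5E) into Basic Latin, relying on the constant 0xFEE0 offset between the two ranges. Folding half-width katakana is left out for brevity. In an analyzer, a transformation like this would have to run on the raw character stream before the tokenizer sees the text.

```java
public class WidthFoldSketch {

    // Hypothetical helper, for illustration only.
    // Folds full-width ASCII variants (U+FF01..U+FF5E) into Basic Latin
    // (U+0021..U+007E). The two ranges differ by a constant offset of 0xFEE0.
    // All other characters, including kanji and kana, pass through unchanged.
    static String foldFullWidthAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '\uFF01' && c <= '\uFF5E') {
                out.append((char) (c - 0xFEE0));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Full-width "ABC123" written with Unicode escapes.
        String fullWidth = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13";
        System.out.println(foldFullWidthAscii(fullWidth));
    }
}
```

      Once the text has been folded this way, a tokenizer receives ordinary ASCII letters and digits, which is the effect the proposed char filter is meant to achieve ahead of tokenization.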

            People

              Assignee: Tomoko Uchida (tomoko)
              Reporter: Tomoko Uchida (tomoko)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h