  1. Lucene - Core
  2. LUCENE-8752

Apply a patch to kuromoji dictionary to properly handle Japanese new era '令和' (REIWA)

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.1, master (9.0)
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      On May 1st, 2019, the Japanese era name ('元号', Gengo) will change to '令和' (Reiwa). See this article for more details:

      https://www.bbc.com/news/world-asia-47769566

      Currently, JapaneseTokenizer splits '令和' into '令' and '和'. It should be tokenized as a single word so that searches over Japanese text containing era names behave as users expect. Because the default Kuromoji dictionary (mecab-ipadic) has not been maintained since 2007, a one-line patch to the source CSV file is needed for this era change.
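
      To illustrate why a single missing dictionary entry causes the split, here is a toy minimum-cost segmenter in plain Java. It is only a sketch and not Kuromoji's implementation: the class name `ToySegmenter`, the word costs, and the unknown-character penalty are all hypothetical, and connection costs between adjacent words are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: picks the segmentation with the lowest total word cost.
// Works on UTF-16 code units, which is enough for the BMP characters used here.
class ToySegmenter {
    // Hypothetical penalty for a single character that is not in the dictionary.
    static final int UNKNOWN_COST = 10_000;

    static List<String> segment(String text, Map<String, Integer> dict) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: lowest cost to segment text[0, i)
        int[] prev = new int[n + 1];   // prev[i]: start of the last token ending at i
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer cost = dict.get(text.substring(start, end));
                if (cost == null) {
                    // Unknown spans are only allowed one character at a time.
                    if (end - start != 1) continue;
                    cost = UNKNOWN_COST;
                }
                long total = best[start] + cost;
                if (total < best[end]) {
                    best[end] = total;
                    prev[end] = start;
                }
            }
        }
        // Walk the back-pointers from the end of the text to recover the tokens.
        List<String> tokens = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            tokens.add(text.substring(prev[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("令", 6000);   // hypothetical costs, not taken from mecab-ipadic
        dict.put("和", 6000);
        dict.put("元年", 4000);
        System.out.println(segment("令和元年", dict)); // [令, 和, 元年]

        dict.put("令和", 5000); // the analogue of the one-line dictionary patch
        System.out.println(segment("令和元年", dict)); // [令和, 元年]
    }
}
```

      In this toy model, adding the single '令和' entry makes the one-token path cheaper than the two single-character tokens, which mirrors the intent of the dictionary patch.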

      Era names are used in many official and formal documents in Japan, so search systems should handle them properly without requiring a user dictionary or phrase queries.

      FYI, the JDK Date-Time API will support the new era in upcoming updates:
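
      As a rough way to check whether a given JDK already carries the new era data, the sketch below asks `java.time.chrono.JapaneseDate` which era it assigns to the day before and the day of the change. The class name `EraCheck` is hypothetical, and the printed names depend on the JDK build in use, so no particular output is assumed.

```java
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.format.TextStyle;
import java.util.Locale;

class EraCheck {
    // Returns the era the running JDK assigns to an ISO calendar date.
    static JapaneseEra eraOf(int isoYear, int month, int day) {
        return JapaneseDate.of(isoYear, month, day).getEra();
    }

    public static void main(String[] args) {
        JapaneseEra before = eraOf(2019, 4, 30);
        JapaneseEra after = eraOf(2019, 5, 1);
        System.out.println(before.getDisplayName(TextStyle.FULL, Locale.JAPAN));
        System.out.println(after.getDisplayName(TextStyle.FULL, Locale.JAPAN));
        // If both dates map to the same era, this JDK predates the era update.
        System.out.println(before.equals(after) ? "no new era data" : "new era data present");
    }
}
```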

      https://blogs.oracle.com/java-platform-group/a-new-japanese-era-for-java

      The patch is available here:

      https://github.com/apache/lucene-solr/pull/632

       

        Attachments

        1. LUCENE-8752.patch
          5 kB
          Tomoko Uchida

              People

              • Assignee: Tomoko Uchida
              • Reporter: Tomoko Uchida
              • Votes: 5
              • Watchers: 2

                  Time Tracking

                  • Original Estimate: Not Specified
                  • Remaining Estimate: 0h
                  • Time Spent: 20m