Lucene - Core / LUCENE-10102

Add JapaneseCompletionFilter for Input Method-aware auto-completion

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New

    Description

      Basic background information

      As you know, Japanese text is written in Kanji (ideograms), Katakana and Hiragana (phonetic scripts), and combinations of them. An intelligent auto-completion system should therefore handle all of these representations. One common practice is to translate every input to its "romanized form" (https://en.wikipedia.org/wiki/Romanization_of_Japanese), which reduces the problem to simple Latin-alphabet string matching.
      For example, given the word "桜" (surface form), we first convert it to "サクラ" (reading form) and then to "sakura" (romanized form), so that we can suggest the keyword "桜" for an incomplete query such as "さ", "サ", or "sa".
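
      The surface form, reading form, romanized form pipeline described above can be sketched in plain Java. This is only an illustration and not the code proposed in this issue: the class name RomajiPrefixDemo, the toy READINGS dictionary, and the tiny KANA table are all hypothetical, and a real system would take readings from a morphological analyzer rather than a hard-coded map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class RomajiPrefixDemo {
  // Hypothetical toy dictionary: surface form -> katakana reading.
  static final Map<String, String> READINGS =
      Map.of("桜", "サクラ", "酒", "サケ", "空", "ソラ");

  // Tiny katakana -> romaji table covering only the syllables used above.
  static final Map<Character, String> KANA =
      Map.of('サ', "sa", 'ク', "ku", 'ラ', "ra", 'ケ', "ke", 'ソ', "so");

  // Romanizes hiragana/katakana; any other character passes through unchanged.
  static String romanize(String text) {
    StringBuilder sb = new StringBuilder();
    for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
      if (c >= 'ぁ' && c <= 'ゖ') {
        c += 0x60; // the hiragana block maps onto katakana at a fixed offset
      }
      sb.append(KANA.getOrDefault(c, String.valueOf(c)));
    }
    return sb.toString();
  }

  // Returns the surface forms whose romanized reading starts with the romanized query.
  static List<String> suggest(String query) {
    String key = romanize(query);
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, String> e : READINGS.entrySet()) {
      if (romanize(e.getValue()).startsWith(key)) {
        hits.add(e.getKey());
      }
    }
    Collections.sort(hits); // Map.of has no defined iteration order
    return hits;
  }

  public static void main(String[] args) {
    for (String q : List.of("さ", "サ", "sa", "saku")) {
      System.out.println(q + " -> " + suggest(q));
    }
  }
}
```

      In this sketch the queries "さ", "サ", and "sa" all normalize to the same key "sa", so they return the same suggestions.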

       

      The difficulties
      A simplistic approach to implementing such romanization-based auto-completion is to use JapaneseReadingFormFilter (which has a "useRomaji" option). Unfortunately, this off-the-shelf method doesn't work, not through any fault of the filter itself but because of the complex interplay between multiple romanization systems and IMEs (https://en.wikipedia.org/wiki/Input_method). It is a little difficult for me to explain their detailed specifications in English, but let me provide some examples.

      1) Multiple romanization systems
      There are three major romanization systems: modified Hepburn-shiki, Kunrei-shiki (Nihon-shiki), and Wāpuro-shiki. JapaneseReadingFormFilter supports only modified Hepburn-shiki, so it cannot cover all possible romanized forms.
      e.g., "新橋" can be translated into eight romanized forms (in theory): "sinbasi", "shinbasi", "sinnbasi", "shinnbasi", "sinbashi", "shinbashi", "sinnbashi", and "shinnbashi".
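
      The eight forms listed above are the Cartesian product of the per-syllable alternatives in the reading "シンバシ" (two spellings each for "シ", "ン", and the final "シ"). A minimal sketch of that enumeration follows; the class RomajiVariantsDemo and its VARIANTS table are hypothetical, cover only this one word, and do not reflect how the proposed filter is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RomajiVariantsDemo {
  // Hypothetical table: each katakana maps to every spelling a user might type.
  static final Map<Character, List<String>> VARIANTS = Map.of(
      'シ', List.of("si", "shi"),
      'ン', List.of("n", "nn"),
      'バ', List.of("ba"));

  // Expands a katakana reading into all combinations of per-syllable spellings.
  static List<String> variants(String katakana) {
    List<String> results = List.of("");
    for (char c : katakana.toCharArray()) {
      List<String> options = VARIANTS.get(c);
      if (options == null) {
        throw new IllegalArgumentException("no table entry for " + c);
      }
      List<String> next = new ArrayList<>();
      for (String prefix : results) {
        for (String option : options) {
          next.add(prefix + option);
        }
      }
      results = next;
    }
    return results;
  }

  public static void main(String[] args) {
    // Prints the eight romanized forms of the reading of "新橋".
    System.out.println(variants("シンバシ"));
  }
}
```

      The number of forms grows multiplicatively with the number of ambiguous syllables, which is one reason a single-system romanizer is not sufficient for matching.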

      2) Interaction with the Input Method
      At query time, strings that are still in the middle of IME composition are sent to the search system, and the auto-complete system should handle them (it could simply ignore such inputs, but that hurts the user experience).
      e.g., "会sy" can be an input to an auto-completion system. If we have a way to translate it to "kaisy", we can suggest "会社" (kaisya).
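
      The "会sy" example can be sketched as follows: the already-converted part of the query is replaced by its romanized reading, the still-unconverted Latin tail is kept as typed, and the result is used as a prefix key. Everything here is hypothetical illustration (the class ImeCompositionDemo, the CONVERTED map, and the INDEX of romanized forms); it is not the logic of the proposed filter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ImeCompositionDemo {
  // Hypothetical: romanized readings for text the IME has already converted.
  static final Map<String, String> CONVERTED = Map.of("会", "kai");

  // Hypothetical index: suggestion -> romanized forms in more than one system.
  static final Map<String, List<String>> INDEX = Map.of(
      "会社", List.of("kaisya", "kaisha"),
      "会議", List.of("kaigi"));

  // Turns a mid-composition query such as "会sy" into a Latin-only key ("kaisy").
  static String normalize(String query) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < query.length(); i++) {
      String ch = query.substring(i, i + 1);
      // Converted characters become their reading; typed Latin letters pass through.
      sb.append(CONVERTED.getOrDefault(ch, ch.toLowerCase(Locale.ROOT)));
    }
    return sb.toString();
  }

  // Suggests every entry that has at least one romanized form starting with the key.
  static List<String> suggest(String query) {
    String key = normalize(query);
    List<String> hits = new ArrayList<>();
    INDEX.forEach((surface, forms) -> {
      if (forms.stream().anyMatch(f -> f.startsWith(key))) {
        hits.add(surface);
      }
    });
    Collections.sort(hits); // Map.of has no defined iteration order
    return hits;
  }

  public static void main(String[] args) {
    System.out.println(normalize("会sy"));
    System.out.println(suggest("会sy"));
  }
}
```

      Indexing more than one romanized form per entry lets both "会sy" and "会sh" reach "会社" in this sketch.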

       

      Solution
      I implemented a token filter (and added an analyzer for ease of use) that handles those two challenges. With this filter, we can use AnalyzingSuggester for fast automaton-based auto-completion for Japanese.
      (Though I acknowledge it contains some peculiar logic, I believe that complexity is necessary for a tool that deals with the intricacies of natural language systems...)

       

      Note

      • The filter has worked well for us for one year on a production system with a moderate number of business users (1,000+), and I've fixed the weird bugs we've encountered so far. My managers have also approved the donation of the code.
      • There is one thing missing: offset correction. I found that correct offset calculation is not required for auto-completion use cases, but I'm trying to emit correct offsets for completeness.

            People

              Assignee: Tomoko Uchida
              Reporter: Tomoko Uchida
              Votes: 1
              Watchers: 3

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 50m