Lucene - Core / LUCENE-3901

Add katakana stem filter to better deal with certain katakana spelling variants

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Many Japanese katakana words end in a long sound that is sometimes optional.

      For example, パーティー and パーティ are both perfectly valid for "party". Similarly, センター and センタ are variants of "center", and サーバー and サーバ are variants of "server".

      I'm proposing that we add a katakana stemmer that removes this trailing long sound from terms that are longer than a configurable length. It's also possible to add the variant as a synonym, but I think stemming is preferable from a ranking point of view.
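The proposed rule can be sketched in plain Java without any Lucene classes. This is only an illustration, not the attached patch: the class name `KatakanaStemSketch`, the method names, and the choice to treat the threshold as an inclusive minimum of 4 characters are all assumptions made for the example.

```java
// Hypothetical sketch of the stemming rule; not the code from the patch.
final class KatakanaStemSketch {
    // U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK, the trailing "long sound".
    static final char PROLONGED_SOUND_MARK = '\u30FC';

    // True if every char lies in the Unicode Katakana block (full-width katakana).
    static boolean isKatakana(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (Character.UnicodeBlock.of(term.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    // Drops one trailing long sound mark from all-katakana terms that have
    // at least minimumLength chars (assumed inclusive); otherwise returns the term as-is.
    static String stem(String term, int minimumLength) {
        if (!term.isEmpty()
                && term.length() >= minimumLength
                && term.charAt(term.length() - 1) == PROLONGED_SOUND_MARK
                && isKatakana(term)) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        // The last term has only three characters, so it falls below the assumed minimum of 4.
        for (String term : new String[] {"パーティー", "センター", "サーバー", "コピー"}) {
            System.out.println(term + " -> " + stem(term, 4));
        }
    }
}
```

With a minimum length of 4, short words such as コピー (copy) keep their long sound, which is the reason for making the length configurable.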

      1. LUCENE-3901.patch
        16 kB
        Christian Moen
      2. LUCENE-3901.patch
        15 kB
        Christian Moen
      3. LUCENE-3901.patch
        15 kB
        Christian Moen

        Activity

        Christian Moen added a comment -

        Patch for this coming up shortly.

        Christian Moen added a comment -

        Find attached a patch for this.

        The stemming is done by KuromojiKatakanaStemFilter, which has been added to KuromojiAnalyzer. A corresponding KuromojiKatakanaStemFilterFactory has been added to the text_ja field type in schema.xml.

        Note that this stemming is now turned on by default and I think it makes good sense to do so. The minimum length of a token considered for stemming is configurable and I've made the default of 4 explicit in schema.xml to convey that it's there.

        The stemmer only supports full-width katakana and should be used in combination with a CJKWidthFilter if stemming half-width characters is required and you're doing your own wiring. Both text_ja and KuromojiAnalyzer take care of this, and the default overall processing is the same.
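The ordering requirement can be illustrated with a small plain-Java sketch. It is not Lucene code: `java.text.Normalizer` with NFKC is used here as a stand-in for CJKWidthFilter (an assumption; NFKC folds more than just character width), and the class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch: width folding must run before the katakana stemmer.
final class WidthThenStemSketch {
    // Stand-in for CJKWidthFilter: NFKC maps half-width katakana to full-width forms.
    static String foldWidth(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFKC);
    }

    // Same rule as the stemmer described above: full-width katakana only,
    // inclusive minimum length (assumed), one trailing U+30FC removed.
    static String stem(String term, int minimumLength) {
        boolean allKatakana = term.chars()
                .allMatch(c -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA);
        if (!term.isEmpty()
                && allKatakana
                && term.length() >= minimumLength
                && term.charAt(term.length() - 1) == '\u30FC') {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    static String foldAndStem(String term, int minimumLength) {
        return stem(foldWidth(term), minimumLength);
    }

    public static void main(String[] args) {
        String halfWidth = "ｻｰﾊﾞｰ"; // half-width spelling of サーバー
        System.out.println("stem only:     " + stem(halfWidth, 4));
        System.out.println("fold and stem: " + foldAndStem(halfWidth, 4));
    }
}
```

Without the folding step the half-width term passes through unchanged, because its characters are outside the Katakana block and its final character is the half-width mark U+FF70 rather than U+30FC.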

        There are some test cases in TestKuromojiKatakanaStemFilter, and I've also added a case to TestKuromojiAnalyzer that demonstrates how the stemming works in combination with katakana compound splitting.

        In Japanese, "manager" can be written as either マネージャー or マネージャ (and probably also マネジャー). For the compound シニアプロジェクトマネージャー (senior project manager), we now get the tokens シニア (senior), プロジェクト (project) and マネージャ (manager), where the last token has been stemmed by removing the trailing ー. Kuromoji also makes the compound シニアプロジェクトマネージャ a synonym of シニア, and ー is removed for the synonym compound as well.
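The effect on the split compound can be mimicked with a plain-Java sketch that applies the stemming rule to a hand-written token list. The token list stands in for Kuromoji's output and is typed in by hand; the class name, method names and the minimum length of 4 are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: apply the katakana stemming rule to already-split tokens.
final class CompoundStemSketch {
    // Removes one trailing U+30FC from all-katakana terms of at least minimumLength chars.
    static String stem(String term, int minimumLength) {
        boolean allKatakana = term.chars()
                .allMatch(c -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.KATAKANA);
        if (!term.isEmpty()
                && allKatakana
                && term.length() >= minimumLength
                && term.charAt(term.length() - 1) == '\u30FC') {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Stems each token independently, preserving order.
    static List<String> stemAll(List<String> tokens, int minimumLength) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(stem(token, minimumLength));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the tokens produced by compound splitting.
        List<String> tokens = Arrays.asList("シニア", "プロジェクト", "マネージャー");
        System.out.println(stemAll(tokens, 4));
        // The third spelling variant keeps its own stem: only the trailing mark is removed.
        System.out.println(stem("マネジャー", 4));
    }
}
```

Note that under this rule マネジャー becomes マネジャ, which still differs from マネージャ: the stemmer only unifies variants that differ in the final long sound.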

        Tests pass and I've also tested this end-to-end in a Solr trunk build.

        Robert Muir added a comment -

        patch looks great!

        Christian Moen added a comment -

        Thanks a lot, Robert.

        I'll do some more testing and hopefully I can commit this to trunk and branch_3x tomorrow.

        Christian Moen added a comment -

        Updated patch with minor whitespace changes to schema.xml and added an entry in CHANGES.txt.

        Christian Moen added a comment -

        Committed revision 1304719 on trunk. Backporting to branch_3x.

        Christian Moen added a comment (edited) -

        Committed revision 1304727 on branch_3x. Fixed a small javadoc issue in revisions 1304728 and 1304741.


          People

          • Assignee:
            Christian Moen
            Reporter:
            Christian Moen