Lucene - Core / LUCENE-5224

org.apache.lucene.analysis.hunspell.HunspellDictionary should implement ICONV and OCONV lines in the affix file

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0, 4.4
    • Fix Version/s: 4.8, 6.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Some Hunspell dictionaries need to emulate Unicode normalization and collation to produce the correct stem of a word. The original Hunspell supports this through the ICONV and OCONV lines in the affix file. Lucene's HunspellDictionary currently ignores these lines.

      Please support these keys in the affix file.

      This functionality is briefly described in the hunspell man page: http://manpages.ubuntu.com/manpages/lucid/man4/hunspell.4.html

      This functionality is practically required in order to use a Korean dictionary, because stemming needs to work with only some of the jamos of a Hangul character (grapheme cluster). Other languages will benefit from it as well.
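
To illustrate the jamo point above, here is a minimal Java sketch of how one precomposed Hangul syllable relates to its conjoining jamos. It relies only on the JDK's java.text.Normalizer; the class and method names (HangulJamoSketch, decompose, compose, escape) are made up for this illustration and are not part of Lucene or Hunspell.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not Lucene or Hunspell code.
final class HangulJamoSketch {

    /** Splits precomposed Hangul syllables into conjoining jamos (Unicode NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    /** Joins conjoining jamos back into precomposed syllables (Unicode NFC). */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Renders every UTF-16 code unit as a backslash, the letter u, and four hex digits. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String syllable = "\uAC01"; // one precomposed Hangul syllable
        String jamos = decompose(syllable);
        System.out.println(escape(jamos));          // the three conjoining jamos
        System.out.println(escape(compose(jamos))); // back to the single syllable
    }
}
```

A dictionary whose affix rules add or strip individual jamos needs the decomposed form internally, which is the job an ICONV table does on input and an OCONV table undoes on output.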

      Here is an example for a .aff file:

      ICONV 각 각
      ...
      OCONV 각 각
      

      Here is the same example with the characters escaped:

      ICONV \uAC01 \u1100\u1161\u11A8
      ...
      OCONV \u1100\u1161\u11A8 \uAC01
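
The effect of the two lines above can be sketched as a pair of string replacement tables: one applied to a word before stemming, the other applied to the result. The Java below is only a rough illustration under stated assumptions. ConversionTableSketch is a hypothetical class, it does not parse .aff files, and the longest-match-first strategy is an assumption of this sketch rather than a statement about how Hunspell or the Lucene patch behaves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of an ICONV/OCONV-style table; not the Lucene implementation.
final class ConversionTableSketch {
    private final Map<String, String> rules = new LinkedHashMap<>();

    /** Registers one "from -> to" pair, as in a single ICONV or OCONV line. */
    void add(String from, String to) {
        rules.put(from, to);
    }

    /** Scans left to right; at each position the longest matching rule wins (assumed strategy). */
    String apply(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            String best = null;
            for (String from : rules.keySet()) {
                if (!from.isEmpty() && input.startsWith(from, i)
                        && (best == null || from.length() > best.length())) {
                    best = from;
                }
            }
            if (best == null) {
                out.append(input.charAt(i)); // no rule applies here, copy the char
                i++;
            } else {
                out.append(rules.get(best)); // emit the replacement, skip the match
                i += best.length();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ConversionTableSketch iconv = new ConversionTableSketch();
        iconv.add("\uAC01", "\u1100\u1161\u11A8"); // syllable -> jamos
        ConversionTableSketch oconv = new ConversionTableSketch();
        oconv.add("\u1100\u1161\u11A8", "\uAC01"); // jamos -> syllable

        String word = "\uAC01";
        String internal = iconv.apply(word);     // form the affix rules would see
        String restored = oconv.apply(internal); // form returned to the caller
        System.out.println(internal.length());
        System.out.println(restored.equals(word));
    }
}
```

In this sketch the round trip through both tables returns the original word, which is the property the paired ICONV and OCONV entries in the example are meant to provide.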
      
      1. LUCENE-5224.patch
        25 kB
        Robert Muir
      2. LUCENE-5224.patch
        23 kB
        Robert Muir
      3. LUCENE-5224.patch
        23 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Patch recognizing ICONV, OCONV, and IGNORE keywords.

        Robert Muir added a comment -

        Oops, set needsInputCleaning and needsOutputCleaning correctly. I think it's ready.

        Robert Muir added a comment -

        When trying to test the Korean dictionary referenced in this issue, I hit some limits (which are out of date, since we switched the internal representation). So this patch adjusts those to reality.

        Later, I'll make a second @Ignore'd test similar to my current "TestAllDictionaries" but using the list from Thunderbird, which has much newer ones than the one referenced from the old OpenOffice link. This way I can ensure newer dictionaries like this one are working, too.

        ASF subversion and git services added a comment -

        Commit 1574135 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1574135 ]

        LUCENE-5224: Add iconv, oconv, and ignore support to HunspellStemFilter

        ASF subversion and git services added a comment -

        Commit 1574143 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1574143 ]

        LUCENE-5224: Add iconv, oconv, and ignore support to HunspellStemFilter

        Uwe Schindler added a comment -

        Close issue after release of 4.8.0


          People

          • Assignee:
            Robert Muir
            Reporter:
            George Rhoten
          • Votes:
            0
            Watchers:
            3
