Lucene - Core / LUCENE-5778

Support hunspell morphological description fields

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Currently the hunspell stemmer doesn't support these fields, in particular the st:XYZ field, which basically marks a stemming "exception".

      For example, in English "feet" might have "st:foot".

      These can be encoded in two ways: inline in the .dic file, or aliased via AM entries in the .aff file.

      Unfortunately, our parsing was really lenient. To do this properly (e.g. handling words with spaces, morphological fields containing slashes, and all that jazz), it had to be cleaned up a bit to follow the hunspell rules.
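To illustrate the parsing problem described above, here is a minimal sketch of splitting one inline .dic line into its entry and its "st:" value. The class and method names (`MorphFieldSketch`, `morphBoundary`, `entry`, `stemException`) are hypothetical and are not taken from the patch. The sketch assumes that morphological fields begin at the first tab, or at the first space followed by two letters and a colon, and it does not resolve AM aliases.

```java
import java.util.Optional;

/** Hypothetical helper for illustration only; not the parser from the patch. */
final class MorphFieldSketch {

  /** Index where the morphological fields start, or line.length() if there are none. */
  static int morphBoundary(String line) {
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '\t') {
        return i; // assumption: a tab always ends the entry
      }
      // assumption: a space only ends the entry if a two-letter tag and ':' follow,
      // so words that contain spaces stay in one piece
      if (c == ' ' && i + 3 < line.length()
          && Character.isLetter(line.charAt(i + 1))
          && Character.isLetter(line.charAt(i + 2))
          && line.charAt(i + 3) == ':') {
        return i;
      }
    }
    return line.length();
  }

  /** The entry (word plus optional /flags) that precedes any morphological fields. */
  static String entry(String line) {
    return line.substring(0, morphBoundary(line));
  }

  /** Value of the first "st:" field, if the line has one. */
  static Optional<String> stemException(String line) {
    String morph = line.substring(morphBoundary(line)).trim();
    if (morph.isEmpty()) {
      return Optional.empty();
    }
    for (String field : morph.split("\\s+")) {
      if (field.startsWith("st:") && field.length() > 3) {
        return Optional.of(field.substring(3));
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(entry("feet st:foot") + " -> " + stemException("feet st:foot"));
    System.out.println(entry("ice cream/S po:noun") + " -> " + stemException("ice cream/S po:noun"));
  }
}
```

Because the boundary is found first, a flag separator would only need to be searched for inside `entry(line)`, so a slash that appears in a morphological field is not mistaken for the start of the flags.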

      For now, we don't waste space on part of speech; we only concern ourselves with the "st:" field, and the stemmer uses it transparently.

      Encoding these exceptions is a little complicated. They are rarely used, but when they are, it is typically for common verbs and the like (such as English 'be'), so we don't want it to be slow.
      They are also not "per-word" but "per-form", so you could have homonyms with different stems (at least theoretically).
      On the other hand, this is silly stuff particular to these silly languages, so we don't want it to blow up the data structure for the 99% of languages that don't use it.

      So the way we do it is to store the exception ID alongside the form ID (this doubles the length of the IntsRef, which is usually 1). For English, for example, I think it typically boils down to an extra byte or so in the FST and doesn't blow up. For languages not using this stuff there is no impact.
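The layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: `FormEncodingSketch`, its `stride` field, and the convention that exception ID 0 means "no st: field" are all hypothetical, and a plain `int[]` stands in for the per-word integer sequence stored in the FST.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of storing an exception ID next to each form ID. */
final class FormEncodingSketch {
  /** 2 ints per form when the dictionary uses st: fields, otherwise 1. */
  final int stride;
  /** Exception stems by ID; slot 0 is reserved for "no exception" (an assumption). */
  final List<String> exceptionTable = new ArrayList<>();

  FormEncodingSketch(boolean hasStemExceptions) {
    this.stride = hasStemExceptions ? 2 : 1;
    exceptionTable.add(null);
  }

  /** Returns the ID for a stem exception, adding it to the table if it is new. */
  int exceptionId(String stem) {
    if (stem == null) {
      return 0;
    }
    int id = exceptionTable.indexOf(stem);
    if (id < 0) {
      exceptionTable.add(stem);
      id = exceptionTable.size() - 1;
    }
    return id;
  }

  /** Encodes the forms of one word; stems[i] is the st: value of form i, or null. */
  int[] encode(int[] formIds, String[] stems) {
    int[] out = new int[formIds.length * stride];
    for (int i = 0; i < formIds.length; i++) {
      out[i * stride] = formIds[i];
      if (stride == 2) {
        out[i * stride + 1] = exceptionId(stems[i]);
      }
    }
    return out;
  }

  /** Stem for one form: the stored exception if there is one, else the fallback. */
  String stemFor(int[] encoded, int form, String fallback) {
    if (stride == 1) {
      return fallback;
    }
    String exception = exceptionTable.get(encoded[form * stride + 1]);
    return exception == null ? fallback : exception;
  }
}
```

With exceptions enabled, a word with a single form encodes to two ints instead of one, which matches the "doubles" remark above; with them disabled, the array keeps its original length. Because the exception ID is attached to each form rather than to the word, two homonymous forms can carry different stems.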

        Activity

        Robert Muir added a comment -

        patch, with tests

        ASF subversion and git services added a comment -

        Commit 1604354 from Robert Muir in branch 'dev/trunk'
        [ https://svn.apache.org/r1604354 ]

        LUCENE-5778: support hunspell morphological description fields

        ASF subversion and git services added a comment -

        Commit 1604355 from Robert Muir in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1604355 ]

        LUCENE-5778: support hunspell morphological description fields


          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
