Lucene - Core / LUCENE-1515

Improved(?) Swedish snowball stemmer

Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

    Description

      The Snowball stemmer for Swedish lacks support for stripping '-an' and '-ans' related suffixes, so related forms end up with incompatible stems, for example "klocka", "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix stripping rules:

      {pre}
      'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
      'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
      'iera'
      (delete){pre}
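
To make the rule list concrete, here is a minimal Python sketch of longest-match suffix deletion over the suffixes listed above. It is not the attached Snowball code: the function name `strip_longest_suffix` and the `min_stem` guard are assumptions made for illustration, and the sketch ignores the region constraints a real Snowball stemmer applies.

```python
# Suffixes copied from the list in the description; everything else here
# (function name, min_stem guard) is a hypothetical illustration.
NEW_SUFFIXES = [
    "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
    "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna", "ansernas",
    "iera",
]


def strip_longest_suffix(word, suffixes=NEW_SUFFIXES, min_stem=1):
    """Delete the longest listed suffix, keeping at least min_stem characters."""
    lowered = word.lower()
    # Try longer suffixes first so "ans" wins over "an" when both could apply.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    for w in ["klockAN", "klockANS", "svans"]:
        print(w, "->", strip_longest_suffix(w))
```

With this naive rule "klockAN" and "klockANS" both reduce to the same string, but so do unrelated words such as "svans", which is why the rules cannot be applied without exceptions.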

      The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny), and this is an attempt at solving that problem. The rules and exceptions are based on the SAOL entries suffixed with 'an' and 'ans'. There are a few known problematic stemming rules, but it seems to work quite a bit better than the current SwedishStemmer. It would not be a bad idea to check all SAOL entries in order to verify the integrity of the rules.
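
One way to picture the exception handling and the suggested word-list check is the Python sketch below. It rests on assumptions: the pairs from the description are read as words that should not be conflated, the words are simply protected from stripping, and only the 'an' and 'ans' suffixes are handled. The names `PROTECTED`, `stem`, and `stem_collisions` are hypothetical and do not come from the attached patch.

```python
from collections import defaultdict

# Words from the example pairs in the description. Treating them as a
# "do not strip" set is an assumption, not the patch's actual mechanism.
PROTECTED = {"svans", "svan", "finans", "fin", "nyans", "ny"}


def stem(word):
    """Strip 'ans' or 'an' unless the word is a protected exception."""
    w = word.lower()
    if w in PROTECTED:
        return w
    for suffix in ("ans", "an"):  # longer suffix first
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)]
    return w


def stem_collisions(words):
    """Group words by stem and return only the stems shared by several words."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w.lower())
    return {s: ws for s, ws in groups.items() if len(ws) > 1}


if __name__ == "__main__":
    print(stem_collisions(["klockan", "klockans", "svans", "svan"]))
```

Running `stem_collisions` over a full word list and reviewing each group by hand would surface both intended merges (the "klock" forms) and unintended ones that need a new exception.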

      My Snowball syntax skills are rather limited so I'm certain the code could be optimized quite a bit.

      The code is released under the BSD license, not the ASL. I've posted a bit in the Snowball forum and privately to Martin Porter himself but never got any response, so now I'm posting it here instead in the hope of gaining some momentum.

      Attachments

        1. LUCENE-1515.txt
          153 kB
          Karl Wettin

          People

            Assignee: Unassigned
            Reporter: Karl Wettin (karl.wettin)