Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The Snowball stemmer for Swedish lacks suffix stripping for the '-an' and '-ans' forms, so inflections of the same word end up with incompatible stems, for example "klocka", "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix stripping rules:

      {pre}
      'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
      'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas'
      'iera'
      (delete)
      {pre}

      The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny), and this is an attempt at solving that problem. The rules and exceptions are based on the SAOL entries suffixed with 'an' and 'ans'. There are a few known problematic stemming rules, but the patch seems to work quite a bit better than the current SwedishStemmer. It would not be a bad idea to check all SAOL entries in order to verify the integrity of the rules.
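
      The idea of stripping the longest matching suffix while guarding a list of exceptions can be sketched roughly as follows. This is only an illustration in Python, not the Snowball code from the attached patch: the function name `strip_suffix`, the `EXCEPTIONS` set (seeded with the three words mentioned above), and the choice to leave exception words unchanged are all assumptions made for the example.

```python
# Suffixes copied from the rule list in this issue.
SUFFIXES = [
    "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
    "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna", "ansernas",
    "iera",
]

# Hypothetical exception set: only the three words named in the issue text.
# The real patch derives its exceptions from SAOL and is far longer.
EXCEPTIONS = frozenset({"svans", "finans", "nyans"})


def strip_suffix(word, exceptions=EXCEPTIONS):
    """Remove the longest matching suffix unless the word is an exception.

    Assumption: exception words are returned unchanged. How the patch
    actually treats them is defined in the attachment, not here.
    """
    word = word.lower()
    if word in exceptions:
        return word
    # Try longer suffixes first so 'ans' wins over 'an' when both could apply.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["klockAN", "klockANS", "finans", "nyans", "svans"]:
        print(w, "->", strip_suffix(w))
```

      With an empty exception set, "finans" would be cut down to "fin", which is the kind of collision the exception list is meant to prevent. The sketch ignores the region (R1) restrictions a real Snowball stemmer applies.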

      My Snowball syntax skills are rather limited so I'm certain the code could be optimized quite a bit.

      The code is released under BSD and not ASL. I've been posting a bit in the Snowball forum and privately to Martin Porter himself but never got any response, so now I'm posting it here instead in the hope of gaining some momentum.

      1. LUCENE-1515.txt
        153 kB
        Karl Wettin

          People

          • Assignee:
            Unassigned
            Reporter:
            Karl Wettin
          • Votes:
            0
            Watchers:
            4
