Lucene - Core
LUCENE-2503

light/minimal stemming for euro languages

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The Snowball stemmers are very aggressive, and it would be nice if there were lighter alternatives.

      Some applications may want to perform less aggressive stemming, for example:
      http://www.lucidimagination.com/search/document/5d16391e21ca6faf/plural_only_stemmer

      Good, relevance-tested algorithms exist, and I think we should provide these alternatives.

      1. LUCENE-2503_modules_analysis_testdata.zip
        1.83 MB
        Robert Muir
      2. LUCENE-2503.patch
        238 kB
        Robert Muir
      3. LUCENE-2503.patch
        179 kB
        Robert Muir

        Activity

        Shai Erera made changes -
        Component/s modules/analysis [ 12310230 ]
        Component/s contrib/analyzers [ 12312333 ]
        Grant Ingersoll made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564268 ] jira [ 12584087 ]
        Mark Thomas made changes -
        Workflow jira [ 12513669 ] Default workflow, editable Closed status [ 12564268 ]
        rmuir committed 964057 (46 files)
        Reviews: none

        LUCENE-2503: add forgotten javadoc/citation (sorry)

        Lucene branch_3x
        rmuir committed 964054 (1 file)
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Robert Muir added a comment -

        Committed revision 964019 (trunk) / 964034 (3x)

        rmuir committed 964034 (140 files)
        Reviews: none

        LUCENE-2503: add light stemmers for european languages

        Lucene branch_3x
        rmuir committed 964019 (95 files)
        Reviews: none

        LUCENE-2503: add light stemmers for european languages

        Lucene trunk
        Robert Muir made changes -
        Robert Muir added a comment -

        Zip file containing the vocabulary test zip files relevant to modules/analysis.

        Robert Muir made changes -
        Attachment LUCENE-2503.patch [ 12449010 ]
        Robert Muir added a comment -

        I updated the patch; I think this is ready to go:

        • Added Finnish.
        • Created vocabulary tests from the reference implementations (C, Perl, whatever), and found and fixed bugs in every language except en, pt, and fr (as promised in my last comment).
        • Created a VocabularyAssert JUnit utility class, and refactored the existing Snowball, Porter, German, and Russian tests to use it too.
        • Refactored a bunch of utility code that was duplicated everywhere, such as endsWith()/delete(), and put it in StemmerUtil.
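
For readers unfamiliar with this kind of helper, here is a minimal sketch of what shared endsWith()/delete() utilities over a char buffer might look like. The class name SuffixUtilSketch and the exact signatures are assumptions made for illustration; this is not the StemmerUtil code from the patch.

```java
// Hypothetical sketch of suffix helpers operating on a char buffer plus a length.
// The class name and signatures are illustrative assumptions, not the patch's code.
final class SuffixUtilSketch {
    private SuffixUtilSketch() {}

    /** Returns true if the first len chars of s end with the given suffix. */
    static boolean endsWith(char[] s, int len, String suffix) {
        int suffixLen = suffix.length();
        if (suffixLen > len) {
            return false;
        }
        for (int i = 0; i < suffixLen; i++) {
            if (s[len - suffixLen + i] != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** Deletes the char at pos by shifting the tail left; returns the new length. */
    static int delete(char[] s, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }
}
```

Working on a buffer and an explicit length lets a stemmer shorten a term in place without allocating a new string for every token.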

        To apply it, first apply the patch itself, then unzip the zip file containing the vocabulary tests (LUCENE-2503_modules_analysis_testdata.zip) from the modules/analysis/common dir.

        If no one objects, I'll commit in a few days.

        Robert Muir added a comment -

        "Man are you fast!"

        Not really; I've been working on it for a while, but since someone asked, I figured I would create the issue.
        Testing isn't done, but I think English, French, and Portuguese are OK.
        The others need a lot of tests and probably have bugs.

        "Does the English one deal with women/woman and foci/focus type stuff?"

        Nope, the English one is the Harman "s-stemming" algorithm.

        It's very simple:

        if final is '-ies' but not '-eies' or '-aies' then
        replace '-ies' by '-y', return;
        if final is '-es' but not '-aes', '-ees' or '-oes' then
        replace '-es' by '-e', return;
        if final is '-s' but not '-us' or '-ss' then
        remove '-s';
        return.
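
The rules above can be sketched in Java roughly as follows. This is a literal transcription of the pseudocode, not the stemmer class from the patch; the class name SStemmerSketch is hypothetical, input is assumed to be already lowercased, and a word that hits an exception (for example '-ees') is assumed to fall through to the next rule.

```java
// Hypothetical, literal transcription of the three s-stemming rules quoted above.
// Assumptions: input is lowercase; an excluded ending falls through to the next rule.
final class SStemmerSketch {
    private SStemmerSketch() {}

    /** Returns the stem of a lowercase word, trying the three rules in order. */
    static String stem(String word) {
        int n = word.length();
        // Rule 1: '-ies' (but not '-eies' or '-aies') becomes '-y'.
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies")) {
            return word.substring(0, n - 3) + "y";
        }
        // Rule 2: '-es' (but not '-aes', '-ees' or '-oes') becomes '-e',
        // which amounts to dropping the final 's'.
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes")) {
            return word.substring(0, n - 1);
        }
        // Rule 3: '-s' (but not '-us' or '-ss') is removed.
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss")) {
            return word.substring(0, n - 1);
        }
        // No rule applies: irregular plurals such as "women" or "foci" are left alone.
        return word;
    }
}
```

With this sketch, stem("ponies") gives "pony", stem("cats") gives "cat", and stem("focus") is returned unchanged because of the '-us' exception.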
        

        For special cases like the ones you mentioned (if you want them), I would recommend adding these customizations yourself,
        as documented here: http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming

        Just make a tab-separated file of words and stems, and put a StemmerOverrideFilter(Factory) before the stemmer in the stream.

        I think this alone provides a lot of flexibility. If it isn't enough, these stemmers are also much simpler to modify, if you want to go that route.
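
To illustrate the override idea in isolation, here is a minimal plain-Java sketch: a tab-separated list of word/stem pairs is consulted first, and only words without an entry reach the stemmer. It does not use the Lucene or Solr APIs; the class name StemOverrideSketch and its methods are hypothetical, and the real StemmerOverrideFilter works on a token stream rather than on single strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of "overrides before the stemmer"; not the Lucene/Solr API.
final class StemOverrideSketch {
    private StemOverrideSketch() {}

    /** Parses lines of the form word<TAB>stem; blank or malformed lines are skipped. */
    static Map<String, String> parse(String tsv) {
        Map<String, String> overrides = new HashMap<>();
        for (String line : tsv.split("\n")) {
            String[] cols = line.trim().split("\t");
            if (cols.length == 2 && !cols[0].isEmpty()) {
                overrides.put(cols[0], cols[1]);
            }
        }
        return overrides;
    }

    /** Returns the override for word if one exists; otherwise applies the stemmer. */
    static String stem(String word, Map<String, String> overrides,
                       UnaryOperator<String> stemmer) {
        String fixed = overrides.get(word);
        return fixed != null ? fixed : stemmer.apply(word);
    }
}
```

For example, with the two entries "women<TAB>woman" and "foci<TAB>focus" loaded, stem("women", overrides, stemmer) returns "woman" without ever calling the stemmer, while any other word is stemmed as usual.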

        Otis Gospodnetic added a comment -

        Man are you fast!
        Does the English one deal with women/woman and foci/focus type stuff?

        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-2503.patch [ 12447384 ]
        Robert Muir added a comment -

        Patch, not ready for committing. Only some of these are ready; the others need tests (I intentionally put a fail() placeholder where they are still untested).

        Also, I haven't implemented the Finnish one yet, but the patch contains various implementations for 9 European languages.

        Robert Muir created issue -

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            0
