  1. Lucene - Core
  2. LUCENE-8876

EnglishMinimalStemmer does not implement s-stemmer paper correctly?

Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • None
    • None
    • modules/analysis
    • None
    • New

    Description

      The EnglishMinimalStemmer fails to stem words ending in "ees", such as bees, trees and employees.

      The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] has this table of rules:

      The notes accompanying the table state:

      "the first applicable rule encountered is the only one used"

       

      For the ees and oes suffixes I think EnglishMinimalStemmer misinterprets the rule logic. It leaves both suffixes intact, so bees is not stemmed to bee and tomatoes is not stemmed to tomato.

      "The first applicable rule" for ees could be read as rule 2 or rule 3 in the table, depending on whether you take "applicable" to mean "the THEN part of the rule has fired" or merely "the suffix is mentioned in the rule". EnglishMinimalStemmer assumes the latter; I think it should be the former. Under that reading, ees and oes are excluded from rule 2 and should fall through to rule 3 (remove any trailing S). That is also the conclusion I reached independently when testing on real data.
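
      To make the two readings concrete, here is a minimal sketch in plain Java. It is not the Lucene source: the class and method names are hypothetical, and the three rules are written out as I recall them from the S-stemmer paper (the table itself is not reproduced in this report), so treat them as an assumption to check against the paper.

```java
// Hypothetical illustration only; this is NOT Lucene's EnglishMinimalStemmer.
// Assumed rules (as recalled from the S-stemmer paper; verify against the paper):
//   1. ends in "ies" but not "eies" or "aies"      : "ies" -> "y"
//   2. ends in "es" but not "aes", "ees" or "oes"  : "es"  -> "e"
//   3. ends in "s" but not "us" or "ss"            : "s"   -> ""
final class SStemmerSketch {

    private SStemmerSketch() {}

    /**
     * Reading A (the one proposed in this report): a rule is "applicable"
     * only when its whole IF part holds, so a word excluded by rule 2
     * falls through to rule 3.
     */
    static String stemFallThrough(String w) {
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1); // "es" -> "e"
        }
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1); // drop the trailing "s"
        }
        return w;
    }

    /**
     * Reading B (the behaviour described in this report): a rule counts as
     * "applicable" as soon as its suffix matches, so a word hitting one of
     * the rule's exceptions is left unchanged and no later rule is tried.
     */
    static String stemStopAtFirstSuffix(String w) {
        if (w.endsWith("ies")) {
            if (w.endsWith("eies") || w.endsWith("aies")) {
                return w;
            }
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("es")) {
            if (w.endsWith("aes") || w.endsWith("ees") || w.endsWith("oes")) {
                return w; // exception stops processing: "bees" stays "bees"
            }
            return w.substring(0, w.length() - 1);
        }
        if (w.endsWith("s")) {
            if (w.endsWith("us") || w.endsWith("ss")) {
                return w;
            }
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```

      With these assumed rules, stemFallThrough("bees") gives "bee" while stemStopAtFirstSuffix("bees") leaves "bees" unchanged. Note that falling through to rule 3 only strips the final s, so stemFallThrough("tomatoes") gives "tomatoe" rather than "tomato".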

      There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be making the code here match the original paper it references.

          People

            Assignee: Unassigned
            Reporter: Mark Harwood (mharwood)