LUCENE-8876 (Lucene - Core)

EnglishMinimalStemmer does not implement s-stemmer paper correctly?

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The EnglishMinimalStemmer fails to stem words ending in "ees", such as bees, trees and employees.

      The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] defines the stemmer with a table of rules (the table itself is not reproduced here).

      The notes accompanying the table state:

      "the first applicable rule encountered is the only one used"

       

      For the ees and oes suffixes I think EnglishMinimalStemmer misinterprets the rule logic: the oes and ees suffixes are left intact, so bees is not reduced to bee and tomatoes is not reduced to tomato.

      "The first applicable rule" for ees could be interpreted as rule 2 or rule 3 in the table, depending on whether you take "applicable" to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer has assumed the latter and I think it should be the former: we should fall through into rule 3 for ees and oes (remove any trailing s). That's certainly the conclusion I came to independently when testing on real data.

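      The two readings can be compared with a minimal sketch. The rule conditions below are an assumption: the table is not shown in this report, so they follow the S-stemmer rules as they are commonly cited from the paper. The class and method names are hypothetical, and the second method only approximates the behaviour described in this issue; it is not the Lucene source.

```java
final class SStemmerSketch {

    /** Reading A: a rule applies only if its whole IF condition holds; otherwise the next rule is tried. */
    static String stemFallThrough(String w) {
        // Rule 1 (assumed): "ies" -> "y", unless the word ends in "eies" or "aies".
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";
        }
        // Rule 2 (assumed): "es" -> "e", unless the word ends in "aes", "ees" or "oes".
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);
        }
        // Rule 3 (assumed): drop a trailing "s", unless the word ends in "us" or "ss".
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Reading B: a rule is "applicable" as soon as its suffix matches; an exception then ends processing. */
    static String stemFirstSuffixMatch(String w) {
        if (w.endsWith("ies")) {
            if (w.endsWith("eies") || w.endsWith("aies")) {
                return w; // exception hit: word left intact, later rules never tried
            }
            return w.substring(0, w.length() - 3) + "y";
        }
        if (w.endsWith("es")) {
            if (w.endsWith("aes") || w.endsWith("ees") || w.endsWith("oes")) {
                return w; // this is where "bees" and "tomatoes" get stuck
            }
            return w.substring(0, w.length() - 1);
        }
        if (w.endsWith("s")) {
            if (w.endsWith("us") || w.endsWith("ss")) {
                return w;
            }
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        String[] words = {"bees", "trees", "employees", "tomatoes", "ponies", "horses", "glass"};
        for (String w : words) {
            System.out.println(w + " -> A: " + stemFallThrough(w) + ", B: " + stemFirstSuffixMatch(w));
        }
    }
}
```

      In this sketch reading A reduces bees, trees and employees to bee, tree and employee, while reading B leaves them unchanged. Under reading A, tomatoes comes out as "tomatoe" rather than "tomato", because rule 3 only drops the final s.
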
      There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be making the code here match the original paper it references.

            People

            • Assignee:
              Unassigned
            • Reporter:
              Mark Harwood (mharwood)