Details
-
Bug
-
Status: Open
-
Minor
-
Resolution: Unresolved
-
None
-
None
-
None
-
New
Description
The EnglishMinimalStemmer fails to stem ees suffixes like bees, trees and employees.
The [original paper|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9828&rep=rep1&type=pdf] has this table of rules:
The notes accompanying the table state :
"the first applicable rule encountered is the only one used"
For the ees and oes suffixes I think EnglishMinimalStemmer misinterpreted the rule logic and consequently bees != bee and tomatoes != tomato. The oes and ees suffixes are left intact.
"The first applicable rule" for ees could be interpreted as rule 2 or 3 in the table depending on if you take applicable to mean "the THEN part of the rule has fired" or just that the suffix was referenced in the rule. EnglishMinimalStemmer has assumed the latter and I think it should be the former. We should fall through into rule 3 for ees and oes (remove any trailing S). That's certainly the conclusion I came to independently testing on real data.
There are some additional changes I'd like to see in a plural stemmer but I won't list them here - the focus should be making the code here match the original paper it references.