[LUCENE-2198] support protected words in Stemming TokenFilters - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0
Fix Version/s: 4.0-ALPHA
Component/s: modules/analysis
Labels: None
Lucene Fields: New, Patch Available

Description

This is from LUCENE-1515

I propose that all stemming TokenFilters accept an 'exclusion set': any word in this set bypasses stemming entirely.
Currently some stemming TokenFilters have this and some do not.

This would be one way for Karl to implement his new Swedish stemmer (as a text file of ignore words).
Additionally, it would remove duplication between Lucene and Solr: Solr reimplements SnowballFilter because it lacks this functionality.
Finally, I think this is a pretty common use case, where people want to exclude things like proper nouns from stemming.
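
As a rough illustration of the proposal, the sketch below shows the exclusion-set check in plain Java. It does not use Lucene's API: `ProtectedStemSketch`, `toyStem`, and `stemAll` are hypothetical names, a `java.util.Set<String>` stands in for CharArraySet, and the suffix-stripping rule is a toy stand-in for a real stemmer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Lucene code: a Set<String> stands in for CharArraySet.
public class ProtectedStemSketch {

    // Toy stemmer (assumption for illustration): strips a trailing "ing" or "s".
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Words found in the exclusion set pass through unchanged; all others are stemmed.
    static List<String> stemAll(List<String> tokens, Set<String> exclusionSet) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(exclusionSet.contains(token) ? token : toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> exclusionSet = Set.of("Hastings");
        System.out.println(stemAll(List.of("running", "cats", "Hastings"), exclusionSet));
    }
}
```

With an empty exclusion set the toy stemmer would turn the proper noun "Hastings" into "Hasting"; listing it in the set leaves it untouched while the other tokens are still stemmed.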

As an alternative design, I considered generalizing this to a CharArrayMap (ignoring a word would mean mapping it to itself), which would also provide a mechanism to override the stemming algorithm. But I think this is too expert, could be its own filter, and the only example of this I can find is in the Dutch stemmer.
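
For comparison, the map-based alternative could look roughly like the following. This is again a plain-Java sketch under assumptions, not Lucene code: `OverrideMapSketch`, `toyStem`, and `stem` are hypothetical names, a `java.util.Map<String, String>` stands in for CharArrayMap, and the example entries are made up.

```java
import java.util.Map;

// Hypothetical sketch, not Lucene code: a Map<String, String> stands in for CharArrayMap.
public class OverrideMapSketch {

    // Toy stemmer (assumption for illustration): strips a trailing "ing" or "s".
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // A mapped word takes the mapped value instead of the stemmer's output.
    // Mapping a word to itself therefore protects it from stemming.
    static String stem(String token, Map<String, String> overrides) {
        String mapped = overrides.get(token);
        return mapped != null ? mapped : toyStem(token);
    }

    public static void main(String[] args) {
        Map<String, String> overrides = Map.of(
                "Hastings", "Hastings", // identity mapping: ignore this word
                "mice", "mouse");       // explicit override of the algorithm
        for (String token : new String[] {"Hastings", "mice", "cats"}) {
            System.out.println(token + " -> " + stem(token, overrides));
        }
    }
}
```

The map subsumes the exclusion set, but every ignored word needs a redundant identity entry, which is part of why the simpler set looks sufficient for the common case.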

So I think we should just provide the ignore set as a CharArraySet, but if you feel otherwise, please comment.

Attachments

LUCENE-2198.patch, 17/Jan/10 15:44, 74 kB, Simon Willnauer
LUCENE-2198.patch, 13/Jan/10 18:49, 16 kB, Simon Willnauer

Issue Links

is depended upon by: LUCENE-2055 Fix buggy stemmers and Remove duplicate analysis functionality (Reopened)

People

Assignee: Uwe Schindler

Reporter: Robert Muir

Dates

Created: 08/Jan/10 17:38

Updated: 28/Aug/22 12:18

Resolved: 27/Jan/10 11:19