Description
StopFilterFactory supports a "format" option that controls whether "getWordSet" or "getSnowballWordSet" is used to parse the words file, but this option is not advertised. People can be confused by the example stopword files included in the releases, some of which are in the snowball format with "|" comments. If they use those files without explicitly specifying format="snowball", they silently get useless stopwords that include the "| comments" as literal portions of the stopwords.
We need to better document the use of "format", and we should consider adding a note to every example stopword file we ship in the snowball format saying that it must be used with format="snowball".
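For illustration, a schema entry that opts in to the snowball parser might look like the fragment below. This is only a sketch: the field type name and the words file path are placeholders, not values taken from any shipped example.

```xml
<!-- Hypothetical field type; the name and the words path are placeholders. -->
<fieldType name="text_fr_example" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- format="snowball" tells the factory to treat "|" as a comment marker -->
    <filter class="solr.StopFilterFactory" words="stopwords_fr.txt"
            format="snowball" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Without the format attribute, the same file would be read one entry per line, comments included.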
The StopFilterFactory builds a CharArraySet directly from the raw lines of the supplied words file. This causes a problem when using the stop word files supplied with the Solr/Lucene distribution. In particular, the comments in those files get added to the CharArraySet. A line like this...
ceci | this
should result in the string "ceci" being added to the CharArraySet, but "ceci | this" is what actually gets added.
Workaround: Remove all comments from stop word files you are using.
Suggested fix: The StopFilterFactory should strip any comments, then strip trailing whitespace. The stop word files supplied with the distribution should be edited to conform to the supported comment format.
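The suggested fix could be sketched roughly as follows. This is plain Java with no Lucene dependency, not the actual factory code: the class name StopwordLines and the method stripComment are hypothetical, and the sketch assumes "|" is the only comment marker and that each line holds at most one stop word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or Solr.
class StopwordLines {

    // Drops everything from the first '|' onward, then trims the
    // remaining whitespace. Returns "" for blank or comment-only lines.
    static String stripComment(String line) {
        int comment = line.indexOf('|');
        String content = (comment >= 0) ? line.substring(0, comment) : line;
        return content.trim();
    }

    // Applies stripComment to every line and keeps only non-empty results.
    static List<String> parse(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String word = stripComment(line);
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            " | a comment-only line",
            "ceci           |  this",
            "",
            "cela");
        System.out.println(parse(lines)); // prints [ceci, cela]
    }
}
```

With this handling, the line "ceci | this" contributes only "ceci", and comment-only or blank lines contribute nothing.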