Details
Bug
Status: Resolved
Normal
Resolution: Fixed
None
Cassandra 3.7, Cassandra 3.8
Normal
Description
Right now, if both skip stop words and stemming are enabled, SASI puts the stemming filter in the pipeline BEFORE skip_stop_words:
private FilterPipelineTask getFilterPipeline()
{
    FilterPipelineBuilder builder = new FilterPipelineBuilder(new BasicResultFilters.NoOperation());
    ...
    if (options.shouldStemTerms())
        builder = builder.add("term_stemming", new StemmingFilters.DefaultStemmingFilter(options.getLocale()));
    if (options.shouldIgnoreStopTerms())
        builder = builder.add("skip_stop_words", new StopWordFilters.DefaultStopWordFilter(options.getLocale()));
    return builder.build();
}
The problem is that stemming before removing stop words can yield wrong results.
I have an example:
SELECT * FROM music.albums WHERE country='France' AND title LIKE 'danse' ALLOW FILTERING;
Because of stemming, danse ("dance" in English) becomes dans (the final vowel is removed). Then the skip_stop_words filter is applied. Unfortunately, dans ("in" in English) is a stop word in French, so the term is removed completely.
In the end, the query is equivalent to SELECT * FROM music.albums WHERE country='France', and of course the results are wrong.
Attached is a trivial patch to move the skip_stop_words filter BEFORE the stemming filter.
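To illustrate why the order matters, here is a minimal toy sketch of the two orderings. It is not SASI code: the class FilterOrderSketch, the one-word stop list, and the "strip a trailing e" stemmer are all assumptions made up for this example, standing in for the real French stop-word list and stemmer.

```java
import java.util.Optional;
import java.util.Set;

// Toy model only: none of these names exist in Cassandra or SASI.
final class FilterOrderSketch {
    // Hypothetical stop-word list holding just the word from the example above.
    static final Set<String> FRENCH_STOP_WORDS = Set.of("dans");

    // Hypothetical stemmer: strips one trailing 'e', mimicking danse -> dans.
    static Optional<String> stem(String term) {
        return Optional.of(term.endsWith("e") ? term.substring(0, term.length() - 1) : term);
    }

    // Drops the term (empty Optional) if it is a stop word, otherwise passes it through.
    static Optional<String> skipStopWords(String term) {
        return FRENCH_STOP_WORDS.contains(term) ? Optional.empty() : Optional.of(term);
    }

    // Order described in this report: stemming first, then stop-word removal.
    static Optional<String> stemThenSkip(String term) {
        return stem(term).flatMap(FilterOrderSketch::skipStopWords);
    }

    // Order proposed by the patch: stop-word removal first, then stemming.
    static Optional<String> skipThenStem(String term) {
        return skipStopWords(term).flatMap(FilterOrderSketch::stem);
    }

    public static void main(String[] args) {
        System.out.println(stemThenSkip("danse")); // prints "Optional.empty"
        System.out.println(skipThenStem("danse")); // prints "Optional[dans]"
    }
}
```

With stemming first, the search term disappears entirely, which matches the behaviour described above. With stop-word removal first, the term survives and is stemmed to dans.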
/cc Pavel Yaskevich [~jrwest] [~beobal]