Details
- Improvement
- Status: Resolved
- Trivial
- Resolution: Duplicate
- 7.1
- None
Description
ShingleFilterFactory should have an option to ignore filler tokens in the total shingle size.
For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". After stopword removal and stemming, running the ShingleFilter with a shingle size of 3 produces the following shingles, where "_" is the filler token:
1. _ brown fox
2. brown fox quickly
3. fox quickly jump
4. quickly jump _
5. jump _ _
6. _ _ lazy
7. _ lazy dog
Each filler token "_" occupies one token position in the shingle, so it counts toward the total shingle size.
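The behavior above can be modeled in a few lines of plain Java. This is only a sketch to illustrate the counting, not the actual ShingleFilter code: the class name FillerShingleSketch and its shingles() method are hypothetical, and the position increments are my assumption of what the stop filter leaves behind (2 for "brown" because "A" was removed, 3 for "lazy" because "over" and "the" were removed).

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Lucene code) of fixed-size shingling over a stream with position gaps. */
final class FillerShingleSketch {

    /**
     * Expands each position gap into filler slots, then slides a window of
     * {@code size} slots over the result. posIncs[i] is the number of positions
     * terms[i] advanced; a value of 2 means one removed token precedes it.
     */
    static List<String> shingles(String[] terms, int[] posIncs, int size, String filler) {
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int gap = 1; gap < posIncs[i]; gap++) {
                slots.add(filler); // one filler slot per removed position
            }
            slots.add(terms[i]);
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= slots.size(); start++) {
            out.add(String.join(" ", slots.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // "A brown fox quickly jumps over the lazy dog" after stopword removal and stemming.
        String[] terms = {"brown", "fox", "quickly", "jump", "lazy", "dog"};
        int[] posIncs = {2, 1, 1, 1, 3, 1}; // assumed increments, see the note above
        for (String shingle : shingles(terms, posIncs, 3, "_")) {
            System.out.println(shingle);
        }
    }
}
```

Because every removed position becomes a slot, a window of three slots can contain as little as one real term, which is where "jump _ _" and "_ _ lazy" come from.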
With filler tokens skipped, I would expect the returned shingles to be:
1. brown fox quickly
2. fox quickly jump
3. quickly jump lazy
4. jump lazy dog
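To make the expected result concrete, here is a minimal plain-Java sketch of the skipping behavior. It is not the attached patch and does not touch ShingleFilter: the class SkipFillerSketch and its shingles() method are hypothetical, the skipFillerTokens flag simply mirrors the option name suggested in this issue, and the position increments are assumed values for the example sentence.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Lucene code) contrasting filler slots with skipped fillers. */
final class SkipFillerSketch {

    /**
     * Builds shingles of exactly {@code size} slots. When skipFillerTokens is false,
     * every position gap is expanded into filler slots; when it is true, gaps are
     * ignored and only real terms count toward the shingle size.
     */
    static List<String> shingles(String[] terms, int[] posIncs, int size,
                                 String filler, boolean skipFillerTokens) {
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (!skipFillerTokens) {
                for (int gap = 1; gap < posIncs[i]; gap++) {
                    slots.add(filler); // filler occupies a slot in the shingle
                }
            }
            slots.add(terms[i]);
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= slots.size(); start++) {
            out.add(String.join(" ", slots.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Example sentence after stopword removal and stemming; increments are assumed.
        String[] terms = {"brown", "fox", "quickly", "jump", "lazy", "dog"};
        int[] posIncs = {2, 1, 1, 1, 3, 1};
        for (String shingle : shingles(terms, posIncs, 3, "_", true)) {
            System.out.println(shingle);
        }
    }
}
```

With the flag set to true the sketch yields the four shingles listed above; with it set to false it falls back to the seven filler-padded shingles, which is the backward-compatible default I have in mind.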
To maintain backward compatibility, I suggest adding an option called "skipFillerTokens" to enable this behavior. Note that this is different from setting fillerToken="", since the empty string still occupies one token position in the shingle.
I will attach a patch for the ShingleFilter class (getNextToken() method).
Issue Links
- is a clone of LUCENE-8036 ShingleFilter should have an option to skip filler tokens (e.g. stop words) (Patch Available)