  1. Lucene - Core
  2. LUCENE-8036

ShingleFilter should have an option to skip filler tokens (e.g. stop words)

    Details

    • Lucene Fields:
      New

      Description

      ShingleFilterFactory should have an option to exclude filler tokens from the shingle size count.
      For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". After stop words are removed and the remaining terms are stemmed, ShingleFilter (shingle size = 3) produces the following shingles:

      1. _ brown fox
      2. brown fox quickly
      3. fox quickly jump
      4. quickly jump _
      5. jump _ _
      6. _ _ lazy
      7. _ lazy dog

      Each removed stop word leaves a filler token "_" that occupies one position in the shingle.
      I would expect the returned shingles to be:
      1. brown fox quickly
      2. fox quickly jump
      3. quickly jump lazy
      4. jump lazy dog

      To maintain backward compatibility, I suggest adding an option called "skipFillerTokens" to enable this behavior. Note that this is different from setting fillerToken="", since the empty string still occupies one position in the shingle.
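
      The difference between the two listings above can be sketched without Lucene at all. The class below is a hypothetical, self-contained illustration, not the attached patch and not ShingleFilter's actual code: ShingleSketch and shingles() are made-up names, and the position increments are assumed to be what a stop filter would leave behind for this sentence (2 for "brown" after the removed "A", 3 for "lazy" after the removed "over the").

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {

    /**
     * Hypothetical helper (not part of Lucene): builds fixed-size shingles from
     * terms and their position increments. A position increment greater than 1
     * means that (posInc - 1) tokens were removed before the term.
     */
    public static List<String> shingles(String[] terms, int[] posIncs, int size,
                                        String filler, boolean skipFillerTokens) {
        // Lay the terms out as a sequence of slots, optionally padding gaps.
        List<String> slots = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (!skipFillerTokens) {
                // Current behavior: every removed token is replaced by a filler.
                for (int gap = 1; gap < posIncs[i]; gap++) {
                    slots.add(filler);
                }
            }
            // Proposed behavior: gaps are ignored, so terms become adjacent.
            slots.add(terms[i]);
        }

        // Slide a window of `size` slots over the sequence.
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= slots.size(); start++) {
            out.add(String.join(" ", slots.subList(start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // "A brown fox quickly jumps over the lazy dog" after stop word removal
        // and stemming; the increments are an assumption made for this sketch.
        String[] terms = {"brown", "fox", "quickly", "jump", "lazy", "dog"};
        int[] posIncs = {2, 1, 1, 1, 3, 1};

        System.out.println(shingles(terms, posIncs, 3, "_", false)); // with fillers
        System.out.println(shingles(terms, posIncs, 3, "_", true));  // fillers skipped
    }
}
```

      With skipFillerTokens set to false the sketch yields the seven shingles of the first listing; with true it yields the four shingles of the second. A real implementation would also have to decide how token offsets and position lengths are reported for shingles that span a removed token, which this sketch does not model.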

      I've attached a patch for the ShingleFilter class (getNextToken() method), ShingleFilterFactory, and the ShingleFilterTest class.

        Attachments

        1. SOLR-11604.patch
          7 kB
          Edans Sandes

              People

              • Assignee: Unassigned
              • Reporter: Edans Sandes (edans.sandes)
              • Votes: 0
              • Watchers: 4

                  Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified