Lucene - Core / LUCENE-8036

ShingleFilter should have an option to skip filler tokens (e.g. stop words)

Details

    • New

    Description

      ShingleFilterFactory should have an option to ignore filler tokens in the total shingle size.
      For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". After stop words are removed and the tokens are stemmed, ShingleFilter (shingle size = 3) produces the following shingles:

      1. _ brown fox
      2. brown fox quickly
      3. fox quickly jump
      4. quickly jump _
      5. jump _ _
      6. _ _ lazy
      7. _ lazy dog

      We can clearly see that the filler token "_" occupies one token in the shingle.
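The listing above can be reproduced with a small plain-Python model. This is only a sketch of the behavior described in this issue, not Lucene code: the stop set, the pre-stemmed token list, and the helper names `expand_with_fillers` and `shingles` are all assumptions made for illustration.

```python
FILLER = "_"
# Assumed stop set for this example; a real StopFilter configuration may differ.
STOP_WORDS = {"a", "over", "the"}


def expand_with_fillers(tokens, stop_words, filler=FILLER):
    """Put a filler where each stop word was removed, modelling the position gap."""
    return [filler if token in stop_words else token for token in tokens]


def shingles(stream, size):
    """Return every window of `size` consecutive entries, joined by a space."""
    return [" ".join(stream[i:i + size]) for i in range(len(stream) - size + 1)]


# Lowercased and stemmed form of "A brown fox quickly jumps over the lazy dog".
tokens = ["a", "brown", "fox", "quickly", "jump", "over", "the", "lazy", "dog"]

current = shingles(expand_with_fillers(tokens, STOP_WORDS), 3)
for shingle in current:
    print(shingle)
```

In this model every removed stop word still takes up one slot in the window, which is why five of the seven shingles contain at least one "_".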
      I suppose the returned shingles should be:
      1. brown fox quickly
      2. fox quickly jump
      3. quickly jump lazy
      4. jump lazy dog

      To maintain backward compatibility, I suggest adding an option called "skipFillerTokens" that enables this behavior. Note that this is different from setting fillerToken="", since the empty string still occupies one token position in the shingle.
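A rough plain-Python sketch of the difference between an empty filler and skipping fillers altogether. It is not the attached patch and does not use the Lucene API: the function `shingle_stream` and its `skip_filler_tokens` parameter are hypothetical stand-ins for the proposed "skipFillerTokens" option, and the stop set is assumed.

```python
def shingle_stream(tokens, stop_words, size, filler="_", skip_filler_tokens=False):
    """Build shingles of `size` tokens from a stream with stop words removed.

    skip_filler_tokens is a hypothetical flag modelling the proposed option:
    when True, removed stop words leave no slot in the shingle window.
    """
    stream = []
    for token in tokens:
        if token not in stop_words:
            stream.append(token)
        elif not skip_filler_tokens:
            # Default behavior in this model: the gap still occupies one slot.
            stream.append(filler)
    return [" ".join(stream[i:i + size]) for i in range(len(stream) - size + 1)]


# Assumed stop set and pre-stemmed tokens for the example sentence.
stop_words = {"a", "over", "the"}
tokens = ["a", "brown", "fox", "quickly", "jump", "over", "the", "lazy", "dog"]

# An empty filler changes the shingle text but not the number of slots used.
empty_filler = shingle_stream(tokens, stop_words, 3, filler="")

# Skipping fillers lets the window span the gap left by the stop words.
skipped = shingle_stream(tokens, stop_words, 3, skip_filler_tokens=True)
```

With the empty filler the model still yields seven shingles, some containing only one or two real words, whereas skipping fillers yields the four three-word shingles proposed in this issue.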

      I've attached a patch for the ShingleFilter (getNextToken() method), ShingleFilterFactory and ShingleFilterTest classes.

      Attachments

        1. SOLR-11604.patch
          7 kB
          Edans Sandes

            People

              Assignee: Unassigned
              Reporter: Edans Sandes
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: 2h
                  Remaining: 2h
                  Logged: Not Specified