SOLR-11605

ShingleFilter should have an option to skip filler tokens (e.g. stop words)

Details

    Description

      ShingleFilterFactory should have an option to exclude filler tokens from the shingle size count.
      For instance (adapted from https://stackoverflow.com/questions/33193144/solr-stemming-stop-words-and-shingles-not-giving-expected-outputs), consider the text "A brown fox quickly jumps over the lazy dog". When we remove stop words (the linked question also applies stemming, hence "jump" below) and then run the ShingleFilter (shingle size = 3), we get the following result:

      1. _ brown fox
      2. brown fox quickly
      3. fox quickly jump
      4. quickly jump _
      5. jump _ _
      6. _ _ lazy
      7. _ lazy dog

      Each filler token "_" clearly occupies one position in the shingle, so a 3-token shingle can contain fewer than three real terms.
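
      The listing above can be reproduced with a small model of the current behavior. This is only a sketch in plain Python, not Lucene's implementation: the function name `shingles_with_fillers` is hypothetical, and removed stop words are represented as `None` to stand for position gaps.

```python
def shingles_with_fillers(tokens, size, filler="_"):
    """Slide a fixed-size window over token positions.

    `tokens` holds one entry per original position; None marks a
    removed stop word, which is rendered as the filler token.
    """
    filled = [filler if t is None else t for t in tokens]
    return [" ".join(filled[i:i + size])
            for i in range(len(filled) - size + 1)]


# "A brown fox quickly jumps over the lazy dog" after stop-word
# removal and stemming; None marks each removed position.
tokens = [None, "brown", "fox", "quickly", "jump", None, None, "lazy", "dog"]
current = shingles_with_fillers(tokens, 3)
for shingle in current:
    print(shingle)
```

      Because the window is counted in positions rather than in surviving terms, every gap consumes one of the three slots.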
      I suppose the returned shingles should be:
      1. brown fox quickly
      2. fox quickly jump
      3. quickly jump lazy
      4. jump lazy dog

      To maintain backward compatibility, I suggest adding an option called "skipFillerTokens" to enable this behavior. Note that this is different from setting fillerToken="", since the empty string still occupies one position in the shingle.
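
      The difference between skipping fillers and using an empty filler can be sketched with the same kind of simplified model. Everything here is an assumption made for illustration: the helper names are hypothetical, `None` stands for a removed stop word, and the empty-filler case shows only this model's output, not necessarily the exact strings Lucene emits.

```python
def window(items, size):
    """Return all space-joined runs of `size` consecutive items."""
    return [" ".join(items[i:i + size])
            for i in range(len(items) - size + 1)]


def shingles_skipping_fillers(tokens, size):
    """Proposed behavior: drop the gaps first, then build shingles."""
    return window([t for t in tokens if t is not None], size)


def shingles_with_filler(tokens, size, filler):
    """Current behavior: each gap keeps its position, rendered as `filler`."""
    return window([filler if t is None else t for t in tokens], size)


tokens = [None, "brown", "fox", "quickly", "jump", None, None, "lazy", "dog"]

proposed = shingles_skipping_fillers(tokens, 3)
with_empty_filler = shingles_with_filler(tokens, 3, "")

print(proposed)
print(len(with_empty_filler))  # gaps still take up window slots
```

      In this model the empty filler still yields seven shingles, several of them with fewer than three real terms, whereas skipping the gaps yields the four shingles listed above.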

      I will attach a patch for the ShingleFilter class (getNextToken() method).

            People

              Assignee: Unassigned
              Reporter: Edans Sandes (edans.sandes)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Original Estimate: 4h
                  Remaining Estimate: 4h
                  Time Spent: Not Specified