Solr / SOLR-10423

ShingleFilter causes overly restrictive queries to be produced

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.5
    • Fix Version/s: 6.5.1, 6.6, 7.0
    • Component/s: query parsers
    • Labels: None

    Description

      When sow=false and ShingleFilter is included in the query analyzer, QueryBuilder produces queries that inappropriately require sequential terms. E.g. the query "A B C" produces (+A_B +B_C) A_B_C when the query analyzer includes <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="false" tokenSeparator="_"/>.
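A minimal sketch of how that clause structure can arise, assuming a simplified model in which a shingle of n words starting at word i is an edge from position i to position i + n - 1, and every path spanning the whole query becomes a clause whose terms are all required. This is not Lucene's QueryBuilder code; `shingle_paths` and `render` are hypothetical helpers written only for illustration.

```python
def shingle_paths(words, min_size=2, max_size=4, sep=" "):
    """Enumerate shingle sequences that span the whole query.

    Simplified model (an assumption, not Lucene's implementation): a shingle
    of n words starting at word i is an edge from position i to i + n - 1.
    """
    if len(words) < min_size:
        return []  # no shingles at all when unigrams are not output
    last = len(words) - 1
    paths = []

    def walk(pos, path):
        if pos == last:
            paths.append(path)
            return
        for n in range(min_size, max_size + 1):
            end = pos + n - 1
            if end <= last:
                walk(end, path + [sep.join(words[pos:pos + n])])

    walk(0, [])
    return paths


def render(paths):
    """Format paths the way the report writes them: multi-term paths
    become a group of required (+) terms, single-term paths stay bare."""
    parts = []
    for path in paths:
        if len(path) == 1:
            parts.append(path[0])
        else:
            parts.append("(" + " ".join("+" + term for term in path) + ")")
    return " ".join(parts)


print(render(shingle_paths("A B C".split(), max_size=3, sep="_")))
# prints: (+A_B +B_C) A_B_C
```

Under the same model, `shingle_paths("one plus one four".split())` yields four paths that line up with the four DisjunctionMaxQuery clauses in the parsed query quoted in the report.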

      Aman Deep Singh reported this problem on the solr-user list. From http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201703.mbox/%3cCANEGTX9BwBPwQc-cXieAc7QSAS7x2TgZovOMy5ZTiAgco1p11Q@mail.gmail.com%3e:

      I was trying to use the shingle filter, but it was not creating the query
      I expected.

      my schema is

      <fieldType name="cust_shingle" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.ShingleFilterFactory" outputUnigrams="false" maxShingleSize="4"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      <field name="nameShingle" type="cust_shingle" indexed="true" stored="true"/>
      

      my solr query is

      http://localhost:8983/solr/productCollection/select?
       defType=edismax
      &debugQuery=true
      &q=one%20plus%20one%20four
      &qf=nameShingle
      &sow=false
      &wt=xml
      

      and it was creating the parsed query as

      <str name="parsedquery">
      (+(DisjunctionMaxQuery(((+nameShingle:one plus +nameShingle:plus one
      +nameShingle:one four))) DisjunctionMaxQuery(((+nameShingle:one plus
      +nameShingle:plus one four))) DisjunctionMaxQuery(((+nameShingle:one plus one +nameShingle:one four))) DisjunctionMaxQuery((nameShingle:one plus one four)))~1)/no_coord
      </str>
      <str name="parsedquery_toString">
      +((((+nameShingle:one plus +nameShingle:plus one +nameShingle:one four))
      ((+nameShingle:one plus +nameShingle:plus one four)) ((+nameShingle:one
      plus one +nameShingle:one four)) (nameShingle:one plus one four))~1)
      </str>
      

      So the token creation is fine, but the query uses the boolean + operator, which causes the problem: if I have a document with the name "one plus one", it should match according to the shingles, since its tokens will be ("one plus", "one plus one", "plus one").

      I have tried using q.op and played around with mm as well, but nothing is
      giving me the correct response.

      Any idea how I can fetch that document even if the document is missing a
      token?

      My expected response is to get the document "one plus one" even if the user query has additional terms, like "one plus one two" and so on.
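The mismatch described in the report can be checked by hand. The sketch below assumes a plain shingle generator (the hypothetical helper `shingles`, not the Solr filter itself) and copies the four clauses from the parsed query above. It compares the reported behavior, where every term in a clause is required, with the behavior the reporter expected, where terms are optional.

```python
def shingles(words, min_size=2, max_size=4, sep=" "):
    """Return the set of word shingles, with no unigrams (illustrative helper)."""
    out = set()
    for i in range(len(words)):
        for n in range(min_size, max_size + 1):
            if i + n <= len(words):
                out.add(sep.join(words[i:i + n]))
    return out


# Index-side tokens for the document named "one plus one".
doc_tokens = shingles("one plus one".split())

# Clauses copied from the parsed query for q=one plus one four.
clauses = [
    ["one plus", "plus one", "one four"],
    ["one plus", "plus one four"],
    ["one plus one", "one four"],
    ["one plus one four"],
]

# Reported behavior: each clause requires all of its terms (+),
# and at least one clause must match (~1).
required_match = any(all(t in doc_tokens for t in c) for c in clauses)

# Expected behavior: a clause matches if any of its terms is present.
optional_match = any(any(t in doc_tokens for t in c) for c in clauses)

print(sorted(doc_tokens))  # ['one plus', 'one plus one', 'plus one']
print(required_match)      # False
print(optional_match)      # True
```

Every clause contains at least one shingle that includes "four", so no clause can be satisfied by this document while its terms are all required.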

      Attachments

        1. SOLR-10423.patch
          15 kB
          Steven Rowe

          People

            Assignee: Steven Rowe (sarowe)
            Reporter: Steven Rowe (sarowe)