SOLR-6243

eDisMax hidden change - no longer applies disjunction max to "pf" query


Details

    Description

      At some point after Solr 3.5, a bug was introduced into eDisMax (the Extended DisMax query parser) that is still present as of Solr 4.8.1. The "pf" part of the query (the full phrase query) is no longer applied as a disjunction max query. Instead, the scores of all matching fields are simply added to the total score, rather than taking the maximum field score plus the tie-breaker times the sum of the other matching field scores.
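A minimal sketch of the two scoring rules described above, showing how the change can reorder results. The function names and the per-field scores are hypothetical and chosen only for illustration; this is not Solr's implementation.

```python
def dismax_score(field_scores, tie):
    """Best field score plus `tie` times the sum of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


def summed_score(field_scores):
    """Plain sum of all field scores (the behavior reported for pf in 4.8)."""
    return sum(field_scores)


TIE = 0.17  # same tie-breaker value as the handler in this report

# Hypothetical per-field phrase scores for two documents.
doc_a = [2.0]                   # one strong phrase match in a single field
doc_b = [0.7, 0.7, 0.7, 0.7]    # weak phrase matches in four fields

for name, scores in (("doc_a", doc_a), ("doc_b", doc_b)):
    print(name, "dismax:", dismax_score(scores, TIE), "sum:", summed_score(scores))
```

With these made-up numbers, disjunction max ranks doc_a above doc_b, while plain summation ranks doc_b above doc_a, which is the kind of ranking shift the report describes.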

      This changes the scores and the rankings significantly. When we upgraded from Solr 3.5, one of our relevance test measures showed target results dropping by more than a full rank due to this bug. One key result went from rank 7 to past rank 40. I do not see any easy workaround for this.

      The following is a comparison between query results for Solr 3.5 and Solr 4.8, for one query, showing the "pf" parts of the query and scores.

      With debug query turned on, the results are as follows. They clearly show that max is used with the tie-breaker in 3.5 but not in 4.8 for pf:

      query (3.5):
      boost(+(((inlink_text:edg^1.2 | body:edg^0.5 | title:edg^1.2 | meta_description:edg^0.5 | url_path:edg^1.2 | file_name:edg^1.2 | primary_header:edg^1.2 | secondary_header:edg^0.5)~0.17 (inlink_text:detect^1.2 | body:detect^0.5 | title:detect^1.2 | meta_description:detect^0.5 | url_path:detect^1.2 | file_name:detect^1.2 | primary_header:detect^1.2 | secondary_header:detect^0.5)~0.17)~2) (inlink_text:"edg detect"~100^1.2 | body:"edg detect"~100^0.5 | title:"edg detect"~100^1.2 | meta_description:"edg detect"~100^0.5 | url_path:"edg detect"~100^1.2 | file_name:"edg detect"~100^1.2 | primary_header:"edg detect"~100^1.2 | secondary_header:"edg detect"~100^0.5)~0.17,product(float(hier_score),pow(float(link_score),const(0.25))))

      I.e., the "pf" part of the query has the following disjunction max form:
      (inlink_text:"edg detect"~100^1.2 | body:"edg detect"~100^0.5 | ... | secondary_header:"edg detect"~100^0.5)~0.17

      pf results for one document (3.5):
      <lst>
      <bool name="match">true</bool>
      <float name="value">1.5689207</float>
      <str name="description">max plus 0.17 times others of:</str>
      <arr name="details">
      <lst>
      <bool name="match">true</bool>
      <float name="value">1.5596248</float>
      <str name="description">...</str>
      <arr name="details">...</arr>
      </lst>
      <lst>
      <bool name="match">true</bool>
      <float name="value">0.054681662</float>
      <str name="description">...</str>
      <arr name="details">...</arr>
      </lst>
      </arr>
      </lst>
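As a quick arithmetic check, the 3.5 explain values above can be recombined by hand. The sketch below assumes the two listed entries are the only matching pf clauses for this document and uses the tie value 0.17 from the handler; the variable names are illustrative.

```python
import math

TIE = 0.17

# Per-field scores copied from the 3.5 explain output.
matching_field_scores = [1.5596248, 0.054681662]
reported_total = 1.5689207  # the "max plus 0.17 times others of:" value

best = max(matching_field_scores)
others = sum(matching_field_scores) - best
recomputed_total = best + TIE * others

print(recomputed_total)
# Solr computes in 32-bit floats, so compare with a small tolerance.
print(math.isclose(recomputed_total, reported_total, rel_tol=1e-6))
```

The recomputed value agrees with the reported total to within float rounding, which is consistent with disjunction max being applied to pf in 3.5.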

      However, in 4.8, "max" and the tie-breaker are nowhere to be seen for the pf part of the query:
      query (4.8):
      boost(+(((inlink_text:edg^1.2 | body:edg^0.5 | title:edg^1.2 | meta_description:edg^0.5 | url_path:edg^1.2 | file_name:edg^1.2 | primary_header:edg^1.2 | secondary_header:edg^0.5)~0.17 (inlink_text:detect^1.2 | body:detect^0.5 | title:detect^1.2 | meta_description:detect^0.5 | url_path:detect^1.2 | file_name:detect^1.2 | primary_header:detect^1.2 | secondary_header:detect^0.5)~0.17)~2) body:"edg detect"~100^0.5 title:"edg detect"~100^1.2 url_path:"edg detect"~100^1.2 file_name:"edg detect"~100^1.2 primary_header:"edg detect"~100^1.2 secondary_header:"edg detect"~100^0.5 meta_description:"edg detect"~100^0.5 inlink_text:"edg detect"~100^1.2,product(float(hier_score),pow(float(link_score),const(0.25))))

      I.e., the "pf" part of the query does NOT have the disjunction max form:
      body:"edg detect"~100^0.5 title:"edg detect"~100^1.2 ... inlink_text:"edg detect"~100^1.2,
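The difference between the two parsed forms could be checked automatically, for example in a relevance regression test. The following is a heuristic sketch only: the function name and regular expression are assumptions, and the sample strings are shortened to two fields taken from the parsed queries in this report.

```python
import re

# Matches one phrase clause such as  title:"edg detect"~100^1.2
PHRASE_CLAUSE = re.compile(r'\w+:"[^"]*"(?:~\d+)?(?:\^[\d.]+)?')


def pf_is_dismax(pf_fragment):
    """Return True if every pair of adjacent phrase clauses is joined by '|'."""
    clauses = list(PHRASE_CLAUSE.finditer(pf_fragment))
    if len(clauses) < 2:
        return False
    for left, right in zip(clauses, clauses[1:]):
        separator = pf_fragment[left.end():right.start()].strip()
        if separator != "|":
            return False
    return True


# Shortened pf fragments in the 3.5 and 4.8 styles.
pf_35 = '(inlink_text:"edg detect"~100^1.2 | body:"edg detect"~100^0.5)~0.17'
pf_48 = 'body:"edg detect"~100^0.5 title:"edg detect"~100^1.2'

print(pf_is_dismax(pf_35))
print(pf_is_dismax(pf_48))
```

The check only looks at the separators between phrase clauses, so it should be applied to the pf portion of the parsed query rather than to the whole string.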

      pf results for one document (4.8). There is no max; both values are simply listed under the "sum of" element:
      <lst>
      <bool name="match">true</bool>
      <float name="value">0.03554287</float>
      <str name="description">...</str>
      <arr name="details">...</arr>
      </lst>
      <lst>
      <bool name="match">true</bool>
      <float name="value">1.0933692</float>
      <str name="description">...</str>
      <arr name="details">...</arr>
      </lst>
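To put a number on the difference, the sketch below takes the two 4.8 values above and compares their plain sum with what a disjunction max with tie 0.17 would have produced. It assumes these two entries are the only pf contributions for this document; the enclosing "sum of" total is not shown in the excerpt, so it is not checked here.

```python
TIE = 0.17

# Per-field pf scores copied from the 4.8 explain output.
pf_scores_48 = [0.03554287, 1.0933692]

summed = sum(pf_scores_48)                 # how 4.8 combines them
best = max(pf_scores_48)
as_dismax = best + TIE * (summed - best)   # how 3.5-style dismax would combine them
difference = summed - as_dismax            # equals (1 - TIE) * non-max scores

print("sum:", summed)
print("dismax:", as_dismax)
print("difference:", difference)
```

For a document matching only two fields the gap is small, but it grows with every additional field that matches the phrase, since each non-max field contributes its full score instead of 17% of it.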

      The Solr 4 request handler used is shown below; it is identical to the one used with 3.5:
      <requestHandler class="solr.SearchHandler" name="/sitewide">
        <lst name="defaults">
          <str name="defType">edismax</str>
          <str name="echoParams">explicit</str>
          <float name="tie">0.17</float>
          <str name="qf">
            body^0.5 title^1.2 url_path^1.2 file_name^1.2 primary_header^1.2 secondary_header^0.5 meta_description^0.5 inlink_text^1.2
          </str>
          <str name="pf">
            body^0.5 title^1.2 url_path^1.2 file_name^1.2 primary_header^1.2 secondary_header^0.5 meta_description^0.5 inlink_text^1.2
          </str>
          <int name="ps">100</int>
          <str name="boost">
            hier_score
          </str>
          <str name="boost">
            pow(link_score,0.25)
          </str>
        </lst>
        <lst name="spellchecker">
          <str name="spellcheck.onlyMorePopular">false</str>
          <str name="spellcheck.extendedResults">true</str>
          <str name="spellcheck.count">3</str>
          <str name="buildOnCommit">true</str>
        </lst>
        <arr name="last-components">
          <str>spellcheck</str>
        </arr>
      </requestHandler>
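To reproduce the comparison, the same request can be sent to both versions with debugging enabled. The sketch below only builds the request URL. The host, port, core name, and query text are assumptions (the original query text is not given in this report); debugQuery and debug.explain.structured are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, handler, query):
    """Build a request URL that asks Solr for parsed-query and explain output."""
    params = {
        "q": query,
        "debugQuery": "true",
        # Structured explain yields nested entries like those quoted above.
        "debug.explain.structured": "true",
    }
    return f"{base_url.rstrip('/')}/{handler.strip('/')}?{urlencode(params)}"


# Hypothetical host and core; "/sitewide" is the handler defined above.
url = build_debug_url("http://localhost:8983/solr/collection1", "/sitewide", "edg detect")
print(url)
```

Running the same URL against a 3.5 and a 4.8 instance and comparing the parsed query and the pf entries in the explain section should show the difference reported here.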

            People

              Assignee: Unassigned
              Reporter: Brian (brian44)