Solr
  1. Solr
  2. SOLR-3875

Document boost does not work correctly when using multi-valued fields

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Critical Critical
    • Resolution: Fixed
    • Affects Version/s: 4.0-BETA
    • Fix Version/s: 4.0, 4.1, 5.0
    • Component/s: Schema and Analysis, update
    • Labels:
      None

      Description

      In Solr 4 BETA & trunk, document boosts skews the ranking for documents with multi value fields tremendously. A document boost of 5 combined with 15 values in a multi value field results in scores above 1,000,000,000, while a boost of 0,5 results in scores below 0,001. The error is not present in Solr 3.6.

      Thomas Egense and I have tracked it down to a change in Solr DocumentBuilder committed 20110827 (@1162347) by Mike McCandless, as part of work done on LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple instances of the same field when updating the index.

      The old DocumentBuilder, used in Lucene 3.6, handled this by calculating the score for the field (docBoost*fieldBoost) and assigning it to the first instance of the field, then setting the boost to 1.0f and assigning that to subsequent instances of the field. This effectively assigned docBoost*fieldBoost to the field, regardless of the number of instances.

      The updated DocumentBuilder (see https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup), used in Lucene 4 BETA & trunk, also assigns docBoost*fieldBoost to the first instance of the field. Then it sets fieldBoost = docBoost and continues to assign docBoost*fieldBoost to subsequent instances. Using the example mentioned above, the generated IndexableFields will get assigned boosts of 5, 5*5, 5*5... 5*5. As Lucene multiplies all the values, 15 instances of the same field will have a collective boost of 5*25^14.

      This can be demonstrated with the Solr tutorial example by indexing the sample documents and adding the document

      <add>
      <doc boost="5">
        <field name="id">Insane score Example. Score = 10E9 </field>
        <field name="name">Document boost broken for multivalued fields</field>
        <field name="manu">Thomas Egense and Toke Eskildsen</field>
        <field name="manu_id_s">Test</field>
        <field name="cat">bug</field>
        <field name="features">insane_boost</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>  
      </doc>
      </add>
      

      The manu & features-fields gets copied to text and a search for thomas matches the text-field with query explanation

      <str name="Insane score Example. Score = 10E10 ">
      2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
        2.44373361E10 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.2512918 = idf(docFreq=3, maxDocs=38)
          7.5161928E9 = fieldNorm(doc=0)
      </str>
      

      Thomas and I are too pressed for time to attempt a proper patch at the moment, but we guess that a reversion to the old algorithm of assigning the combined boost to the first instance and 1.0f to all subsequent instances would work?

        Issue Links

          Activity

          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1389636

          SOLR-3875: Fixed index boosts on multi-valued fields when docBoost is used (merge r1389628)

          Show
          Commit Tag Bot added a comment - [branch_4x commit] Chris M. Hostetter http://svn.apache.org/viewvc?view=revision&revision=1389636 SOLR-3875 : Fixed index boosts on multi-valued fields when docBoost is used (merge r1389628)
          Hide
          Hoss Man added a comment -

          Toke: thanks for following up - too bad we didn't catch this other problem before 4.0.

          I've spun off SOLR-3981 to work on this since SOLR-3875 is already resolved and listed as fixed in 4.0 (we can't (sanely) re-open issues that were recorded in CHANGES.txt for official releases since it would leave users confused as to what parts of those issues were resolved in each version)

          Show
          Hoss Man added a comment - Toke: thanks for following up - too bad we didn't catch this other problem before 4.0. I've spun off SOLR-3981 to work on this since SOLR-3875 is already resolved and listed as fixed in 4.0 (we can't (sanely) re-open issues that were recorded in CHANGES.txt for official releases since it would leave users confused as to what parts of those issues were resolved in each version)
          Hide
          Toke Eskildsen added a comment -

          Unfortunately, the bug is only partly solved. Thomas and I encountered strange scores again. While boosting of multi-value fields is handled correctly in Solr 4.0.0, boosting for copyFields are not. A sample document:

             <add><doc boost="10.0">
            <field name="id">Insane score Example. Score = 10E9 </field>
            <field name="name">Document boost broken for copyFields</field>
            <field name="manu" >video ThomasEgense and Toke Eskildsen</field>
            <field name="manu_id_s">Test</field>
            <field name="cat">bug</field>
            <field name="features">something else</field>
            <field name="keywords">bug</field>
            <field name="content">bug</field>
            </doc></add>
          

          The fields name, manu, cat, features, keywords and content gets copied to text and a search for thomasegense matches the text-field with query explanation

          70384.67 = (MATCH) weight(text:thomasegense in 0) [DefaultSimilarity], result of:
            70384.67 = fieldWeight in 0, product of:
              1.0 = tf(freq=1.0), with freq of:
                1.0 = termFreq=1.0
              0.30685282 = idf(docFreq=1, maxDocs=1)
              229376.0 = fieldNorm(doc=0)
          

          If the two last fields keywords and content are removed from the sample document, the score is reduced by a factor 100 (docBoost^2).

          The current DocumentBuilder https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1389648&view=markup works roughly like this:

          foreach (field) {
            boost = docBoost*fieldBoost
            foreach (value) {
              assignField(field, value, boost)
              foreach (copyField) {
                assignField(copyField, value, boost)
              }
              boost = 1f
            }
          }
          

          When all fields share the same copyField (text in this example), the copyField will have the full boost assigned for each directly specified field which uses that copyField. That's 5 times with the provided sample, so the total boost for the field text will be 10^5.

          One solution would be to keep track of used fields (directly specified as well as copyFields) and only assign the full boost once per document. If the number of unique fields/document is low, a simple list would probably be the fastest and with low GC impact. For a higher number of unique fields, a Set might be better. An optimization would be to only create the tracking structure once a boost != 1.0f is encountered and only store the fields with boost != 1.0f, so that an update without boosts would not get a performance penalty.

          Show
          Toke Eskildsen added a comment - Unfortunately, the bug is only partly solved. Thomas and I encountered strange scores again. While boosting of multi-value fields is handled correctly in Solr 4.0.0, boosting for copyFields are not. A sample document: <add><doc boost= "10.0" > <field name= "id" >Insane score Example. Score = 10E9 </field> <field name= "name" >Document boost broken for copyFields</field> <field name= "manu" >video ThomasEgense and Toke Eskildsen</field> <field name= "manu_id_s" >Test</field> <field name= "cat" >bug</field> <field name= "features" >something else </field> <field name= "keywords" >bug</field> <field name= "content" >bug</field> </doc></add> The fields name , manu , cat , features , keywords and content gets copied to text and a search for thomasegense matches the text-field with query explanation 70384.67 = (MATCH) weight(text:thomasegense in 0) [DefaultSimilarity], result of: 70384.67 = fieldWeight in 0, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 0.30685282 = idf(docFreq=1, maxDocs=1) 229376.0 = fieldNorm(doc=0) If the two last fields keywords and content are removed from the sample document, the score is reduced by a factor 100 (docBoost^2). The current DocumentBuilder https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1389648&view=markup works roughly like this: foreach (field) { boost = docBoost*fieldBoost foreach (value) { assignField(field, value, boost) foreach (copyField) { assignField(copyField, value, boost) } boost = 1f } } When all fields share the same copyField ( text in this example), the copyField will have the full boost assigned for each directly specified field which uses that copyField. That's 5 times with the provided sample, so the total boost for the field text will be 10^5. One solution would be to keep track of used fields (directly specified as well as copyFields) and only assign the full boost once per document. If the number of unique fields/document is low, a simple list would probably be the fastest and with low GC impact. For a higher number of unique fields, a Set might be better. An optimization would be to only create the tracking structure once a boost != 1.0f is encountered and only store the fields with boost != 1.0f, so that an update without boosts would not get a performance penalty.
          Hide
          Michael McCandless added a comment -

          Thanks Hoss.

          Show
          Michael McCandless added a comment - Thanks Hoss.
          Hide
          Thomas Egense added a comment -

          Thank you for the fast response and fix! Great job.

          Show
          Thomas Egense added a comment - Thank you for the fast response and fix! Great job.
          Hide
          Hoss Man added a comment -

          Is it possible we can improve tests somehow in this area?

          I'm open to suggestsion, but i've created SOLR-3885 to track it since we've already good a well vetted test for this specific problem commited and on track for a 4.0 respin.

          If other good tests pop up in SOLR-3885 before 4.0 final then we can certainly commit them, but it's not an area i'm likely to be focusing much attention right now.

          Show
          Hoss Man added a comment - Is it possible we can improve tests somehow in this area? I'm open to suggestsion, but i've created SOLR-3885 to track it since we've already good a well vetted test for this specific problem commited and on track for a 4.0 respin. If other good tests pop up in SOLR-3885 before 4.0 final then we can certainly commit them, but it's not an area i'm likely to be focusing much attention right now.
          Hide
          Hoss Man added a comment -

          Committed revision 1389648. - 4.0

          Show
          Hoss Man added a comment - Committed revision 1389648. - 4.0
          Hide
          Robert Muir added a comment -

          Is it possible we can improve tests somehow in this area?

          Both Mike and I have been ensnared by this file recently and introduced sneaky bugs.

          Additional tests showing we didnt introduce any regressions (especially with the crazy field types) would make me very happy.
          Additional tests that just make this less fragile would be very good for the future.

          Show
          Robert Muir added a comment - Is it possible we can improve tests somehow in this area? Both Mike and I have been ensnared by this file recently and introduced sneaky bugs. Additional tests showing we didnt introduce any regressions (especially with the crazy field types) would make me very happy. Additional tests that just make this less fragile would be very good for the future.
          Hide
          Hoss Man added a comment -

          Committed revision 1389636. -4x

          Show
          Hoss Man added a comment - Committed revision 1389636. -4x
          Hide
          Hoss Man added a comment -

          Committed revision 1389628. - trunk

          Show
          Hoss Man added a comment - Committed revision 1389628. - trunk
          Hide
          Hoss Man added a comment -

          ok ... planning to commit & backport for 4.0 respin (or 4.0.1), bear with me while i test on every branch

          Show
          Hoss Man added a comment - ok ... planning to commit & backport for 4.0 respin (or 4.0.1), bear with me while i test on every branch
          Hide
          Michael McCandless added a comment -

          Thanks Hoss, patch looks great!

          Show
          Michael McCandless added a comment - Thanks Hoss, patch looks great!
          Hide
          Mark Miller added a comment -

          patch with proposed test & fix

          +1

          I applied the patch, inspected the fix, inspected the test. It looks right to me.

          I also ran all tests, and verified the new test fails as expected without the fix.

          Show
          Mark Miller added a comment - patch with proposed test & fix +1 I applied the patch, inspected the fix, inspected the test. It looks right to me. I also ran all tests, and verified the new test fails as expected without the fix.
          Hide
          Yonik Seeley added a comment -

          +1, fix looks good!

          Show
          Yonik Seeley added a comment - +1, fix looks good!
          Hide
          Hoss Man added a comment -

          removing 4.0-beta from the fix version (we don't go back in time) and any affects versions that are in the future

          Show
          Hoss Man added a comment - removing 4.0-beta from the fix version (we don't go back in time) and any affects versions that are in the future
          Hide
          Hoss Man added a comment -

          patch with proposed test & fix ... i'm still running the full test sweet, but posting here for other folks to review early.

          Show
          Hoss Man added a comment - patch with proposed test & fix ... i'm still running the full test sweet, but posting here for other folks to review early.
          Hide
          Jan Høydahl added a comment -

          Upgrading to Critical. I seldom use document boosts, but could be a showstopper for some people upgrading.

          If a fix is not included, the release notes should mention it as a known bug and fix in next release.

          Show
          Jan Høydahl added a comment - Upgrading to Critical. I seldom use document boosts, but could be a showstopper for some people upgrading. If a fix is not included, the release notes should mention it as a known bug and fix in next release.

            People

            • Assignee:
              Hoss Man
              Reporter:
              Toke Eskildsen
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development