SOLR-3875

Document boost does not work correctly when using multi-valued fields

    Details

      Description

      In Solr 4 BETA and trunk, document boosts skew the ranking of documents with multi-valued fields tremendously. A document boost of 5 combined with 15 values in a multi-valued field results in scores above 1,000,000,000, while a boost of 0.5 results in scores below 0.001. The error is not present in Solr 3.6.

      Thomas Egense and I have tracked it down to a change in Solr's DocumentBuilder, committed on 2011-08-27 (@1162347) by Mike McCandless as part of the work on LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple instances of the same field when updating the index.

      The old DocumentBuilder, used in Solr 3.6, handled this by calculating the boost for the field (docBoost*fieldBoost) and assigning it to the first instance of the field, then assigning a boost of 1.0f to all subsequent instances. This effectively assigned docBoost*fieldBoost to the field, regardless of the number of instances.
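
To make the old behaviour concrete, here is a minimal sketch. It is not the actual DocumentBuilder code: the class and method names (OldBoostSketch, perInstanceBoosts, collectiveBoost) are made up for illustration, and the only assumption taken from the description is that Lucene multiplies the boosts of all instances of a field.

```java
import java.util.Arrays;

// Illustrative sketch only; NOT the actual Solr DocumentBuilder code.
// All names here are hypothetical.
class OldBoostSketch {

    // Boost given to each instance of one multi-valued field:
    // the combined boost on the first instance, 1.0f on all the rest.
    static float[] perInstanceBoosts(float docBoost, float fieldBoost, int instances) {
        float[] boosts = new float[instances];
        float boost = docBoost * fieldBoost;
        for (int i = 0; i < instances; i++) {
            boosts[i] = boost;
            boost = 1.0f; // later instances must not contribute to the boost
        }
        return boosts;
    }

    // Lucene multiplies the boosts of all instances of the same field.
    static double collectiveBoost(float[] boosts) {
        double product = 1.0;
        for (float b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        float[] boosts = perInstanceBoosts(5.0f, 1.0f, 15);
        System.out.println(Arrays.toString(boosts));
        System.out.println(collectiveBoost(boosts)); // prints 5.0
    }
}
```

With a document boost of 5 and a field boost of 1, the collective boost stays at 5 no matter how many values the field has.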

      The updated DocumentBuilder (see https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup), used in Solr 4 BETA and trunk, also assigns docBoost*fieldBoost to the first instance of the field. It then sets fieldBoost = docBoost and continues to assign docBoost*fieldBoost to subsequent instances. Using the example mentioned above, the generated IndexableFields are assigned boosts of 5, 5*5, 5*5, ..., 5*5. As Lucene multiplies all the values, 15 instances of the same field have a collective boost of 5*25^14.
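
The compounding can be reproduced with a small sketch. Again, this is not the real DocumentBuilder code, only a model of the steps described above; CompoundingBoostSketch and its methods are hypothetical names, and doubles are used so that the large product does not overflow.

```java
// Illustrative sketch only; NOT the actual Solr DocumentBuilder code.
// All names here are hypothetical.
class CompoundingBoostSketch {

    // Boost given to each instance of one multi-valued field under the
    // behaviour described above.
    static double[] perInstanceBoosts(double docBoost, double fieldBoost, int instances) {
        double[] boosts = new double[instances];
        for (int i = 0; i < instances; i++) {
            boosts[i] = docBoost * fieldBoost;
            fieldBoost = docBoost; // the problematic step: docBoost is applied again
        }
        return boosts;
    }

    // Lucene multiplies the boosts of all instances of the same field.
    static double collectiveBoost(double[] boosts) {
        double product = 1.0;
        for (double b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        // docBoost 5, 15 values: 5 * 25^14, roughly 1.86E20
        System.out.println(collectiveBoost(perInstanceBoosts(5.0, 1.0, 15)));
        // docBoost 0.5, 15 values: 0.5 * 0.25^14, roughly 1.86E-9
        System.out.println(collectiveBoost(perInstanceBoosts(0.5, 1.0, 15)));
    }
}
```

The two calls in main correspond to the boosts of 5 and 0.5 mentioned at the top of this description: one explodes, the other collapses towards zero.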

      This can be demonstrated with the Solr tutorial example by indexing the sample documents and adding the document

      <add>
      <doc boost="5">
        <field name="id">Insane score Example. Score = 10E9 </field>
        <field name="name">Document boost broken for multivalued fields</field>
        <field name="manu">Thomas Egense and Toke Eskildsen</field>
        <field name="manu_id_s">Test</field>
        <field name="cat">bug</field>
        <field name="features">insane_boost</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>
        <field name="features">something else</field>  
      </doc>
      </add>
      

      The manu and features fields are copied to the text field, so a search for thomas matches the text field with the following query explanation:

      <str name="Insane score Example. Score = 10E10 ">
      2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
        2.44373361E10 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.2512918 = idf(docFreq=3, maxDocs=38)
          7.5161928E9 = fieldNorm(doc=0)
      </str>
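
The numbers in the explanation can be checked by hand: the fieldWeight is the product of the three factors listed under it, so the inflated fieldNorm carries straight through to the score. A small sketch of that arithmetic, using float as the explanation does (FieldWeightCheck is a hypothetical name, not part of Lucene):

```java
// Recomputes the fieldWeight from the three factors in the explanation above.
// Illustrative only; FieldWeightCheck is not a Lucene class.
class FieldWeightCheck {

    // fieldWeight = tf * idf * fieldNorm, as listed in the explain output
    static float fieldWeight(float tf, float idf, float fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float score = fieldWeight(1.0f, 3.2512918f, 7.5161928E9f);
        System.out.println(score); // roughly 2.44E10, matching the explanation
    }
}
```

Since tf and idf are unremarkable here, the fieldNorm of about 7.5E9 alone accounts for the size of the score.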
      

      Thomas and I are too pressed for time to attempt a proper patch at the moment, but our guess is that reverting to the old algorithm, which assigns the combined boost to the first instance and 1.0f to all subsequent instances, would work.

          Activity

          Steve Rowe made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hoss Man made changes -
          Link This issue is related to SOLR-3981 [ SOLR-3981 ]
          Hoss Man made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hoss Man made changes -
          Link This issue relates to SOLR-3885 [ SOLR-3885 ]
          Hoss Man made changes -
          Assignee Hoss Man [ hossman ]
          Hoss Man made changes -
          Fix Version/s 4.0-BETA [ 12322455 ]
          Affects Version/s 4.1 [ 12321141 ]
          Affects Version/s 5.0 [ 12321664 ]
          Affects Version/s 4.0 [ 12322551 ]
          Hoss Man made changes -
          Attachment SOLR-3875.patch [ 12546368 ]
          Jan Høydahl made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Critical [ 2 ]
          Toke Eskildsen created issue -

            People

            • Assignee:
              Hoss Man
              Reporter:
              Toke Eskildsen
            • Votes:
              1
              Watchers:
              6
