SOLR-3981

docBoost is compounded on copyField

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Component/s: None
    • Labels: None

      Description

      As noted by Toke in a comment on SOLR-3875...

      https://issues.apache.org/jira/browse/SOLR-3875?focusedCommentId=13482233&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13482233

      While boosting of multi-value fields is handled correctly in Solr 4.0.0, boosting for copyFields is not. A sample document:

      <add><doc boost="10.0">
        <field name="id">Insane score Example. Score = 10E9 </field>
        <field name="name">Document boost broken for copyFields</field>
        <field name="manu" >video ThomasEgense and Toke Eskildsen</field>
        <field name="manu_id_s">Test</field>
        <field name="cat">bug</field>
        <field name="features">something else</field>
        <field name="keywords">bug</field>
        <field name="content">bug</field>
        </doc></add>
      

      The fields name, manu, cat, features, keywords and content get copied to text, and a search for thomasegense matches the text field with this query explanation:

      70384.67 = (MATCH) weight(text:thomasegense in 0) [DefaultSimilarity], result of:
        70384.67 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.30685282 = idf(docFreq=1, maxDocs=1)
          229376.0 = fieldNorm(doc=0)
      

      If the last two fields, keywords and content, are removed from the sample document, the score is reduced by a factor of 100 (docBoost^2).
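      The factor of 100 follows from the docBoost being multiplied into the destination field once per copied source field. A minimal sketch of that arithmetic, using a hypothetical CompoundedBoostDemo class that models the buggy behavior rather than any Solr code. It ignores length normalization and the lossy norm encoding, so it does not reproduce the exact fieldNorm shown above:

```java
/** Hypothetical model of the compounding bug; this is not Solr code. */
final class CompoundedBoostDemo {

    /**
     * Boost that ends up on the copyField destination when the docBoost is
     * (incorrectly) applied once for every source field copied into it.
     */
    static double compoundedBoost(double docBoost, int copiedSourceFields) {
        double boost = 1.0;
        for (int i = 0; i < copiedSourceFields; i++) {
            boost *= docBoost; // the bug: docBoost is multiplied in again per copy
        }
        return boost;
    }

    public static void main(String[] args) {
        // name, manu, cat, features, keywords, content are all copied to text
        double sixSources = compoundedBoost(10.0, 6);
        // the same document with keywords and content removed
        double fourSources = compoundedBoost(10.0, 4);
        System.out.println(sixSources / fourSources); // prints 100.0
    }
}
```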

      (This is a continuation of some of the problems caused by the changes made when the concept of docBoost was eliminated from the underlying IndexWriter code, and overlooked due to the lack of testing of docBoosts at the Solr level - SOLR-3885)

      1. SOLR-3981.patch
        13 kB
        Hoss Man
      2. SOLR-3981.patch
        11 kB
        Hoss Man
      3. SOLR-3981.patch
        8 kB
        Hoss Man

          Activity

          Hoss Man added a comment -

          Toke suggested in SOLR-3875...

          One solution would be to keep track of used fields (directly specified as well as copyFields) and only assign the full boost once per document. If the number of unique fields/document is low, a simple list would probably be the fastest and with low GC impact. For a higher number of unique fields, a Set might be better. An optimization would be to only create the tracking structure once a boost != 1.0f is encountered and only store the fields with boost != 1.0f, so that an update without boosts would not get a performance penalty.

          I was thinking that a more straightforward solution would be to build up the entire "Document" w/o any regard to the docBoost, and then only at the end loop over the fields in that Document and multiply in the docBoost if it's indexed & !omitNorms – but then i realized that at that level there is no general way to "set" the boost.

          I'm working on a patch with a test demonstrating the problem ... that may help inform an appropriate solution.
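
          Toke's tracking idea quoted above could look roughly like the following. This is only a sketch under assumptions: DocBoostTracker and its methods are hypothetical names, not Solr APIs, and it tracks only the document boost (per-field boosts are left out). The set is created lazily so that updates without a boost allocate nothing.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper sketching the tracking idea; not part of Solr. */
final class DocBoostTracker {
    private final float docBoost;
    private Set<String> boostedFields; // created lazily, only when a boost is in play

    DocBoostTracker(float docBoost) {
        this.docBoost = docBoost;
    }

    /** Returns the docBoost the first time a field name is seen, 1.0f afterwards. */
    float boostFor(String fieldName) {
        if (docBoost == 1.0f) {
            return 1.0f; // unboosted updates never allocate the set
        }
        if (boostedFields == null) {
            boostedFields = new HashSet<>();
        }
        // Set.add returns true only if the name was not already present
        return boostedFields.add(fieldName) ? docBoost : 1.0f;
    }

    /** True once the tracking set has been allocated. */
    boolean isTracking() {
        return boostedFields != null;
    }
}
```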

          Hoss Man added a comment -

          patch with the test i was working on, as well as a fix...

          the Document itself can serve as the "set" to keep track of which field names have already been added. because the final boost for the field name is the product of the individual boosts, we don't have to ensure that the (solr) docBoost and (solr) fieldBoost(s) are combined into the first value of each copyField – we just have to ensure that each is only used once. (multiple copyFields with the same dest will result in them being multiplied in the final dest field's norm but that's always been true)
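
          A simplified model of the approach described above, not the actual patch: BoostOnceDocument is a hypothetical stand-in for the document being built, and the "is this field name already present?" check plays the role of the set. The docBoost is folded in only for the first value added under a field name, while per-value (field) boosts from several copyField sources still multiply, as before.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of "apply the docBoost once per field name"; not Solr code. */
final class BoostOnceDocument {
    private final Map<String, List<Float>> valueBoosts = new LinkedHashMap<>();
    private final float docBoost;

    BoostOnceDocument(float docBoost) {
        this.docBoost = docBoost;
    }

    /** Adds one value; the docBoost is used only if the field name is new to the document. */
    void addValue(String fieldName, float valueBoost) {
        boolean firstValue = !valueBoosts.containsKey(fieldName);
        float boost = firstValue ? valueBoost * docBoost : valueBoost;
        valueBoosts.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(boost);
    }

    /** Product of the per-value boosts, which is what feeds the field's norm in this model. */
    float effectiveBoost(String fieldName) {
        float product = 1.0f;
        for (float b : valueBoosts.getOrDefault(fieldName, Collections.<Float>emptyList())) {
            product *= b;
        }
        return product;
    }
}
```

          With a docBoost of 10 and six unboosted values copied into text, the effective boost in this model is 10 rather than 10^6.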

          i'm still running the full test suite, and i want to work on a test that actually indexes a doc and inspects the encoded norms just to be certain i'm not missing something.

          Hoss Man added a comment -

          "i want to work on a test that actually indexes a doc and inspects the encoded norms just to be certain i'm not missing something."

          Updated patch adds this to the test – kludgy to reach this deep into the lucene code in the solr test, but do-able.

          Unfortunately the test fails because the decoded norms from the index wind up being way lower than the expected values.

          At first i thought it was just because i forgot to factor in the term length in my expected norm, but even taking that into account the numbers are still way off. i'm guessing either i don't understand something about the new 4.0 APIs for getting the DocValues/Norms, or i've got some trivially silly bug that i'm blind to because i've been staring at it too long.

          I'd appreciate a second set of eyes.

          Robert Muir added a comment -

          that adoc() you are using doesn't work with boosts. (I found this from another test)

          Toke Eskildsen added a comment -

          Thank you for investigating this so quickly, Hoss.

          Applying the boosts once from all source fields for a given copyField destination seems a bit strange to me, but since it is old behaviour, I understand that it cannot be changed.

          Hoss Man added a comment -

          "that adoc() you are using doesn't work with boosts. (I found this from another test)"

          Grr... thanks rmuir, never would have even thought to check that ... easy fix.

          "Applying the boosts once from all source fields for a given copyField destination seems a bit strange to me, but since it is old behaviour, I understand that it cannot be changed."

          right ... copyField has always copied the field boosts, the bug here is the compounded docBoost.

          FWIW: we could add a ton more options to copyField to give more fine-grained control over stuff like this, if you'd like to file some Jiras for feature improvements along those lines – but personally i think: a) update processors make more sense for stuff like this; b) people should move away from doc/field boosts and start doing more with functions on numeric fields (and ultimately DocValues fields), where you have a lot more control of this stuff

          Hoss Man added a comment -

          updated patch to include fix for the test-harness. Still running exhaustive tests

          Hoss Man added a comment -

          tests & precommit look good ... unless anyone spots any problems i'll commit later today.

          Hoss Man added a comment -

          Committed revision 1401916. - trunk
          Committed revision 1401920. - 4x

          Commit Tag Bot added a comment -

          [branch_4x commit] Chris M. Hostetter
          http://svn.apache.org/viewvc?view=revision&revision=1401920

          SOLR-3988: Fixed SolrTestCaseJ4.adoc(SolrInputDocument) to respect field and document boosts

          SOLR-3981: Fixed bug that resulted in document boosts being compounded in <copyField/> destination fields

          (merge r1401916)


            People

            • Assignee: Hoss Man
            • Reporter: Hoss Man
            • Votes: 0
            • Watchers: 2