Lucene - Core / LUCENE-6758

Adding a SHOULD clause to a BQ over an empty field clears the score when using DefaultSimilarity

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Patch with unit test to show the bug will be attached.

      I've narrowed this change in behavior with git bisect to the following commit:

      commit 698b4b56f0f2463b21c9e3bc67b8b47d635b7d1f
      Author: Robert Muir <rmuir@apache.org>
      Date:   Thu Aug 13 17:37:15 2015 +0000
      
          LUCENE-6711: Use CollectionStatistics.docCount() for IDF and average field length computations
          
          git-svn-id: https://svn.apache.org/repos/asf/lucene/dev/trunk@1695744 13f79535-47bb-0310-9956-ffa450edef68
      
      1. LUCENE-6758.patch
        25 kB
        Robert Muir
      2. LUCENE-6758.patch
        6 kB
        Terry Smith

          Activity

          shebiki Terry Smith added a comment -

          Run this unit test a few times and you'll hit a failure when DefaultSimilarity is picked.

          The method testBQHitOrEmpty() will fail because the score is zero. Its counterpart, testBQHitOrMiss(), has a non-zero score.

          The difference between the two is that the field "empty" is unused, whereas the field "test" has one token ("hit").

          shebiki Terry Smith added a comment -

          Explain output for the failing query (testBQHitOrEmpty):

          0.0 = product of:
            0.0 = sum of:
              0.0 = weight(test:hit in 0) [DefaultSimilarity], result of:
                0.0 = score(doc=0,freq=1.0), product of:
                  0.0 = queryWeight, product of:
                    0.30685282 = idf(docFreq=1, docCount=1)
                    0.0 = queryNorm
                  0.30685282 = fieldWeight in 0, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    0.30685282 = idf(docFreq=1, docCount=1)
                    1.0 = fieldNorm(doc=0)
            0.5 = coord(1/2)
          

          Explain output for the variant against a populated field (testBQHitOrMiss):

          0.04500804 = product of:
            0.09001608 = sum of:
              0.09001608 = weight(test:hit in 0) [DefaultSimilarity], result of:
                0.09001608 = score(doc=0,freq=1.0), product of:
                  0.29335263 = queryWeight, product of:
                    0.30685282 = idf(docFreq=1, docCount=1)
                    0.9560043 = queryNorm
                  0.30685282 = fieldWeight in 0, product of:
                    1.0 = tf(freq=1.0), with freq of:
                      1.0 = termFreq=1.0
                    0.30685282 = idf(docFreq=1, docCount=1)
                    1.0 = fieldNorm(doc=0)
            0.5 = coord(1/2)
          
          rcmuir Robert Muir added a comment -

          The problem is just with crappy queryNorm in DefaultSimilarity, as expected.

          Previously maxDoc was used, which was always assumed to be a positive integer... but docCount can be zero.
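
The arithmetic behind the two explain outputs can be sketched in plain Java without Lucene. This is a minimal illustration, not the library's code: it assumes idf was computed as log(docCount / (docFreq + 1)) + 1 after LUCENE-6711 and that queryNorm is 1 / sqrt(sum of squared clause weights). The class and method names (QueryNormSketch, normFor) are hypothetical.

```java
final class QueryNormSketch {
    // Assumed pre-fix idf: log(docCount / (docFreq + 1)) + 1.
    // With docCount == 0 the logarithm is taken of 0, giving -Infinity.
    public static float idf(long docFreq, long docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    // Assumed queryNorm: 1 / sqrt(sumOfSquaredWeights).
    public static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Sums the squared idf of each clause (boost 1) and normalizes.
    public static float normFor(float... idfs) {
        float sum = 0f;
        for (float w : idfs) {
            sum += w * w;
        }
        return queryNorm(sum);
    }

    public static void main(String[] args) {
        float hit = idf(1, 1);   // test:hit, field has one document
        float miss = idf(0, 1);  // term absent from a populated field
        float empty = idf(0, 0); // term in a field that no document uses

        System.out.println("idf(hit)   = " + hit);
        System.out.println("idf(miss)  = " + miss);
        System.out.println("idf(empty) = " + empty);
        System.out.println("queryNorm(hit, miss)  = " + normFor(hit, miss));
        System.out.println("queryNorm(hit, empty) = " + normFor(hit, empty));
    }
}
```

Under these assumptions the clause over the unused field contributes an infinite squared weight, so the normalization factor collapses to zero and takes the score of the matching clause with it. The populated-field case reproduces the 0.9560043 queryNorm shown in the explain output.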

          jpountz Adrien Grand added a comment -

          +1

          shebiki Terry Smith added a comment -

          Ah, you've changed DefaultSimilarity.idf() to use (docCount + 1) instead of just docCount, forcing it to be larger than 0.

          That looks like a great fix, thanks.
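
A small standalone comparison of the two idf forms may make the effect of the change concrete. Only the (docCount + 1) substitution comes from the comment above; the rest of the formula, log(x / (docFreq + 1)) + 1, is an assumption about DefaultSimilarity, and the names IdfFixSketch, idfBefore and idfAfter are hypothetical.

```java
final class IdfFixSketch {
    // Assumed pre-fix form: the numerator is docCount, which can be 0.
    public static float idfBefore(long docFreq, long docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    // Form described in the comment: the numerator is docCount + 1,
    // so the argument of the logarithm is always positive.
    public static float idfAfter(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term queried against a field that no document uses.
        System.out.println("before: " + idfBefore(0, 0));
        System.out.println("after:  " + idfAfter(0, 0));
    }
}
```

With a finite idf for the unused field, the sum of squared weights stays finite and the query normalization no longer drops to zero.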

          rcmuir Robert Muir added a comment -

          Thank you for contributing the tests.

          mikemccand Michael McCandless added a comment -

          +1

          jira-bot ASF subversion and git services added a comment -

          Commit 1701895 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1701895 ]

          LUCENE-6758: don't let queries over nonexistent fields screw up querynorm


            People

            • Assignee:
              Unassigned
              Reporter:
              shebiki Terry Smith
            • Votes:
              0
              Watchers:
              5
