Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.2
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Filtered query ignores it's own boost.

      1. lucene-698.patch
        5 kB
        Michael Busch

        Activity

        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        I just commited hashCode() and equals() changes to take boost into account so that
        generic tests in QueryUtils.check(query) can pass.

        One should arguably be able to set the boost on any query clause, so I'm leaving this open since I think scoring probably ignores the boost too.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - I just commited hashCode() and equals() changes to take boost into account so that generic tests in QueryUtils.check(query) can pass. One should arguably be able to set the boost on any query clause, so I'm leaving this open since I think scoring probably ignores the boost too.
        Hide
        michaelbusch Michael Busch added a comment -

        With this patch FilteredQuery takes the boost into account for
        scoring. It includes a test that fails with the trunk version
        and passes with this patch.

        This patch also removes one test from TestSimpleExplanations:
        testFQ7(). These tests check if the score and the value in the
        explanation are the same. testFQ7() in particular verifies this
        for a FilterQuery with a boost of 0. But with a boost of 0 the
        score and the explanation has the value NaN, which makes
        assertEquals() fail. So I believe this is a incorrect test case.
        We just didn't notice it before because FilteredQuery did not
        take the boost into account.

        All unit tests pass.

        Show
        michaelbusch Michael Busch added a comment - With this patch FilteredQuery takes the boost into account for scoring. It includes a test that fails with the trunk version and passes with this patch. This patch also removes one test from TestSimpleExplanations: testFQ7(). These tests check if the score and the value in the explanation are the same. testFQ7() in particular verifies this for a FilterQuery with a boost of 0. But with a boost of 0 the score and the explanation has the value NaN, which makes assertEquals() fail. So I believe this is a incorrect test case. We just didn't notice it before because FilteredQuery did not take the boost into account. All unit tests pass.
        Hide
        hossman Hoss Man added a comment -

        i think the test class and test case testFQ7 in particular are "correct" in the sense that they try to verify every conceivable permutation of stock query times has an explanation that matches it's score ... the problem may just be in the CheckHits.ExplanationAsserter class ... perhaps it should test if either the score or the explanation value are NaN before comparing them, and fail if only one is NaN or if neither is NaN but they are not equal)

        (after all: if the score is NaN, then the explanation should be NaN as well)

        Show
        hossman Hoss Man added a comment - i think the test class and test case testFQ7 in particular are "correct" in the sense that they try to verify every conceivable permutation of stock query times has an explanation that matches it's score ... the problem may just be in the CheckHits.ExplanationAsserter class ... perhaps it should test if either the score or the explanation value are NaN before comparing them, and fail if only one is NaN or if neither is NaN but they are not equal) (after all: if the score is NaN, then the explanation should be NaN as well)
        Hide
        michaelbusch Michael Busch added a comment -

        > perhaps it should test if either the score or the explanation value are NaN
        > before comparing them, and fail if only one is NaN or if neither is NaN but
        > they are not equal)

        Thanks for reviewing, Hoss! You are right, we could do that and I was actually
        thinking about it already. The problem is if I make this fix than testFQ7 fails
        for TestSimpleExplanationsOfNonMatches because it is assumed that all
        non matching docs have a score of 0.0. I can easily change that, so that non
        matching docs can either have a score of 0.0 or NaN but I was not sure if we
        want that, because other scoring bugs resulting in a score of NaN (which we
        will hopefully never have) wouldn't be noticed then anymore.

        The reason why I argued that testFQ7 is an invalid test case is that it would
        fail for any other query with a boost set to 0. Ironically we have this test
        only for FilteredQuery, the only query class that ignores the boost, which
        made it pass in the past.

        Show
        michaelbusch Michael Busch added a comment - > perhaps it should test if either the score or the explanation value are NaN > before comparing them, and fail if only one is NaN or if neither is NaN but > they are not equal) Thanks for reviewing, Hoss! You are right, we could do that and I was actually thinking about it already. The problem is if I make this fix than testFQ7 fails for TestSimpleExplanationsOfNonMatches because it is assumed that all non matching docs have a score of 0.0. I can easily change that, so that non matching docs can either have a score of 0.0 or NaN but I was not sure if we want that, because other scoring bugs resulting in a score of NaN (which we will hopefully never have) wouldn't be noticed then anymore. The reason why I argued that testFQ7 is an invalid test case is that it would fail for any other query with a boost set to 0. Ironically we have this test only for FilteredQuery, the only query class that ignores the boost, which made it pass in the past.
        Hide
        hossman Hoss Man added a comment -

        Ahhhh ... yes, looking back at the comments in LUCENE-557 I remember now: I originally thought boosts of 0.0 were legal for all queries, and then discovered i was wrong, and removed a bunch of tests – but i clearly missed this one because it wasn't failing.

        we should go ahead and remove the test ... but we should probably also fix FilteredQuery so that a boost of 0 produces some other result then just a NaN score (either an exception, or a score of 0) since as you say: NaN scores are bad.

        Show
        hossman Hoss Man added a comment - Ahhhh ... yes, looking back at the comments in LUCENE-557 I remember now: I originally thought boosts of 0.0 were legal for all queries, and then discovered i was wrong, and removed a bunch of tests – but i clearly missed this one because it wasn't failing. we should go ahead and remove the test ... but we should probably also fix FilteredQuery so that a boost of 0 produces some other result then just a NaN score (either an exception, or a score of 0) since as you say: NaN scores are bad.
        Hide
        michaelbusch Michael Busch added a comment -

        > but we should probably also fix FilteredQuery so that a boost of 0
        > produces some other result then just a NaN score (either an exception,
        > or a score of 0) since as you say: NaN scores are bad.

        TermQuery actually behaves the same way. If boost is zero, then
        sumOfSquaredWeights() returns zero as well, resulting in a
        queryNorm of Infinity (due to a div by zero if DefaultSimilarity is
        used). Then it multiplies boost and queryNorm and 0*Infinity=NaN.

        Show
        michaelbusch Michael Busch added a comment - > but we should probably also fix FilteredQuery so that a boost of 0 > produces some other result then just a NaN score (either an exception, > or a score of 0) since as you say: NaN scores are bad. TermQuery actually behaves the same way. If boost is zero, then sumOfSquaredWeights() returns zero as well, resulting in a queryNorm of Infinity (due to a div by zero if DefaultSimilarity is used). Then it multiplies boost and queryNorm and 0*Infinity=NaN.
        Hide
        michaelbusch Michael Busch added a comment -

        Maybe Query.setBoost() should throw an IllegalArgumentException
        in case the value is zero?

        Show
        michaelbusch Michael Busch added a comment - Maybe Query.setBoost() should throw an IllegalArgumentException in case the value is zero?
        Hide
        hossman Hoss Man added a comment -

        Hmmm.. didn't realize that. I withdrawal all previous comments. patch seems fine to me.

        Show
        hossman Hoss Man added a comment - Hmmm.. didn't realize that. I withdrawal all previous comments. patch seems fine to me.
        Hide
        hossman Hoss Man added a comment -

        whoops .. comment collision.

        i think the patch as it stands is fine for this issue .. but we may want another issue to hollisticly question NaN as a score.

        Show
        hossman Hoss Man added a comment - whoops .. comment collision. i think the patch as it stands is fine for this issue .. but we may want another issue to hollisticly question NaN as a score.
        Hide
        cutting Doug Cutting added a comment -

        > Maybe Query.setBoost() should throw an IllegalArgumentException in case the value is zero?

        FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the set of results but not their ranking. In particular, it uses this to automatically convert query clauses into filters, so that query clauses like "lang:en" can be implemented as cached filters. Note that not all such clauses are so optimized.

        http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup

        Show
        cutting Doug Cutting added a comment - > Maybe Query.setBoost() should throw an IllegalArgumentException in case the value is zero? FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the set of results but not their ranking. In particular, it uses this to automatically convert query clauses into filters, so that query clauses like "lang:en" can be implemented as cached filters. Note that not all such clauses are so optimized. http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup
        Hide
        michaelbusch Michael Busch added a comment -

        > FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the
        > set of results but not their ranking. In particular, it uses this to
        > automatically convert query clauses into filters, so that query
        > clauses like "lang:en" can be implemented as cached filters. Note
        > that not all such clauses are so optimized.

        Thanks for the hint, Doug. OK, I understand how you use boost=0.0f in
        Nutch. Quite cool and elegant idea actually!

        I guess then throwing an IllegalArgumentException in case boost=0 would
        break this. The question remains if we should fix the scorers to never
        return NaN. Hmm, I'm not completely sure how to do this. Maybe
        DefaultSimilarity.queryNorm() should return 0 instead of Infinity in
        case sumOfSquaredWeights is 0. But then with custom Similarity
        implemenations we could still end up getting NaN.

        A different solution of course is to fix it in the scorers itself, to
        return a score of 0 in case boost is 0. But then we'd have to add
        checks in the score() and explain() methods, which might be a
        performance overhead.

        So I'm not sure if we should "fix" this at all considering these
        difficulties and the fact that nobody complained (I think?) about the
        NaN so far.

        Anyway, I'll go ahead and commit LUCENE-698 since this NaN problem is a
        separate issue and not only happing for the FilteredQuery.

        Show
        michaelbusch Michael Busch added a comment - > FYI, Nutch uses Query.setBoost(0.0f) to add clauses which affect the > set of results but not their ranking. In particular, it uses this to > automatically convert query clauses into filters, so that query > clauses like "lang:en" can be implemented as cached filters. Note > that not all such clauses are so optimized. Thanks for the hint, Doug. OK, I understand how you use boost=0.0f in Nutch. Quite cool and elegant idea actually! I guess then throwing an IllegalArgumentException in case boost=0 would break this. The question remains if we should fix the scorers to never return NaN. Hmm, I'm not completely sure how to do this. Maybe DefaultSimilarity.queryNorm() should return 0 instead of Infinity in case sumOfSquaredWeights is 0. But then with custom Similarity implemenations we could still end up getting NaN. A different solution of course is to fix it in the scorers itself, to return a score of 0 in case boost is 0. But then we'd have to add checks in the score() and explain() methods, which might be a performance overhead. So I'm not sure if we should "fix" this at all considering these difficulties and the fact that nobody complained (I think?) about the NaN so far. Anyway, I'll go ahead and commit LUCENE-698 since this NaN problem is a separate issue and not only happing for the FilteredQuery.
        Hide
        michaelbusch Michael Busch added a comment -

        Patch committed.

        Show
        michaelbusch Michael Busch added a comment - Patch committed.
        Hide
        cutting Doug Cutting added a comment -

        > If boost is zero, then
        > sumOfSquaredWeights() returns zero as well, resulting in a
        > queryNorm of Infinity (due to a div by zero if DefaultSimilarity is
        > used). Then it multiplies boost and queryNorm and 0*Infinity=NaN.

        The bug here to me seems that queryNorm is Infinity. A boost of zero has a reasonable interpretation (don't influence scoring), but I don't see how a queryNorm of Infinity is ever useful. So perhaps we can remove the NaN by modifying the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero. Would that cause any harm?

        Show
        cutting Doug Cutting added a comment - > If boost is zero, then > sumOfSquaredWeights() returns zero as well, resulting in a > queryNorm of Infinity (due to a div by zero if DefaultSimilarity is > used). Then it multiplies boost and queryNorm and 0*Infinity=NaN. The bug here to me seems that queryNorm is Infinity. A boost of zero has a reasonable interpretation (don't influence scoring), but I don't see how a queryNorm of Infinity is ever useful. So perhaps we can remove the NaN by modifying the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero. Would that cause any harm?
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        > the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero.

        That seems like it should be fine, esp since Similarity.queryNorm is only called at the top level when creating a weight.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - > the default implementation of queryNorm to return 1.0 instead of Infinity when passed zero. That seems like it should be fine, esp since Similarity.queryNorm is only called at the top level when creating a weight.
        Hide
        michaelbusch Michael Busch added a comment -

        > So perhaps we can remove the NaN by modifying the default implementation of
        > queryNorm to return 1.0 instead of Infinity when passed zero. Would that
        > cause any harm?

        Yes I believe this should work, too. This would prevent the NaN score when
        DefaultSimilarity is used. It will be the responsibility of people
        who implement their own Similarity then to take care of this in a similar way.

        I'll open a new issue for fixing the DefaultSimilarity.

        Show
        michaelbusch Michael Busch added a comment - > So perhaps we can remove the NaN by modifying the default implementation of > queryNorm to return 1.0 instead of Infinity when passed zero. Would that > cause any harm? Yes I believe this should work, too. This would prevent the NaN score when DefaultSimilarity is used. It will be the responsibility of people who implement their own Similarity then to take care of this in a similar way. I'll open a new issue for fixing the DefaultSimilarity.

          People

          • Assignee:
            michaelbusch Michael Busch
            Reporter:
            yseeley@gmail.com Yonik Seeley
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development