Lucene - Core
  LUCENE-2617

coord should still apply to missing terms/clauses

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Missing terms in a boolean query "disappear" (i.e. they don't even affect the coord factor).
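For context, the coord factor is the fraction of a BooleanQuery's clauses that a document matches; with Lucene's DefaultSimilarity it is simply overlap / maxOverlap. The sketch below (illustrative class and method names, not Lucene's internals) shows the arithmetic, and how a clause that "disappears" shrinks the denominator so the missing term no longer penalizes the score:

```java
public class CoordSketch {
    // With Lucene's DefaultSimilarity, coord(overlap, maxOverlap) is just
    // the ratio of matched clauses to total clauses.
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // q=solr apple: the doc matches 1 of 2 clauses.
        System.out.println(coord(1, 2)); // 0.5
        // The bug: when "zzz" exists nowhere in the index, its clause is
        // dropped entirely, so the denominator shrinks and no coord
        // penalty is applied.
        System.out.println(coord(1, 1)); // 1.0
    }
}
```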

        Activity

        Yonik Seeley added a comment -

        For Solr users, this is easy to see in the example data.

        The following query has a coord of 1/2 for the solr document (since apple appears in a different doc). Press CTRL-U to see the explain properly formatted:

        http://localhost:8983/solr/select?q=solr%20apple&debugQuery=true

        The following query has no coord since zzz does not appear in the example corpus:

        http://localhost:8983/solr/select?q=solr%20zzz&debugQuery=true

        Yonik Seeley added a comment -

        This happens because of short-circuiting during scorer creation - if the term doesn't exist in the index, then TermWeight.scorer() returns null. In the higher-level BooleanWeight.scorer(), the clause is completely disregarded (unless it is mandatory).
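A simplified sketch of that short-circuit (illustrative interfaces, not Lucene's actual classes): the null returned for a missing term means the optional clause never produces a scorer, and nothing downstream remembers that it existed:

```java
import java.util.ArrayList;
import java.util.List;

public class ShortCircuitSketch {
    interface Scorer {}
    interface Weight { Scorer scorer(); } // null if the term is absent

    // Mirrors the behavior described above: optional clauses whose weight
    // produced no scorer are silently skipped, so they can never reach the
    // coord denominator later.
    static List<Scorer> optionalScorers(List<Weight> weights) {
        List<Scorer> result = new ArrayList<>();
        for (Weight w : weights) {
            Scorer s = w.scorer();
            if (s != null) { // missing term => null => clause "disappears"
                result.add(s);
            }
        }
        return result;
    }
}
```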

        Yonik Seeley added a comment -

        OK, the fix should be easy - just keep track of maxCoord in the weights and pass it into the scorer instead of having the scorers add it up. I'll get to it tomorrow.
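In sketch form, that fix amounts to counting the clauses when the Weight is constructed - i.e. while every clause is still visible, before any scorer can short-circuit to null - and handing the count to the scorer (hypothetical names below, not the actual patch):

```java
public class MaxCoordSketch {
    // maxCoord is fixed when the boolean weight is built, while every
    // clause is still visible, and is handed to the scorer afterwards.
    final int maxCoord;

    MaxCoordSketch(int numClauses) {
        this.maxCoord = numClauses; // counted before scorer creation
    }

    // A clause whose scorer() later returned null still counts in the
    // denominator, so the missing term keeps its coord penalty.
    float coord(int matched) {
        return (float) matched / (float) maxCoord;
    }
}
```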

        Yonik Seeley added a comment -

        Here's an unfinished (and untested) draft patch.

        Paul Elschot added a comment -

        Wouldn't it be easier to use a non matching Scorer instead of null?

        Yonik Seeley added a comment -

        Wouldn't it be easier to use a non matching Scorer instead of null?

        That was my bias, but people decided against that in LUCENE-1754

        Yonik Seeley added a comment -

        OK, this bug does manifest in 3.1 also - just not with term queries. A phrase query of non-existing terms can tickle it:
        http://localhost:8983/solr/select?q=solr+"zzz+zzz"&debugQuery=true

        This does still work on Solr 1.4/Lucene 2.9. Anyone know if null is actually ever returned from Weight.scorer() in 2.9?

        Michael McCandless added a comment -

        Are we sure that non-existent terms/phrases should contribute to coord? I have no idea which way is "right"; I'm just asking...

        Anyone know if null is actually ever returned from Weight.scorer() in 2.9?

        Hmm, LUCENE-1754 was fixed in 2.9? Didn't it bring the allowed null return from Weight.scorer()?

        Marvin Humphrey added a comment -

        > Anyone know if null is actually ever returned from Weight.scorer() in 2.9?

        This behavior goes back a long time – at least back as far as svn trunk just
        prior to 1.9, which is what I worked off of when I was porting.

        Michael McCandless added a comment -

        Amazingly, I just hit a random test failure (seed = 5148889059941714836L reproduces it):

            [junit] Testsuite: org.apache.lucene.search.TestMultiSearcherRanking
            [junit] Testcase: testTwoTermQuery(org.apache.lucene.search.TestMultiSearcherRanking):	FAILED
            [junit] expected:<0.48314467> but was:<0.60746753>
            [junit] junit.framework.AssertionFailedError: expected:<0.48314467> but was:<0.60746753>
            [junit] 	at org.apache.lucene.search.TestMultiSearcherRanking.checkQuery(TestMultiSearcherRanking.java:102)
            [junit] 	at org.apache.lucene.search.TestMultiSearcherRanking.testTwoTermQuery(TestMultiSearcherRanking.java:47)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:380)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.run(LuceneTestCase.java:372)
        

        The failure is in fact caused by this issue! (The patch fixes it). The RandomIndexWriter in this test w/ the above seed creates a multi-segment index, and then tests on a 2-term BQ. Some segments did/didn't have both of the terms and this causes the score difference.

        So... I think we really should fix this, for consistency. Otherwise, how your docs fall into segments can alter your scores, which is awful.

        Robert Muir added a comment -

        So... I think we really should fix this, for consistency. Otherwise, how your docs fall into segments can alter your scores, which is awful.

        +1

        Yonik Seeley added a comment -

        Are we sure that non-existent terms/phrases should contribute to coord?

        Yep

        Adding an unrelated document and having scores radically change on other docs isn't a good thing.
        It also makes testing and relevancy tuning hard (hoss & I were really scratching our heads last night trying to figure out why coord wasn't being applied). idf is currently the only similarity factor that varies with index size + content... adding coord to that list wouldn't be good.

        Actually, I just thought of something potentially much worse. With per-segment searching, doesn't this make coord vary per-segment? That means we're adding up scores that aren't comparable... and all of a sudden this is looking like a much worse bug. A match in a very small segment is likely to contribute much more to a score than a match in a large segment.

        Yonik Seeley added a comment -

        All tests pass for me; I'll commit shortly to trunk and then backport to 3.1.

        Yonik Seeley added a comment -

        I backported the test to the 3.0 branch, and everything looks good.
        Both 2.9 and 3.0 remove null scorers w/o accounting for them in coord, but I don't believe we actually ever return any null scorers in those versions.

        Grant Ingersoll added a comment -

        Bulk close for 3.1


          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            0
            Watchers:
            0
