Lucene - Core
  1. Lucene - Core
  2. LUCENE-2336

off by one: DisjunctionSumScorer::advance

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
    • Lucene Fields:
      New

      Description

      The bug is:

      if (target <= currentDoc) {

      should be

      if (target < currentDoc) {

      based on the comments for the method as well as the contract for DocIdSetIterator: "Advances to the first beyond the current"

      It can be demonstrated by:

      assertEquals("advance(1) first match failed", 1, scorer.advance(1));
      assertEquals("advance(1) second match failed", n, scorer.advance(1));

      if docId: 1 is a hit and n is the next hit. (Tests all pass if this code change is made.)

      I'm not labeling it as major because the class is package-protected and currently passes spec.

      Relevant excerpt:

      /**

      • Advances to the first match beyond the current whose document number is
      • greater than or equal to a given target. <br>
      • When this method is used the {@link #explain(int)}

        method should not be

      • used. <br>
      • The implementation uses the skipTo() method on the subscorers.
      • @param target
      • The target document number.
      • @return the document whose number is greater than or equal to the given
      • target, or -1 if none exist.
        */
        public int advance(int target) throws IOException {
        if (scorerDocQueue.size() < minimumNrMatchers) { return currentDoc = NO_MORE_DOCS; }

        if (target <= currentDoc)

        { return currentDoc; }

        do {
        if (scorerDocQueue.topDoc() >= target)

        { boolean b = advanceAfterCurrent(); return b ? currentDoc : (currentDoc = NO_MORE_DOCS); }

        else if (!scorerDocQueue.topSkipToAndAdjustElsePop(target))

        Unknown macro: { if (scorerDocQueue.size() < minimumNrMatchers) { return currentDoc = NO_MORE_DOCS; } }

        } while (true);
        }

        Activity

        Hide
        Gary Yngve added a comment -

        I don't have a specific use case. Long story short, I am toying with a BM25 implementation via subclassing a slightly refactored DisjunctionSumScorer, as the published BM25 research implementation doesn't work w/ phrases and is too smelly for my tastes. I'm attempting to understand how everything works deep in lucene by reading the code and playing with it (maybe I should try to read choice checkin logs too) and discovered this inconsistency in a unit test that I was playing with.

        The advance docs for DISI state that it behaves as written:

        int advance(int target) {
        int doc;
        while ((doc = nextDoc()) < target) {
        }
        return doc;
        }

        This to me implies that advance(0) should be equivalent to calling nextDoc() in all cases.
        (And this is how advance behaves in TermScorer and PhraseScorer.)

        However the real behavior for DSS is more like:

        while (docID() < target)

        { if (!next()) return -1; }

        return docID();

        On the other hand, the thread you cite states:

        > > To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic

        > But I thought we had just agreed that skipTo(doc) is well defined only for doc>current? (see "bigger question" point above).
        > If we all agree that , then we don't need to worry about skipTo(10) being called twice in a row.

        So it sounds like I'd be reasonably happy with a resolution of documenting advance in DISI with:
        "The behavior of calling advance with a target equal to docID() is undefined,"

        and happier with a convincing argument of why this situation never arises within any Lucene scorers in real life.

        -Gary

        Show
        Gary Yngve added a comment - I don't have a specific use case. Long story short, I am toying with a BM25 implementation via subclassing a slightly refactored DisjunctionSumScorer, as the published BM25 research implementation doesn't work w/ phrases and is too smelly for my tastes. I'm attempting to understand how everything works deep in lucene by reading the code and playing with it (maybe I should try to read choice checkin logs too) and discovered this inconsistency in a unit test that I was playing with. The advance docs for DISI state that it behaves as written: int advance(int target) { int doc; while ((doc = nextDoc()) < target) { } return doc; } This to me implies that advance(0) should be equivalent to calling nextDoc() in all cases. (And this is how advance behaves in TermScorer and PhraseScorer.) However the real behavior for DSS is more like: while (docID() < target) { if (!next()) return -1; } return docID(); On the other hand, the thread you cite states: > > To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic > But I thought we had just agreed that skipTo(doc) is well defined only for doc>current? (see "bigger question" point above). > If we all agree that , then we don't need to worry about skipTo(10) being called twice in a row. So it sounds like I'd be reasonably happy with a resolution of documenting advance in DISI with: "The behavior of calling advance with a target equal to docID() is undefined," and happier with a convincing argument of why this situation never arises within any Lucene scorers in real life. -Gary
        Hide
        Shai Erera added a comment -

        Hi Gary

        This has been discussed before (I'm not sure if about DisjunctionSumScorer specifically), and therefore there is also a NOTE in advance() of DISI:

           * <b>NOTE:</b> certain implementations may return a different value (each
           * time) if called several times in a row with the same target.
        

        Note the may return a different value... part. I remember while working on LUCENE-1614 that this has been discussed and thus we ended up w/ documenting that may return part. See here: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12710860&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12710860 and read some above and below to see relevant discussion.

        I'll need to refresh my memory though why DisjunctionSumScorer works like that ... perhaps an overlook on my side from 1614, but perhaps there was a reason.

        Anyway, about the code example you gave above, why would you want to call advance w/ the same value many times? What's the use case? If you're only dealing w/ one DISI, then unless you really want to skip to a certain document, I don't see any reason for calling advance. The usage is typically if you have 2 or more DISIs, and one's nextDoc or advance returned a value that is greater than the other's doc() ...

        Also, it's risky to write the code you wrote, because some scorers, upon init are already on a certain doc (I think the Disj. ones, but maybe also the Conj. one), and so by calling advance(1), you will actually skip over the first document and miss a hit.

        Can you clarify the usage then?

        Show
        Shai Erera added a comment - Hi Gary This has been discussed before (I'm not sure if about DisjunctionSumScorer specifically), and therefore there is also a NOTE in advance() of DISI: * <b>NOTE:</b> certain implementations may return a different value (each * time) if called several times in a row with the same target. Note the may return a different value... part. I remember while working on LUCENE-1614 that this has been discussed and thus we ended up w/ documenting that may return part. See here: https://issues.apache.org/jira/browse/LUCENE-1614?focusedCommentId=12710860&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12710860 and read some above and below to see relevant discussion. I'll need to refresh my memory though why DisjunctionSumScorer works like that ... perhaps an overlook on my side from 1614, but perhaps there was a reason. Anyway, about the code example you gave above, why would you want to call advance w/ the same value many times? What's the use case? If you're only dealing w/ one DISI, then unless you really want to skip to a certain document, I don't see any reason for calling advance. The usage is typically if you have 2 or more DISIs, and one's nextDoc or advance returned a value that is greater than the other's doc() ... Also, it's risky to write the code you wrote, because some scorers, upon init are already on a certain doc (I think the Disj. ones, but maybe also the Conj. one), and so by calling advance(1), you will actually skip over the first document and miss a hit. Can you clarify the usage then?

          People

          • Assignee:
            Unassigned
            Reporter:
            Gary Yngve
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Time Tracking

              Estimated:
              Original Estimate - 4h
              4h
              Remaining:
              Remaining Estimate - 4h
              4h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development