Lucene - Core
  1. Lucene - Core
  2. LUCENE-1614

Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far:

      1. Deprecate those two methods.
      2. Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0).
        • I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target.
      3. Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance.

      I will post a patch shortly

      1. LUCENE-1614.patch
        212 kB
        Shai Erera
      2. LUCENE-1614.patch
        212 kB
        Shai Erera
      3. LUCENE-1614.patch
        212 kB
        Shai Erera
      4. LUCENE-1614.patch
        152 kB
        Michael McCandless
      5. LUCENE-1614.patch
        201 kB
        Shai Erera
      6. LUCENE-1614.patch
        197 kB
        Shai Erera
      7. LUCENE-1614.patch
        197 kB
        Shai Erera
      8. LUCENE-1614.patch
        135 kB
        Michael McCandless
      9. LUCENE-1614.patch
        135 kB
        Shai Erera
      10. LUCENE-1614.patch
        135 kB
        Shai Erera
      11. LUCENE-1614.patch
        101 kB
        Shai Erera
      12. LUCENE-1614.patch
        101 kB
        Shai Erera
      13. LUCENE-1614.patch
        82 kB
        Shai Erera
      14. LUCENE-1614-advance-bw.patch
        0.6 kB
        Uwe Schindler
      15. LUCENE-1614-advance-bw.patch
        0.5 kB
        Uwe Schindler

        Activity

        Hide
        Michael McCandless added a comment -

        Shai did you forget to attach patch here? Or maybe you're just busy

        Show
        Michael McCandless added a comment - Shai did you forget to attach patch here? Or maybe you're just busy
        Hide
        Shai Erera added a comment -

        No I did not forget - I need to work on it (trying to juggle all the issues I opened ) ... in general I don't like to work on overlapping issues and this overlaps with 1593 (it will touch some of the same files). But I can start working on the patch - it looks much simpler than 1593 ...

        One thing I wanted to get feedback on is the proposal to use advance() and advance(target). Let's decide on that now, so that I don't need to "refactor" everything afterwards

        Show
        Shai Erera added a comment - No I did not forget - I need to work on it (trying to juggle all the issues I opened ) ... in general I don't like to work on overlapping issues and this overlaps with 1593 (it will touch some of the same files). But I can start working on the patch - it looks much simpler than 1593 ... One thing I wanted to get feedback on is the proposal to use advance() and advance(target). Let's decide on that now, so that I don't need to "refactor" everything afterwards
        Hide
        Marvin Humphrey added a comment -

        > advance() and advance(int)

        In the interest of coherent email exchanges, I think it would be best to give
        these methods distinct names, e.g. "nudge" and "advance".

        Show
        Marvin Humphrey added a comment - > advance() and advance(int) In the interest of coherent email exchanges, I think it would be best to give these methods distinct names, e.g. "nudge" and "advance".
        Hide
        Shai Erera added a comment -

        nudge doesn't sound like it changes anything, but just "touches". So if distinct method names is what we're after, I prefer nextDoc() and skipToDoc() or advance() for the latter.

        Show
        Shai Erera added a comment - nudge doesn't sound like it changes anything, but just "touches". So if distinct method names is what we're after, I prefer nextDoc() and skipToDoc() or advance() for the latter.
        Hide
        Marvin Humphrey added a comment -

        > nudge doesn't sound like it changes anything, but just "touches".

        If you say so. In Lucy, I expect we'll use "next" and "advance".

        > if distinct method names is what we're after

        Yes, that's the idea. These two methods are very different from each other.
        The official definition of skipTo() has many subtle gotchas. Just because
        they both move the iterator forward doesn't mean they do the same thing, and
        it is cumbersome and taxing to have to differentiate between methods using
        long-form signatures in the midst of standard prose.

        There's no good reason to conflate these two methods, just as there's no
        good reason why we should be forced to write "search(Collector)" instead
        of "collect()" or "collectHits()".

        > I prefer nextDoc() and skipToDoc() or advance() for the latter.

        IMO, "advance" more accurately describes what that method does than either
        "skipTo" or "skipToDoc". The problem is that if you're on doc 10, then
        skipToDoc(10) doesn't, in fact, skip to doc 10 as the method name implies –
        it takes you to at least doc 11. Furthermore, "advance" reinforces that you
        can only seek forwards.

        Show
        Marvin Humphrey added a comment - > nudge doesn't sound like it changes anything, but just "touches". If you say so. In Lucy, I expect we'll use "next" and "advance". > if distinct method names is what we're after Yes, that's the idea. These two methods are very different from each other. The official definition of skipTo() has many subtle gotchas. Just because they both move the iterator forward doesn't mean they do the same thing, and it is cumbersome and taxing to have to differentiate between methods using long-form signatures in the midst of standard prose. There's no good reason to conflate these two methods, just as there's no good reason why we should be forced to write "search(Collector)" instead of "collect()" or "collectHits()". > I prefer nextDoc() and skipToDoc() or advance() for the latter. IMO, "advance" more accurately describes what that method does than either "skipTo" or "skipToDoc". The problem is that if you're on doc 10, then skipToDoc(10) doesn't, in fact, skip to doc 10 as the method name implies – it takes you to at least doc 11. Furthermore, "advance" reinforces that you can only seek forwards.
        Hide
        Marvin Humphrey added a comment -

        Further illustration...

        Good method signature overloading, from IndexReader.java:

          public static boolean indexExists(String directory)
        
          public static boolean indexExists(File directory)
        
          public static boolean indexExists(Directory directory);
        

        Bad method signature overloading, from Searcher.java:

          public Hits search(Query query, Filter filter, Sort sort)
        
          public TopFieldDocs search(Query query, Filter filter, int n, Sort sort)
        
          public void search(Query query, HitCollector results)
        

        IMO, those three methods on Searcher should be named hits(), topFieldDocs(),
        and collect(), rather than search(), search(), and search(), making code that
        uses those methods more self documenting, and making it easier to discuss them
        out of context.

        For the same reasons, we should have different names for nextDoc() and
        advance().

        Show
        Marvin Humphrey added a comment - Further illustration... Good method signature overloading, from IndexReader.java: public static boolean indexExists(String directory) public static boolean indexExists(File directory) public static boolean indexExists(Directory directory); Bad method signature overloading, from Searcher.java: public Hits search(Query query, Filter filter, Sort sort) public TopFieldDocs search(Query query, Filter filter, int n, Sort sort) public void search(Query query, HitCollector results) IMO, those three methods on Searcher should be named hits(), topFieldDocs(), and collect(), rather than search(), search(), and search(), making code that uses those methods more self documenting, and making it easier to discuss them out of context. For the same reasons, we should have different names for nextDoc() and advance().
        Hide
        Michael McCandless added a comment -

        Maybe nextDoc and advanceToDoc?

        Re: skipTo doing alot.... one thing that's missing from
        DISI is method "advance to doc X then return a boolean
        telling me if you accept that doc, but do not do a next()". In
        LUCENE-1536 in some cases we gain sizable performance by using such an
        API with a relatively sparse filter against a relatively expensive
        (BooleanScorer2) scorer. Though "filter as BooleanClause" may undo
        these performance gains.

        Show
        Michael McCandless added a comment - Maybe nextDoc and advanceToDoc? Re: skipTo doing alot.... one thing that's missing from DISI is method "advance to doc X then return a boolean telling me if you accept that doc, but do not do a next()". In LUCENE-1536 in some cases we gain sizable performance by using such an API with a relatively sparse filter against a relatively expensive (BooleanScorer2) scorer. Though "filter as BooleanClause" may undo these performance gains.
        Hide
        Shai Erera added a comment -

        Whoa ! so many emails on this list that I completely missed the last two posts by Marvin (sorry about that Marvin).

        Maybe nextDoc and advanceToDoc? (or Matvin's nextDoc() and advance())

        I'm fine with nextDoc(). About advance, I think I prefer advance(int target) for the sake of not repeating the same information twice (i.e., advance*ToDoc*(int target)). So if I see a line like this: disi.advance(15) I think it's understood exactly as disi.advanceToDoc(15), only the latter is longer.

        advance to doc X then return a boolean telling me if you accept that doc, but do not do a next()

        What do you mean "do not do a next()"? Let's say I have a DISI that can return docs 1, 3, 7, 10 but it doesn't know so in advance. When you call skipTo(8), it must reach 7, then figure it's less than 8 and do a next() to land on 10. Do you mean that it should return false in that case, and not move from 7 to 10? Since it already moved, and let's assume it cannot go back (at least efficiently), that means it should remember the last skipTo called was 8 and it is on 10, so if it is requested to skip to 9, it should return false again ... is that what you meant?

        Show
        Shai Erera added a comment - Whoa ! so many emails on this list that I completely missed the last two posts by Marvin (sorry about that Marvin). Maybe nextDoc and advanceToDoc? (or Matvin's nextDoc() and advance()) I'm fine with nextDoc(). About advance, I think I prefer advance(int target) for the sake of not repeating the same information twice (i.e., advance*ToDoc*(int target )). So if I see a line like this: disi.advance(15) I think it's understood exactly as disi.advanceToDoc(15) , only the latter is longer. advance to doc X then return a boolean telling me if you accept that doc, but do not do a next() What do you mean "do not do a next()"? Let's say I have a DISI that can return docs 1, 3, 7, 10 but it doesn't know so in advance. When you call skipTo(8), it must reach 7, then figure it's less than 8 and do a next() to land on 10. Do you mean that it should return false in that case, and not move from 7 to 10? Since it already moved, and let's assume it cannot go back (at least efficiently), that means it should remember the last skipTo called was 8 and it is on 10, so if it is requested to skip to 9, it should return false again ... is that what you meant?
        Hide
        Michael McCandless added a comment -

        I think I prefer advance(int target)

        OK +1. Nice and succinct.

        What do you mean "do not do a next()"? Let's say I have a DISI that can return docs 1, 3, 7, 10 but it doesn't know so in advance. When you call skipTo(8), it must reach 7, then figure it's less than 8 and do a next() to land on 10. Do you mean that it should return false in that case, and not move from 7 to 10? Since it already moved, and let's assume it cannot go back (at least efficiently), that means it should remember the last skipTo called was 8 and it is on 10, so if it is requested to skip to 9, it should return false again ... is that what you meant?

        I mean the scorer should go straight to the doc I asked for and test whether it accepts that doc and do nothing else. Call this api "boolean check(int doc)" for now.

        It'd be useful for more interesting scorers. EG ConjunctionScorer.check would simply call check on each sub-scorer, stopping early if any return false; this is less work than what it does today.

        You're right, for the TermScorer case nothing is saved since it uses next internally to determine if a doc matches.

        Show
        Michael McCandless added a comment - I think I prefer advance(int target) OK +1. Nice and succinct. What do you mean "do not do a next()"? Let's say I have a DISI that can return docs 1, 3, 7, 10 but it doesn't know so in advance. When you call skipTo(8), it must reach 7, then figure it's less than 8 and do a next() to land on 10. Do you mean that it should return false in that case, and not move from 7 to 10? Since it already moved, and let's assume it cannot go back (at least efficiently), that means it should remember the last skipTo called was 8 and it is on 10, so if it is requested to skip to 9, it should return false again ... is that what you meant? I mean the scorer should go straight to the doc I asked for and test whether it accepts that doc and do nothing else. Call this api "boolean check(int doc)" for now. It'd be useful for more interesting scorers. EG ConjunctionScorer.check would simply call check on each sub-scorer, stopping early if any return false; this is less work than what it does today. You're right, for the TermScorer case nothing is saved since it uses next internally to determine if a doc matches.
        Hide
        Shai Erera added a comment -

        I mean the scorer should go straight to the doc I asked for and test whether it accepts that doc and do nothing else.

        Just to clarify for myself, in the example I gave above, suppose thar the scorer is on "3" and you call check(8). Do you expect it to go to 10, realize that 8 is not supported and go back to 3? Can it be called with check(7) afterwards? If not, then why not use advance(8), get back 10 and realize 8 is not supported? You should be able to call advance(9) without it advancing beyond 10 (which it's currently on).

        Your comment on TermScorer just reinforces my confusion - what is this API good for, and what's missing in advance(target) today? Can't ConjunctionScorer call advance until it gets a response which is not what was asked for?

        Show
        Shai Erera added a comment - I mean the scorer should go straight to the doc I asked for and test whether it accepts that doc and do nothing else. Just to clarify for myself, in the example I gave above, suppose thar the scorer is on "3" and you call check(8). Do you expect it to go to 10, realize that 8 is not supported and go back to 3? Can it be called with check(7) afterwards? If not, then why not use advance(8), get back 10 and realize 8 is not supported? You should be able to call advance(9) without it advancing beyond 10 (which it's currently on). Your comment on TermScorer just reinforces my confusion - what is this API good for, and what's missing in advance(target) today? Can't ConjunctionScorer call advance until it gets a response which is not what was asked for?
        Hide
        Shai Erera added a comment -

        I think I understand what you mean, but please correct me if I'm wrong. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle.

        This warrants an explicit documentation and implementation by current DISIs ... I don't think that if you call a DISI today with next(10) and next(10) it will not move to 11 in the second call. But calling check(10) and next(10) MUST not advance the DISI further than 10. If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation.

        BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer?

        Show
        Shai Erera added a comment - I think I understand what you mean, but please correct me if I'm wrong. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. This warrants an explicit documentation and implementation by current DISIs ... I don't think that if you call a DISI today with next(10) and next(10) it will not move to 11 in the second call. But calling check(10) and next(10) MUST not advance the DISI further than 10. If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer?
        Hide
        Michael McCandless added a comment -

        Just to clarify for myself, in the example I gave above, suppose thar the scorer is on "3" and you call check(8).

        On check(8), TermScorer would go to 10, stop there, and return false. (It would not "rewind" to 3). Check can only be called on increasing arguments, so it's not truly "random access". It's "forward only random access".

        You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle.

        Precisely.

        This is important when you have a super-cheap iterator (say a somewhat sparse (<=10%?) in-memory filter that's represented as list-of-docIDs). It's very fast for such a filter to iterate over its docIDs. But when that iterator is AND'd with a Scorer, as is done today by IndexSearcher, they effectively play "leap frog", where first it's the filter's turn to next(), then it's the Scorer's turn, etc. But for the Scorer, next() can be extremely costly, only to find the filter doesn't accept it. So for such situations it's better to let the filter drive the search, calling Scorer.check() on the docs.

        But... once we switch to filter-as-BooleanClause, it's less clear whether check() is worthwhile, because I think the filter's constraint is more efficiently taken into account.

        For filters that support random access (if they are less sparse, say >= 25% or so), we should push them all the way down to the TermScorers and factor them in just like deletedDocs.

        . If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation.

        Yes it does have a good default impl, I think.

        BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer?

        Not sure what you mean by "testAndSet-like version"?

        Show
        Michael McCandless added a comment - Just to clarify for myself, in the example I gave above, suppose thar the scorer is on "3" and you call check(8). On check(8), TermScorer would go to 10, stop there, and return false. (It would not "rewind" to 3). Check can only be called on increasing arguments, so it's not truly "random access". It's "forward only random access". You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. Precisely. This is important when you have a super-cheap iterator (say a somewhat sparse (<=10%?) in-memory filter that's represented as list-of-docIDs). It's very fast for such a filter to iterate over its docIDs. But when that iterator is AND'd with a Scorer, as is done today by IndexSearcher, they effectively play "leap frog", where first it's the filter's turn to next(), then it's the Scorer's turn, etc. But for the Scorer, next() can be extremely costly, only to find the filter doesn't accept it. So for such situations it's better to let the filter drive the search, calling Scorer.check() on the docs. But... once we switch to filter-as-BooleanClause, it's less clear whether check() is worthwhile, because I think the filter's constraint is more efficiently taken into account. For filters that support random access (if they are less sparse, say >= 25% or so), we should push them all the way down to the TermScorers and factor them in just like deletedDocs. . If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. Yes it does have a good default impl, I think. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer? Not sure what you mean by "testAndSet-like version"?
        Hide
        Shai Erera added a comment -

        Not sure what you mean by "testAndSet-like version"?

        I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8. I don't propose to replace check() with it as sometimes you might want to check a couple of DISIs before making a decision to which doc to advance, but it could save calling advance() in case check() returns true.

        Yes it does have a good default impl, I think.

        It will have a good default impl, I can guarantee to try . What I meant is that we should have clear documentation about check() and nextDoc() and the possibility that check will be called for doc Id 'X' and later nextDoc or advance will be called with 'X', in that case the impl must ensure 'X' is not skipped, as is done today by TermScorer for example.

        So should I add this check()?

        Show
        Shai Erera added a comment - Not sure what you mean by "testAndSet-like version"? I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8. I don't propose to replace check() with it as sometimes you might want to check a couple of DISIs before making a decision to which doc to advance, but it could save calling advance() in case check() returns true. Yes it does have a good default impl, I think. It will have a good default impl, I can guarantee to try . What I meant is that we should have clear documentation about check() and nextDoc() and the possibility that check will be called for doc Id 'X' and later nextDoc or advance will be called with 'X', in that case the impl must ensure 'X' is not skipped, as is done today by TermScorer for example. So should I add this check()?
        Hide
        Michael McCandless added a comment -

        I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8.

        Oh, sorry: that's in fact what I intended check to do. But by "moves to 8" what it really means is "you now cannot call check on anything < 8 (maybe 9)"

        I think after check(N) is called, one cannot call doc() – the results are not defined. So check(N) logically puts the iterator at N, and you may at that point call next if you want, or call another check(M) but you cannot call doc() right after check.

        So should I add this check()?

        I think so? We can then do perf tests of that vs filter-as-BooleanClause?

        Show
        Michael McCandless added a comment - I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8. Oh, sorry: that's in fact what I intended check to do. But by "moves to 8" what it really means is "you now cannot call check on anything < 8 (maybe 9)" I think after check(N) is called, one cannot call doc() – the results are not defined. So check(N) logically puts the iterator at N, and you may at that point call next if you want, or call another check(M) but you cannot call doc() right after check. So should I add this check()? I think so? We can then do perf tests of that vs filter-as-BooleanClause?
        Hide
        Michael McCandless added a comment -

        So should I add this check()?

        Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check().

        Show
        Michael McCandless added a comment - So should I add this check()? Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check().
        Hide
        Shai Erera added a comment -

        I think after check(N) is called, one cannot call doc()

        I think one cannot even call next(). If check(8) returns true, then you know that doc() will return 8 (otherwise it's a bug?). But if it returns false, it might be in 10 already, so calling next() will move it to 11 or something. So to be on the safe side, we should document that doc()'s result is unspecified if check() returns false, and next() is not recommended in that case, but skipTo() or check(M).

        Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check().

        I plan to, as much as I can, efficiently implement nextDoc() and advance() in all Scorers/DISIs. So I can include check() in the list as well. Or .. maybe you know something I don't and you think this should deserve its own issue?

        Show
        Shai Erera added a comment - I think after check(N) is called, one cannot call doc() I think one cannot even call next(). If check(8) returns true, then you know that doc() will return 8 (otherwise it's a bug?). But if it returns false, it might be in 10 already, so calling next() will move it to 11 or something. So to be on the safe side, we should document that doc()'s result is unspecified if check() returns false, and next() is not recommended in that case, but skipTo() or check(M). Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check(). I plan to, as much as I can, efficiently implement nextDoc() and advance() in all Scorers/DISIs. So I can include check() in the list as well. Or .. maybe you know something I don't and you think this should deserve its own issue?
        Hide
        Michael McCandless added a comment -

        I think one cannot even call next().

        Hmm, yeah I think you're right. We could perhaps make this an entirely different interface (abstract class). Ie, one should not mix and match "checking" with "next/advance"ing. In the case I can think of, at least, it's an up-front decision as to which scorer does next vs check.

        So I can include check() in the list as well.

        I think including it in this issue is fine.

        Show
        Michael McCandless added a comment - I think one cannot even call next(). Hmm, yeah I think you're right. We could perhaps make this an entirely different interface (abstract class). Ie, one should not mix and match "checking" with "next/advance"ing. In the case I can think of, at least, it's an up-front decision as to which scorer does next vs check. So I can include check() in the list as well. I think including it in this issue is fine.
        Hide
        Shai Erera added a comment -

        Ok I'll start with just adding nextDoc and advance, deprecate the old ones, implement in all extending classes and modify current code which uses DISI. Then, we can see how to proceed with check() and whether it should be part of this issue or another one.

        I have a feeling my last couple of issues just grew larger the more we discussed them, so I hope to keep this one containable

        Show
        Shai Erera added a comment - Ok I'll start with just adding nextDoc and advance, deprecate the old ones, implement in all extending classes and modify current code which uses DISI. Then, we can see how to proceed with check() and whether it should be part of this issue or another one. I have a feeling my last couple of issues just grew larger the more we discussed them, so I hope to keep this one containable
        Hide
        Michael McCandless added a comment -

        Then, we can see how to proceed with check() and whether it should be part of this issue or another one.

        Fair enough!

        I have a feeling my last couple of issues just grew larger the more we discussed them, so I hope to keep this one containable

        Alas this is what happens when you tackle tricky issues... sort of like pulling out that innocent looking thread sticking out on the bottom of your shirt

        Show
        Michael McCandless added a comment - Then, we can see how to proceed with check() and whether it should be part of this issue or another one. Fair enough! I have a feeling my last couple of issues just grew larger the more we discussed them, so I hope to keep this one containable Alas this is what happens when you tackle tricky issues... sort of like pulling out that innocent looking thread sticking out on the bottom of your shirt
        Hide
        Shai Erera added a comment -

        I started to work on that issue and noticed the javadoc of skipTo: "Skips entries to the first beyond the current whose document number is greater than or equal to target". That together with the code example implies that if a set has docs 9, 10, 11 and I call skipTo(10) and then doc() I'll get 10 the first time and 11 the second time (meaning two consecutive calls to skipTo with the same target return different results).

        I was wondering if we want to change it, i.e., have advance do something like that:

        while (doc() < target) {
          if (!next()) return -1;
        }
        return doc();
        

        I think this behavior is more accurate, at least in terms of semantics.

        Show
        Shai Erera added a comment - I started to work on that issue and noticed the javadoc of skipTo: "Skips entries to the first beyond the current whose document number is greater than or equal to target". That together with the code example implies that if a set has docs 9, 10, 11 and I call skipTo(10) and then doc() I'll get 10 the first time and 11 the second time (meaning two consecutive calls to skipTo with the same target return different results). I was wondering if we want to change it, i.e., have advance do something like that: while (doc() < target) { if (!next()) return -1; } return doc(); I think this behavior is more accurate, at least in terms of semantics.
        Hide
        Michael McCandless added a comment -

        I would tentatively prefer that change in semantics.

        Show
        Michael McCandless added a comment - I would tentatively prefer that change in semantics.
        Hide
        Yonik Seeley added a comment - - edited

        An implementation certainly shouldn't be required to carry state such that skipTo(10) called twice will yield different results.
        A bigger question though, is if we should support skipTo(doc) where doc<=current at all. That's sort of how I read "Skips entries to the first beyond the current"... that you shouldn't be calling skipTo(doc) unless doc>current.

        If we do want to define skipTo(doc) whendoc<=current, then I agree it should be as you describe.

        (edited to avoid JIRA markup )

        Show
        Yonik Seeley added a comment - - edited An implementation certainly shouldn't be required to carry state such that skipTo(10) called twice will yield different results. A bigger question though, is if we should support skipTo(doc) where doc<=current at all. That's sort of how I read "Skips entries to the first beyond the current"... that you shouldn't be calling skipTo(doc) unless doc>current. If we do want to define skipTo(doc) whendoc<=current, then I agree it should be as you describe. (edited to avoid JIRA markup )
        Hide
        Marvin Humphrey added a comment -

        > if a set has docs 9, 10, 11 and I call skipTo(10) and then doc() I'll get 10
        > the first time and 11 the second time

        > I was wondering if we want to change it,

        I agree that your proposed definition is more intuitive. I think it
        might make TermScorer.advance() a tad less efficient, though,
        because it would be necessary to check the current doc first.

          public boolean skipTo(int target) throws IOException {
            // first scan in cache
            for (pointer++; pointer < pointerMax; pointer++) {
               if (docs[pointer] >= target) {
                 doc = docs[pointer];
                 return true;
               }
            }
        

        There might be others, as well. I'd be concerned if something
        low-level like TermDocs or TermPositions was affected
        negatively. It seems likely, because we'd be changing from
        "advance, then check state" to "check state, then advance,
        then check state".

        Funny, but this is actually closer to "skip to" than "advance", since
        under this proposal, the iterator would not always advance.

        Show
        Marvin Humphrey added a comment - > if a set has docs 9, 10, 11 and I call skipTo(10) and then doc() I'll get 10 > the first time and 11 the second time > I was wondering if we want to change it, I agree that your proposed definition is more intuitive. I think it might make TermScorer.advance() a tad less efficient, though, because it would be necessary to check the current doc first. public boolean skipTo( int target) throws IOException { // first scan in cache for (pointer++; pointer < pointerMax; pointer++) { if (docs[pointer] >= target) { doc = docs[pointer]; return true ; } } There might be others, as well. I'd be concerned if something low-level like TermDocs or TermPositions was affected negatively. It seems likely, because we'd be changing from "advance, then check state" to "check state, then advance, then check state". Funny, but this is actually closer to "skip to" than "advance", since under this proposal, the iterator would not always advance.
        Hide
        Shai Erera added a comment -

        A bigger question though, is if we should support skipTo(doc) where doc<=current at all

        I don't think we should support it and it wasn't the intention in the first place. If we do want to support it than a different name for the method is required, such as seek(target) or moveTo(target).

        Funny, but this is actually closer to "skip to" than "advance", since under this proposal, the iterator would not always advance.

        I actually don't see it like that. skipTo and advance both imply that the iterator moves forward (as opposed to seek). However I don't think that when you say "advance to X" it means the iterator should move. The result of that command should put the iterator on X or beyond (if X does not exist). If the iterator is already on X, I don't think it should do anything.

        An implementation certainly shouldn't be required to carry state such that skipTo(10) called twice will yield different results.

        I'm not sure if by that you agree with me or disagree (maybe I'm misreading it). From the iterators I've modified already, calling skipTo(10) multiple times always yields the same result. Those are the ones that operate on BitSet. They never check their state, but just skip to wherever they're requested.

        In case you disagree, I'd like to ask why is it bad to request the implementation to remember its state? I think that all implementations already store the doc Id in case doc() will be called following skipTo, so in a sense they already remember their state. In addition, checking if the current doc Id is not the target is something I believe most will do - if they don't need to, like the bit-set variants, then they don't do it. If they cannot really skip to a target, but are forced to call next() until target is reached, they already check their state and so it does not add any overhead.

        To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic, but if you strongly disagree, then we should at least document it.

        Show
        Shai Erera added a comment - A bigger question though, is if we should support skipTo(doc) where doc<=current at all I don't think we should support it and it wasn't the intention in the first place. If we do want to support it than a different name for the method is required, such as seek(target) or moveTo(target). Funny, but this is actually closer to "skip to" than "advance", since under this proposal, the iterator would not always advance. I actually don't see it like that. skipTo and advance both imply that the iterator moves forward (as opposed to seek). However I don't think that when you say "advance to X" it means the iterator should move. The result of that command should put the iterator on X or beyond (if X does not exist). If the iterator is already on X, I don't think it should do anything. An implementation certainly shouldn't be required to carry state such that skipTo(10) called twice will yield different results. I'm not sure if by that you agree with me or disagree (maybe I'm misreading it). From the iterators I've modified already, calling skipTo(10) multiple times always yields the same result. Those are the ones that operate on BitSet. They never check their state, but just skip to wherever they're requested. In case you disagree, I'd like to ask why is it bad to request the implementation to remember its state? I think that all implementations already store the doc Id in case doc() will be called following skipTo, so in a sense they already remember their state. In addition, checking if the current doc Id is not the target is something I believe most will do - if they don't need to, like the bit-set variants, then they don't do it. If they cannot really skip to a target, but are forced to call next() until target is reached, they already check their state and so it does not add any overhead. To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic, but if you strongly disagree, then we should at least document it.
        Hide
        Shai Erera added a comment -

        Patch introduces the two added methods, as well as changes to our tests and code to utilize the new versions. All tests pass.

        I did not add check() yet, since I'd like to have this patch reviewed first, and then decide whether check() should be handled in that issue or not, and to get a hint of which Scorers / DISIs it should actually affect.

        Show
        Shai Erera added a comment - Patch introduces the two added methods, as well as changes to our tests and code to utilize the new versions. All tests pass. I did not add check() yet, since I'd like to have this patch reviewed first, and then decide whether check() should be handled in that issue or not, and to get a hint of which Scorers / DISIs it should actually affect.
        Hide
        Shai Erera added a comment -

        BTW, as I prepared that patch, I noticed the same improvement can be applied to TermDocs. However it's an interface and so we're stuck with back-compat, and replacing it with an abstract class means spending too much energy, at least on finding a better name . What do you think?

        Show
        Shai Erera added a comment - BTW, as I prepared that patch, I noticed the same improvement can be applied to TermDocs. However it's an interface and so we're stuck with back-compat, and replacing it with an abstract class means spending too much energy, at least on finding a better name . What do you think?
        Hide
        Michael McCandless added a comment -

        I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE?

        This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc. It'd be one fewer if to check, per sub-scorer per nextDoc/advance call, which would be a nice savings.

        This is what I'm doing on LUCENE-1594.

        Show
        Michael McCandless added a comment - I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE? This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc. It'd be one fewer if to check, per sub-scorer per nextDoc/advance call, which would be a nice savings. This is what I'm doing on LUCENE-1594 .
        Hide
        Yonik Seeley added a comment -

        > > A bigger question though, is if we should support skipTo(doc) where doc<=current at all
        > I don't think we should support it and it wasn't the intention in the first place.

        OK... I had read your proposal as saying you wanted skipTo(10) twice in a row to return 10 both times (assuming it matched).
        Marvin seemed to believe the same thing.

        > In case you disagree, I'd like to ask why is it bad to request the implementation to remember its state?

        I don't disagree, but requiring some implementations to keep and check state would mean less efficient implementations.

        > To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic

        But I thought we had just agreed that skipTo(doc) is well defined only for doc>current? (see "bigger question" point above).
        If we all agree that , then we don't need to worry about skipTo(10) being called twice in a row.

        Show
        Yonik Seeley added a comment - > > A bigger question though, is if we should support skipTo(doc) where doc<=current at all > I don't think we should support it and it wasn't the intention in the first place. OK... I had read your proposal as saying you wanted skipTo(10) twice in a row to return 10 both times (assuming it matched). Marvin seemed to believe the same thing. > In case you disagree, I'd like to ask why is it bad to request the implementation to remember its state? I don't disagree, but requiring some implementations to keep and check state would mean less efficient implementations. > To me, calling skipTo or advance with the same target multiple times and get different result every time is weird. I'd like to change that semantic But I thought we had just agreed that skipTo(doc) is well defined only for doc>current? (see "bigger question" point above). If we all agree that , then we don't need to worry about skipTo(10) being called twice in a row.
        Hide
        Shai Erera added a comment -

        I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE?

        The idea was to return a negative value and then compare the returned value to >= 0 for better performance. If we return MAX_VAL we'll need to compare to MAX_VAL, which is less efficient, CPU wise.

        But I thought we had just agreed that skipTo(doc) is well defined only for doc>current?

        Not sure - didn't we agree for >= current? An example are two iterators, one over 1, 3, 8, 11 and another on 2, 5, 9, 11. Let's say that you ask both for advance(9). The first one lands on 11 and the second on 9. Then you ask both to advance to 11 again - the first one returns "no more results" and the second one lands on 11.

        I have to be honest though that I'm not sure to which scenario that matches, I just had a feeling that calling advance(X) while the iterator is on X is something we may run in to and therefore we should make sure to return X. BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer.

        I had read your proposal as saying you wanted skipTo(10) twice in a row to return 10 both times (assuming it matched)

        That's exactly what I meant

        Show
        Shai Erera added a comment - I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE? The idea was to return a negative value and then compare the returned value to >= 0 for better performance. If we return MAX_VAL we'll need to compare to MAX_VAL, which is less efficient, CPU wise. But I thought we had just agreed that skipTo(doc) is well defined only for doc>current? Not sure - didn't we agree for >= current? An example are two iterators, one over 1, 3, 8, 11 and another on 2, 5, 9, 11. Let's say that you ask both for advance(9). The first one lands on 11 and the second on 9. Then you ask both to advance to 11 again - the first one returns "no more results" and the second one lands on 11. I have to be honest though that I'm not sure to which scenario that matches, I just had a feeling that calling advance(X) while the iterator is on X is something we may run in to and therefore we should make sure to return X. BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer. I had read your proposal as saying you wanted skipTo(10) twice in a row to return 10 both times (assuming it matched) That's exactly what I meant
        Hide
        Yonik Seeley added a comment -

        This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc.

        But for scorers that use a priority queue, does checking and immediately removing from the queue (hence making the heap smaller) offer any advantages? I had assumed so since this is what current scorers do. Immediately removing scorers also causes early termination for minimumNrMatchers>1 in DisjunctionSumScorer.

        Show
        Yonik Seeley added a comment - This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc. But for scorers that use a priority queue, does checking and immediately removing from the queue (hence making the heap smaller) offer any advantages? I had assumed so since this is what current scorers do. Immediately removing scorers also causes early termination for minimumNrMatchers>1 in DisjunctionSumScorer.
        Hide
        Michael McCandless added a comment -

        BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer.

        Actually, TermScorer is now effectively doing an additional check, because you removed the pointer++ from the for loop init. Ie, you now first check if (docs[pointer] >= target) before doing the pointer++.

        Show
        Michael McCandless added a comment - BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer. Actually, TermScorer is now effectively doing an additional check, because you removed the pointer++ from the for loop init. Ie, you now first check if (docs [pointer] >= target) before doing the pointer++.
        Hide
        Yonik Seeley added a comment -

        Not sure - didn't we agree for >= current?

        "A bigger question though, is if we should support skipTo(doc) where doc<=current at all"
        implies well defined behavior only when doc>current, as all of our scorers adhere to.

        BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer.

        Hmmm... TermScorer.skipTo() does first advance the pointer before checking any state. If we wanted to support calling skipTo(10) twice in a row, it would require extra code.

        Show
        Yonik Seeley added a comment - Not sure - didn't we agree for >= current? "A bigger question though, is if we should support skipTo(doc) where doc<=current at all" implies well defined behavior only when doc>current, as all of our scorers adhere to. BTW, none of the existing iterators in the code needed to perform any extra check to ensure that behavior. Even TermScorer. Hmmm... TermScorer.skipTo() does first advance the pointer before checking any state. If we wanted to support calling skipTo(10) twice in a row, it would require extra code.
        Hide
        Michael McCandless added a comment -

        > This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc.

        But for scorers that use a priority queue, does checking and immediately removing from the queue (hence making the heap smaller) offer any advantages? I had assumed so since this is what current scorers do. Immediately removing scorers also causes early termination for minimumNrMatchers>1 in DisjunctionSumScorer.

        But that only helps at the tail end of the iteration, vs saving an if
        check per-sub-scorer X per-next?

        Ie presumably much more CPU is spent iterating while the PQ is full,
        than while it's winding down, so saving the if per-sub-scorer-next is
        better?

        Also, I think over time we should migrate away from the PQ (ie, use
        BooleanScorer's batch approach, not Disjunction*Scorer's PQ) since the
        batch scoring approach gives better performance. EG I think we should
        extend BooleanScorer to handle MUST clauses. BooleanScorer handles
        doc=Integer.MAX_VALUE for a sub-scorer quite efficiently (the chunk is
        always skipped for that sub-scorer, after one if check).

        Show
        Michael McCandless added a comment - > This would save CPU for scorers that merge multiple sub-scorers (like BooleanScorer/2), because instead of having to check for -1 returned from each sub-scorer, they could simply proceed with their normal logic and check for Integer.MAX_VALUE just before collecting the doc. But for scorers that use a priority queue, does checking and immediately removing from the queue (hence making the heap smaller) offer any advantages? I had assumed so since this is what current scorers do. Immediately removing scorers also causes early termination for minimumNrMatchers>1 in DisjunctionSumScorer. But that only helps at the tail end of the iteration, vs saving an if check per-sub-scorer X per-next? Ie presumably much more CPU is spent iterating while the PQ is full, than while it's winding down, so saving the if per-sub-scorer-next is better? Also, I think over time we should migrate away from the PQ (ie, use BooleanScorer's batch approach, not Disjunction*Scorer's PQ) since the batch scoring approach gives better performance. EG I think we should extend BooleanScorer to handle MUST clauses. BooleanScorer handles doc=Integer.MAX_VALUE for a sub-scorer quite efficiently (the chunk is always skipped for that sub-scorer, after one if check).
        Hide
        Michael McCandless added a comment -

        > I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE?

        The idea was to return a negative value and then compare the returned value to >= 0 for better performance. If we return MAX_VAL we'll need to compare to MAX_VAL, which is less efficient, CPU wise.

        If you always check the returned result, right.

        But with BooleanScorer/2, and likely any scorer that invokes multiple
        sub-scorers, switching to Integer.MAX_INT as the sentinel would allow
        us to not test every sub-scorer's returned result from
        nextDoc/advance. Ie, because the docID moved forward (vs -1, which
        moved backwards), it's a "natural" fit for the main scorer's normal docID
        processing. So, far fewer if's (one per subscorer) are in the
        innermost loop.

        Marvin, what's your plan for Lucy's sentinel value for DISI/Scorer?

        Show
        Michael McCandless added a comment - > I wonder if instead of returning -1 when the iteration is done, we should return Integer.MAX_VALUE? The idea was to return a negative value and then compare the returned value to >= 0 for better performance. If we return MAX_VAL we'll need to compare to MAX_VAL, which is less efficient, CPU wise. If you always check the returned result, right. But with BooleanScorer/2, and likely any scorer that invokes multiple sub-scorers, switching to Integer.MAX_INT as the sentinel would allow us to not test every sub-scorer's returned result from nextDoc/advance. Ie, because the docID moved forward (vs -1, which moved backwards), it's a "natural" fit for the main scorer's normal docID processing. So, far fewer if's (one per subscorer) are in the innermost loop. Marvin, what's your plan for Lucy's sentinel value for DISI/Scorer?
        Hide
        Shai Erera added a comment -

        Ok I'll look into it tomorrow morning when I'll be near the code again. If it makes sense, then it makes sense

        You're also right about TermScorer. I missed the fact that checking the pointer occurs for the next pointer, and indeed now the code does one extra check than before.

        If you don't feel advance() should have this contract, then I can change TermScorer's implementation back and document that calling advance() with the same value several times in a row does not guarantee to return the same value every time ...

        I still think it's more logical to state that it should return the same value, however since I don't have a good scenario where that will happen (and perhaps the fact that nobody complained about it until now means it never happens?) and since it may pose a performance hit over some iterators, I'm willing to let go

        Show
        Shai Erera added a comment - Ok I'll look into it tomorrow morning when I'll be near the code again. If it makes sense, then it makes sense You're also right about TermScorer. I missed the fact that checking the pointer occurs for the next pointer, and indeed now the code does one extra check than before. If you don't feel advance() should have this contract, then I can change TermScorer's implementation back and document that calling advance() with the same value several times in a row does not guarantee to return the same value every time ... I still think it's more logical to state that it should return the same value, however since I don't have a good scenario where that will happen (and perhaps the fact that nobody complained about it until now means it never happens?) and since it may pose a performance hit over some iterators, I'm willing to let go
        Hide
        Michael McCandless added a comment -

        I still think it's more logical to state that it should return the same value

        I agree (hence my original tentative liking of this change), but I agree that this can only hurt performance.

        Ie the caller knows on calling advance() that the state of the sub-scorer must change. And so by altering advance to first check, that check is redundant. So maybe we should switch it back to "advance then check"?

        Show
        Michael McCandless added a comment - I still think it's more logical to state that it should return the same value I agree (hence my original tentative liking of this change), but I agree that this can only hurt performance. Ie the caller knows on calling advance() that the state of the sub-scorer must change. And so by altering advance to first check, that check is redundant. So maybe we should switch it back to "advance then check"?
        Hide
        Yonik Seeley added a comment -

        Scorers previously only had to worry about skipTo(doc) being called for doc>current. I don't currently see a use case for changing that.

        Show
        Yonik Seeley added a comment - Scorers previously only had to worry about skipTo(doc) being called for doc>current. I don't currently see a use case for changing that.
        Hide
        Paul Elschot added a comment -

        About using Integer.MAX_VALUE as sentinel, did anyone consider what happens when the first index actually reaches that number of documents?

        On moving from the priority queue (DisjunctionSumScorer/BooleanScorer2) to the batch approach (BooleanScorer): I did not find a way to do that while scoring docs in docId order. Basically it's a priority queue versus a distribution "sort" on the low docId bits into a linked list.
        The priority queue can be made faster by inlining (there is a patch for that, I can't get to the issue number now), but that's about the limit as far as I can see.

        Show
        Paul Elschot added a comment - About using Integer.MAX_VALUE as sentinel, did anyone consider what happens when the first index actually reaches that number of documents? On moving from the priority queue (DisjunctionSumScorer/BooleanScorer2) to the batch approach (BooleanScorer): I did not find a way to do that while scoring docs in docId order. Basically it's a priority queue versus a distribution "sort" on the low docId bits into a linked list. The priority queue can be made faster by inlining (there is a patch for that, I can't get to the issue number now), but that's about the limit as far as I can see.
        Hide
        Michael McCandless added a comment -

        About using Integer.MAX_VALUE as sentinel, did anyone consider what happens when the first index actually reaches that number of documents?

        Lucene already uses Integer.MAX_VALUE as a sentinel (eg the score(Collector) methods in Term/BooleanScorer/2), so a Lucene index can already only contain Integer.MAX_VALUE docs.

        On moving from the priority queue (DisjunctionSumScorer/BooleanScorer2) to the batch approach (BooleanScorer): I did not find a way to do that while scoring docs in docId order.

        What breaks if we allow docs to be collected out-of-order (besides external Hit/Collector)? As of LUCENE-1575, the core collectors can gain performance if they know the docs will be collected in order, but they can also handle out-or-order collection just fine.

        The priority queue can be made faster by inlining (there is a patch for that, I can't get to the issue number now), but that's about the limit as far as I can see.

        I think PQ is fundamentally not very friendly to modern CPUs, because of the hard-to-predict ifs; I think that's part of why the batch collection shows such gains.

        This doesn't hurt us so much during hit collection, which also uses PQ, since the queue typically quickly converges, but for OR scoring the PQ is intensely used the whole time.

        Show
        Michael McCandless added a comment - About using Integer.MAX_VALUE as sentinel, did anyone consider what happens when the first index actually reaches that number of documents? Lucene already uses Integer.MAX_VALUE as a sentinel (eg the score(Collector) methods in Term/BooleanScorer/2), so a Lucene index can already only contain Integer.MAX_VALUE docs. On moving from the priority queue (DisjunctionSumScorer/BooleanScorer2) to the batch approach (BooleanScorer): I did not find a way to do that while scoring docs in docId order. What breaks if we allow docs to be collected out-of-order (besides external Hit/Collector)? As of LUCENE-1575 , the core collectors can gain performance if they know the docs will be collected in order, but they can also handle out-or-order collection just fine. The priority queue can be made faster by inlining (there is a patch for that, I can't get to the issue number now), but that's about the limit as far as I can see. I think PQ is fundamentally not very friendly to modern CPUs, because of the hard-to-predict ifs; I think that's part of why the batch collection shows such gains. This doesn't hurt us so much during hit collection, which also uses PQ, since the queue typically quickly converges, but for OR scoring the PQ is intensely used the whole time.
        Hide
        Marvin Humphrey added a comment -

        > Marvin, what's your plan for Lucy's sentinel value for DISI/Scorer?

        Doc numbers start at 1, and 0 is the sentinel.

        Also, DISI will be named "Matcher". Features above and beyond what's
        currently in Lucene DISI to be determined.

        Show
        Marvin Humphrey added a comment - > Marvin, what's your plan for Lucy's sentinel value for DISI/Scorer? Doc numbers start at 1, and 0 is the sentinel. Also, DISI will be named "Matcher". Features above and beyond what's currently in Lucene DISI to be determined.
        Hide
        Shai Erera added a comment -

        So Mike - I've checked BS and BS2, and I don't see where does the return value can come into play the way you describe. Perhaps I'm missing it, but here's what I found:

        • BS - advance is not supported. In nextDoc the return value is checked only to verify if the sub scorer is done or not, therefore comparing to >= 0 seems the better option here.
        • BS2 - delegates advance and nextDoc to its counting scorer, which could have two variants that may be affected:
          • DisjunctionSumScorer - I don't see where the return value of the scorers is even considered.
          • ConjunctionScorer - both nextDoc and advance delegate the call to doNext() which iterates over the scorers. But it only checks the return value of advance to decide whether to continue with the iteration or not. The only thing I changed is using advance() >= 0 instead of skipTo which returned boolean. Is that the one you're talking about? Maybe you mean that in that scorer (only?) I could drop the 'more' member and stop iterating when the first scorer hits MAX_VAL, in which case it will not be less than the last doc, even if the last one is on MAX_VAL also?

        If I didn't miss anything, than that is just one scorer, which is not always instantiated. The rest do compare the return value of nextDoc and advance to determine what to do, exactly as they did before when the equivalent deprecated methods returned a boolean. It seems like this change will hurt the performance of most Scorers/DISIs more than improve the performance of BS2 in some cases where it instantiates a ConjunctionScorer. But like I said, maybe I'm missing a Scorer, so I'd appreciate if you can refer me to the one you had in mind.

        Also, given Marvin's response above, using 0 as sentinel is no different than using -1 in terms of "suddenly moving backwards".

        I fixed the documentation of advance and reinstated the implementation of TermScorer, so I'm ready to post an updated patch. I'd like us to resolve the return value issue before I do that though.

        Show
        Shai Erera added a comment - So Mike - I've checked BS and BS2, and I don't see where does the return value can come into play the way you describe. Perhaps I'm missing it, but here's what I found: BS - advance is not supported. In nextDoc the return value is checked only to verify if the sub scorer is done or not, therefore comparing to >= 0 seems the better option here. BS2 - delegates advance and nextDoc to its counting scorer, which could have two variants that may be affected: DisjunctionSumScorer - I don't see where the return value of the scorers is even considered. ConjunctionScorer - both nextDoc and advance delegate the call to doNext() which iterates over the scorers. But it only checks the return value of advance to decide whether to continue with the iteration or not. The only thing I changed is using advance() >= 0 instead of skipTo which returned boolean. Is that the one you're talking about? Maybe you mean that in that scorer (only?) I could drop the 'more' member and stop iterating when the first scorer hits MAX_VAL, in which case it will not be less than the last doc, even if the last one is on MAX_VAL also? If I didn't miss anything, than that is just one scorer, which is not always instantiated. The rest do compare the return value of nextDoc and advance to determine what to do, exactly as they did before when the equivalent deprecated methods returned a boolean. It seems like this change will hurt the performance of most Scorers/DISIs more than improve the performance of BS2 in some cases where it instantiates a ConjunctionScorer. But like I said, maybe I'm missing a Scorer, so I'd appreciate if you can refer me to the one you had in mind. Also, given Marvin's response above, using 0 as sentinel is no different than using -1 in terms of "suddenly moving backwards". I fixed the documentation of advance and reinstated the implementation of TermScorer, so I'm ready to post an updated patch. I'd like us to resolve the return value issue before I do that though.
        Hide
        Michael McCandless added a comment -

        I'm not referring to BS/2's exposure of advance/nextDoc to their
        callers; I'm talking about how BS/2 invoke advance/nextDoc on their
        sub-scorers.

        So, in BooleanScorer, starting on line 239 (w/ your patch), is this:

        int doc = sub.done ? -1 : scorer.doc();
        while (!sub.done && doc < end) {
          sub.collector.collect(doc);
          doc = scorer.nextDoc();
          sub.done = doc < 0;
        }
        

        this is the hotspot for BooleanScorer: it's advancing through a chunk
        (of 2048 docs) at once, for that one sub-scorer. Note the checks &
        assignments to sub.done that are required....

        If instead we switch to Integer.MAX_VALUE as the sentinel, that loop
        is simplified to this, instead:

        int doc = scorer.doc();
        while (doc < end) {
          sub.collector.collect(doc);
          doc = scorer.nextDoc();
        }
        

        scorer.done is no longer computed nor checked. What makes this
        possible is the existing "doc < end" check can be "reused" since the
        sentinel moves "forwards", not backwards.

        With this, we'd also need to change the toplevel collection (starting
        on line 150) to stop processing once all sub-scorers have advanced to
        the sentinel, but this is 1 added if per 2048 docs so the added cost
        (vs savings of not computing/checking scorer.done) is tiny.

        In ConjunctionScorer's doNext method (its hotspot), it currently does this:

        while (more && (firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) {
          more = firstScorer.advance(lastDoc) >= 0;
          lastScorer = firstScorer;
          first = (first == (scorers.length-1)) ? 0 : first+1;
        }
        return more;
        

        with the sentinel change, it would do this:

        while ((firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) {
          firstScorer.advance(lastDoc);
          lastScorer = firstScorer;
          first = (first == (scorers.length-1)) ? 0 : first+1;
        }
        return lastDoc != DOC_SENTINEL;
        

        ie, we no longer assign to, nor check, the more boolean. What makes
        this possible is we know all sub-scorers will advance to the sentinel,
        so we know that sentinel doc will pass the existing checks in the
        while loop.

        Also, given Marvin's response above, using 0 as sentinel is no different than using -1 in terms of "suddenly moving backwards".

        I agree that Marvin's choice of 0 is also "suddenly moving backwards",
        but it still seems to me to be a poor choice since it costs our
        BooleanScorers more CPU in their hotspots.

        Show
        Michael McCandless added a comment - I'm not referring to BS/2's exposure of advance/nextDoc to their callers; I'm talking about how BS/2 invoke advance/nextDoc on their sub-scorers. So, in BooleanScorer, starting on line 239 (w/ your patch), is this: int doc = sub.done ? -1 : scorer.doc(); while (!sub.done && doc < end) { sub.collector.collect(doc); doc = scorer.nextDoc(); sub.done = doc < 0; } this is the hotspot for BooleanScorer: it's advancing through a chunk (of 2048 docs) at once, for that one sub-scorer. Note the checks & assignments to sub.done that are required.... If instead we switch to Integer.MAX_VALUE as the sentinel, that loop is simplified to this, instead: int doc = scorer.doc(); while (doc < end) { sub.collector.collect(doc); doc = scorer.nextDoc(); } scorer.done is no longer computed nor checked. What makes this possible is the existing "doc < end" check can be "reused" since the sentinel moves "forwards", not backwards. With this, we'd also need to change the toplevel collection (starting on line 150) to stop processing once all sub-scorers have advanced to the sentinel, but this is 1 added if per 2048 docs so the added cost (vs savings of not computing/checking scorer.done) is tiny. In ConjunctionScorer's doNext method (its hotspot), it currently does this: while (more && (firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) { more = firstScorer.advance(lastDoc) >= 0; lastScorer = firstScorer; first = (first == (scorers.length-1)) ? 0 : first+1; } return more; with the sentinel change, it would do this: while ((firstScorer=scorers[first]).doc() < (lastDoc=lastScorer.doc())) { firstScorer.advance(lastDoc); lastScorer = firstScorer; first = (first == (scorers.length-1)) ? 0 : first+1; } return lastDoc != DOC_SENTINEL; ie, we no longer assign to, nor check, the more boolean. What makes this possible is we know all sub-scorers will advance to the sentinel, so we know that sentinel doc will pass the existing checks in the while loop. Also, given Marvin's response above, using 0 as sentinel is no different than using -1 in terms of "suddenly moving backwards". I agree that Marvin's choice of 0 is also "suddenly moving backwards", but it still seems to me to be a poor choice since it costs our BooleanScorers more CPU in their hotspots.
        Hide
        Shai Erera added a comment -

        Thanks Mike for the clarification. One thing though is the comment I added to BS about not being able to call scorer.doc() since we may hit NPE. I hit it when it used ReqExclScorer. According to the latter, doc will hit NPE if next() or skipTo() returned false (or in our case nextDoc or advance return MAX_VAL).

          public int doc() {
            return reqScorer.doc(); // reqScorer may be null when next() or skipTo() already return false
          }
        

        If we drop sub.done in BS and more in ConjScorer, calling doc() will hit NPE. I.e. in BS the first line cannot exist, however I'm not sure I can start with nextDoc() before checking the current doc() < end, which may hit NPE.

        int doc = scorer.doc();
        while (doc < end) {
          sub.collector.collect(doc);
          doc = scorer.nextDoc();
        }
        

        And similarly in ConjunctionScorer dropping 'more' may hit NPE if that Scorer is used.

        There are a couple of ways to handle it:

        • In ReqExclScorer.doc() check for reqScorer == null and return MAX_VAL if it is. I count here on doc() not being called very frequently after this patch, since nextDoc() and advance() return the document. However that may not be the case (see ConjunctionScorer which uses doc() in doNext).
        • In ReqExclScorer make sure that reqScorer is never nullified, but then we'll need to figure out a different way to mark that there are no more docs (perhaps change the comparison of reqScorer to null to a boolean moreDocs?)
        Show
        Shai Erera added a comment - Thanks Mike for the clarification. One thing though is the comment I added to BS about not being able to call scorer.doc() since we may hit NPE. I hit it when it used ReqExclScorer. According to the latter, doc will hit NPE if next() or skipTo() returned false (or in our case nextDoc or advance return MAX_VAL). public int doc() { return reqScorer.doc(); // reqScorer may be null when next() or skipTo() already return false } If we drop sub.done in BS and more in ConjScorer, calling doc() will hit NPE. I.e. in BS the first line cannot exist, however I'm not sure I can start with nextDoc() before checking the current doc() < end, which may hit NPE. int doc = scorer.doc(); while (doc < end) { sub.collector.collect(doc); doc = scorer.nextDoc(); } And similarly in ConjunctionScorer dropping 'more' may hit NPE if that Scorer is used. There are a couple of ways to handle it: In ReqExclScorer.doc() check for reqScorer == null and return MAX_VAL if it is. I count here on doc() not being called very frequently after this patch, since nextDoc() and advance() return the document. However that may not be the case (see ConjunctionScorer which uses doc() in doNext). In ReqExclScorer make sure that reqScorer is never nullified, but then we'll need to figure out a different way to mark that there are no more docs (perhaps change the comparison of reqScorer to null to a boolean moreDocs?)
        Hide
        Shai Erera added a comment -

        Actually, as I read the "back-compat discussion", I think that doing what you propose may break back-comapt? I mean, what if someone else's Scorer's doc() may throw an exception after its next() or skipTo() return false? So far our code made sure we don't call doc(), next() or skipTo() after the Scorer 'told' us it has no more documents. Now we're saying we will call doc() even if the scorer told us it has no more documents.

        Even though I believe it's very unlikely there is a Scorer which behaves that way, there is one for sure in Lucene (ReqExclScorer). So I don't know ... this whole back-compat is really a pain .

        What do you think?

        Show
        Shai Erera added a comment - Actually, as I read the "back-compat discussion", I think that doing what you propose may break back-comapt? I mean, what if someone else's Scorer's doc() may throw an exception after its next() or skipTo() return false? So far our code made sure we don't call doc(), next() or skipTo() after the Scorer 'told' us it has no more documents. Now we're saying we will call doc() even if the scorer told us it has no more documents. Even though I believe it's very unlikely there is a Scorer which behaves that way, there is one for sure in Lucene (ReqExclScorer). So I don't know ... this whole back-compat is really a pain . What do you think?
        Hide
        Michael McCandless added a comment -

        I think I'd favor not nulling reqScorer, though that's a bigger change, so perhaps instead just add the null test in ReqExclScorer.doc() and we can later optimize away the nulling?

        Also, in general, we can decouple the optimizations that become possible with Integer.MAX_VALUE as sentinel, from this issue...

        Show
        Michael McCandless added a comment - I think I'd favor not nulling reqScorer, though that's a bigger change, so perhaps instead just add the null test in ReqExclScorer.doc() and we can later optimize away the nulling? Also, in general, we can decouple the optimizations that become possible with Integer.MAX_VALUE as sentinel, from this issue...
        Hide
        Shai Erera added a comment -

        Regarding my previous post on back-compat, maybe we can get away with it by documenting it in CHANGES? If we think it's not very likely there are many Scorers out there that will be affected by this change?

        Show
        Shai Erera added a comment - Regarding my previous post on back-compat, maybe we can get away with it by documenting it in CHANGES? If we think it's not very likely there are many Scorers out there that will be affected by this change?
        Hide
        Michael McCandless added a comment -

        I think that doing what you propose may break back-comapt?

        Hmm.. this is a challenge... as DISI now stands, it's undefined what doc() will do once next/skipTo has returned false.

        Even if it doesn't throw an exception, it won't return our new sentinel value.

        I fear our only choice is to make a new method to return the doc, whose semantics are defined to reflect this change? We'd provide a default impl that returns doc() if the iteration hasn't ended, else the sentinel? Or.... some sort of wrapper to wrap an old DISI as the new one, but then we'd need a new DISI class to tell the difference. Hmm.

        Show
        Michael McCandless added a comment - I think that doing what you propose may break back-comapt? Hmm.. this is a challenge... as DISI now stands, it's undefined what doc() will do once next/skipTo has returned false. Even if it doesn't throw an exception, it won't return our new sentinel value. I fear our only choice is to make a new method to return the doc, whose semantics are defined to reflect this change? We'd provide a default impl that returns doc() if the iteration hasn't ended, else the sentinel? Or.... some sort of wrapper to wrap an old DISI as the new one, but then we'd need a new DISI class to tell the difference. Hmm.
        Hide
        Shai Erera added a comment -

        Maybe there's a way out of this. In 2.9 we're already changing the contract of DISI with the new nextDoc and advance. As opposed to before (boolean), we're now stating what these methods should return when there are no more documents. So what if we do this:

        1. In 2.9 we document that these methods should return MAX_VAL when there are no more documents.
        2. Document doc() that in 3.0 it should return MAX_VAL when there are no more docs. Also put that in CHANGES.
        3. I change the patch to return MAX_VAL for all current DISIs/Scorers, but don't take advantage of that yet in BS and ConjunctionScorer.
        4. I open another issue for 3.0 which will take advantage of that, and ensure all current DISIs/Scorers' doc() return MAX_VAL when there are no more docs?

        I think that should work. We'll need to delay with that optimization beyond 2.9, but I don't think there's a nice and clean way around it otherwise. If 3.0 should quickly follow 2.9, we won't wait for long .

        Show
        Shai Erera added a comment - Maybe there's a way out of this. In 2.9 we're already changing the contract of DISI with the new nextDoc and advance. As opposed to before (boolean), we're now stating what these methods should return when there are no more documents. So what if we do this: In 2.9 we document that these methods should return MAX_VAL when there are no more documents. Document doc() that in 3.0 it should return MAX_VAL when there are no more docs. Also put that in CHANGES. I change the patch to return MAX_VAL for all current DISIs/Scorers, but don't take advantage of that yet in BS and ConjunctionScorer. I open another issue for 3.0 which will take advantage of that, and ensure all current DISIs/Scorers' doc() return MAX_VAL when there are no more docs? I think that should work. We'll need to delay with that optimization beyond 2.9, but I don't think there's a nice and clean way around it otherwise. If 3.0 should quickly follow 2.9, we won't wait for long .
        Hide
        Michael McCandless added a comment -

        Maybe there's a way out of this.

        OK that sounds like a good plan. (Though it's yet another example of how we hurt our new users, by delaying these optimizations, in favor of our back-compat users...)

        Show
        Michael McCandless added a comment - Maybe there's a way out of this. OK that sounds like a good plan. (Though it's yet another example of how we hurt our new users, by delaying these optimizations, in favor of our back-compat users...)
        Hide
        Shai Erera added a comment -

        Though it's yet another example of how we hurt our new users, by delaying these optimizations, in favor of our back-compat users..

        Maybe the thread on back-compat will go our way and we'll be free to introduce this change in 2.9? Although I doubt it, since for the sake of back-compat, even if the last one, we'll probably decide to start that after 3.0 is out.

        Oh well ... like I said - hopefully our users will not have to wait long between 2.9 and 3.0.

        Show
        Shai Erera added a comment - Though it's yet another example of how we hurt our new users, by delaying these optimizations, in favor of our back-compat users.. Maybe the thread on back-compat will go our way and we'll be free to introduce this change in 2.9? Although I doubt it, since for the sake of back-compat, even if the last one, we'll probably decide to start that after 3.0 is out. Oh well ... like I said - hopefully our users will not have to wait long between 2.9 and 3.0.
        Hide
        Shai Erera added a comment -

        Hmmm ... there is a problem with MAX_VAL as sentinel:

        1. OpenBitSetIterator already has a nextDoc() method which returns -1 when exhausted. The current patch used -1 as sentinel, and therefore it was ok. But when I changed it to MAX_VAL, test-tag fail on TestOpenBitSet. We can document this change in "changes to back-compat policy" and fix the test on trunk and tag.
        2. SortedVIntList returns an iterator which is a DISI. MAX_VAL is considered a valid value for SortedVIntList (TestSortedVIntList.test03() validates that). So if we use it as sentinel, we declare that MAX_VAL is invalid for SortedVIntList. Not sure if we can do that.

        I think the second is the problematic one - how do we handle iterators for which MAX_VAL is a valid value? I tend to say MAX_VAL should not be a valid value since by the name, *DocId*SetIterator, we should return doc Ids, and MAX_VAL is used as sentinel elsewhere and is not even considered a valid doc Id. Therefore those iterators should change their logic, if they rely on MAX_VAL being a valid value.

        BTW, besides the convenience, why should SortedVIntList expose a DocIdSetIterator? Nothing implies the list will hold doc Ids, therefore why commit to an iterator which returns doc Ids? If it's for convenience only, then maybe wrap that iterator with a true DISI where the Lucene code will need a true DISI?

        What do you think?

        Show
        Shai Erera added a comment - Hmmm ... there is a problem with MAX_VAL as sentinel: OpenBitSetIterator already has a nextDoc() method which returns -1 when exhausted. The current patch used -1 as sentinel, and therefore it was ok. But when I changed it to MAX_VAL, test-tag fail on TestOpenBitSet. We can document this change in "changes to back-compat policy" and fix the test on trunk and tag. SortedVIntList returns an iterator which is a DISI. MAX_VAL is considered a valid value for SortedVIntList (TestSortedVIntList.test03() validates that). So if we use it as sentinel, we declare that MAX_VAL is invalid for SortedVIntList. Not sure if we can do that. I think the second is the problematic one - how do we handle iterators for which MAX_VAL is a valid value? I tend to say MAX_VAL should not be a valid value since by the name, *DocId*SetIterator, we should return doc Ids, and MAX_VAL is used as sentinel elsewhere and is not even considered a valid doc Id. Therefore those iterators should change their logic, if they rely on MAX_VAL being a valid value. BTW, besides the convenience, why should SortedVIntList expose a DocIdSetIterator? Nothing implies the list will hold doc Ids, therefore why commit to an iterator which returns doc Ids? If it's for convenience only, then maybe wrap that iterator with a true DISI where the Lucene code will need a true DISI? What do you think?
        Hide
        Michael McCandless added a comment -

        OpenBitSetIterator already has a nextDoc() method which returns -1 when exhausted.

        Hmm – maybe we need to choose a different name than nextDoc()? Or... we make a new class (DISI2 or something) so we can strongly differentiate old from new semantics?

        I tend to say MAX_VAL should not be a valid value

        Right, we are saying (have already said, elsewhere in Lucene's core code) that MAX_VAL is not a valid docID.

        MAX_VAL is considered a valid value for SortedVIntList (TestSortedVIntList.test03() validates that). So if we use it as sentinel, we declare that MAX_VAL is invalid for SortedVIntList. Not sure if we can do that.

        I think we should remove that test, and decide MAX_VAL is not valid value in the list, because SortedVIntList is a DocIdSet.

        BTW, besides the convenience, why should SortedVIntList expose a DocIdSetIterator?

        I don't follow – SortedVIntList subclasses DocIdSet, which necessarily provides DISI iterator() method. Why is this not a "true DISI"?

        Show
        Michael McCandless added a comment - OpenBitSetIterator already has a nextDoc() method which returns -1 when exhausted. Hmm – maybe we need to choose a different name than nextDoc()? Or... we make a new class (DISI2 or something) so we can strongly differentiate old from new semantics? I tend to say MAX_VAL should not be a valid value Right, we are saying (have already said, elsewhere in Lucene's core code) that MAX_VAL is not a valid docID. MAX_VAL is considered a valid value for SortedVIntList (TestSortedVIntList.test03() validates that). So if we use it as sentinel, we declare that MAX_VAL is invalid for SortedVIntList. Not sure if we can do that. I think we should remove that test, and decide MAX_VAL is not valid value in the list, because SortedVIntList is a DocIdSet. BTW, besides the convenience, why should SortedVIntList expose a DocIdSetIterator? I don't follow – SortedVIntList subclasses DocIdSet, which necessarily provides DISI iterator() method. Why is this not a "true DISI"?
        Hide
        Shai Erera added a comment -

        SortedVIntList subclasses DocIdSet

        Sorry, did not notice that. It's just that the test confused me, since I though it just stores VInts with no direct relation to doc Ids.

        maybe we need to choose a different name than nextDoc()

        Why? just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to. When he'll pass his DISI to the search flow somehow, not knowing these have become the primary methods, something will break.

        I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods.

        Can't we just document in CHANGES that nextDoc() now returns MAX_VAL when no more docs exist, and we fix the test in tag? I mean, how many users do we think use OBSI directly?

        MAX_VAL is not a valid docID

        I'll remove the test then (from trunk and tag) and document on DISI this assumption.

        Show
        Shai Erera added a comment - SortedVIntList subclasses DocIdSet Sorry, did not notice that. It's just that the test confused me, since I though it just stores VInts with no direct relation to doc Ids. maybe we need to choose a different name than nextDoc() Why? just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to. When he'll pass his DISI to the search flow somehow, not knowing these have become the primary methods, something will break. I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods. Can't we just document in CHANGES that nextDoc() now returns MAX_VAL when no more docs exist, and we fix the test in tag? I mean, how many users do we think use OBSI directly? MAX_VAL is not a valid docID I'll remove the test then (from trunk and tag) and document on DISI this assumption.
        Hide
        Shai Erera added a comment -

        BTW, regarding SortedVIntList - even though it extends DocIdSet, its javadocs start with "Store and iterate sorted integers in compressed form in RAM." - doc Ids are not mentioned. Also, the class is public, so nothing prevents someone from using it for integers that are not Doc Ids.

        I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things.

        Show
        Shai Erera added a comment - BTW, regarding SortedVIntList - even though it extends DocIdSet, its javadocs start with "Store and iterate sorted integers in compressed form in RAM." - doc Ids are not mentioned. Also, the class is public, so nothing prevents someone from using it for integers that are not Doc Ids. I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things.
        Hide
        Michael McCandless added a comment -

        I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things.

        +1

        A docID is different from an "int", because docIDs must be 0 .. MAX_VAL-1.

        Why? just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to. When he'll pass his DISI to the search flow somehow, not knowing these have become the primary methods, something will break.

        I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods.

        You're right, a random subclass of an abstract class could very well choose the name we are wanting to add, and then their class fails to compile, or (if the sigs turn out to be identical) runs bug possibly causes problems.

        But in this case we know we have just such a class that has done so (OBSI). And of course it did so for exactly the reasons that we are now wanting to add nextDoc to DISI. My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1 when it's done and we're gonna cause AIOOB exceptions if we up and change to returning MAX_VAL.

        That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods.

        As much as it bothers me having to accept inferior names (so they don't conflict with the existing names), I think it's very much the lesser-of-evils here.

        Show
        Michael McCandless added a comment - I think I'll emphasize that in the javadocs, documenting the limitation of MAX_VAL so that people won't assume the wrong things. +1 A docID is different from an "int", because docIDs must be 0 .. MAX_VAL-1. Why? just because OBSI declared a method which we wanted anyway? You know .. it's something we don't give much thought to when we add methods to abstract classes, but what if someone extended DISI and added his own advance(int) or nextDoc() which don't behave like we expect them to. When he'll pass his DISI to the search flow somehow, not knowing these have become the primary methods, something will break. I'm not saying we should protect these cases too, because otherwise we won't be able to make any changes. But just because OBSI had nextDoc() declared doesn't mean we should go and find a different name. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods. You're right, a random subclass of an abstract class could very well choose the name we are wanting to add, and then their class fails to compile, or (if the sigs turn out to be identical) runs bug possibly causes problems. But in this case we know we have just such a class that has done so (OBSI). And of course it did so for exactly the reasons that we are now wanting to add nextDoc to DISI. My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1 when it's done and we're gonna cause AIOOB exceptions if we up and change to returning MAX_VAL. That's slightly unrelated to this issue, but our back-compat policy forces us to replace good names with moderate ones, just because we cannot change methods. As much as it bothers me having to accept inferior names (so they don't conflict with the existing names), I think it's very much the lesser-of-evils here.
        Hide
        Shai Erera added a comment -

        My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1

        Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecated OBSI.nextDoc() ...

        Show
        Shai Erera added a comment - My guess is eg Solr probably relies heavily on OBSI.nextDoc returning -1 Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecated OBSI.nextDoc() ...
        Hide
        Shalin Shekhar Mangar added a comment -

        Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecated OBSI.nextDoc()

        I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code.

        Show
        Shalin Shekhar Mangar added a comment - Perhaps the Solr guys can state then if and how much they mind this change? Before we start the journey of finding a different name for DISI.nextDoc(), just to deprecated OBSI.nextDoc() I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code.
        Hide
        Shai Erera added a comment -

        I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code.

        So Mike - does that mean I can change nextDoc() behavior in OBSI and document it?

        Show
        Shai Erera added a comment - I don't see any calls to OpenBitSetIterator.nextDoc in solr's source code. So Mike - does that mean I can change nextDoc() behavior in OBSI and document it?
        Hide
        Michael McCandless added a comment -

        So Mike - does that mean I can change nextDoc() behavior in OBSI and document it?

        OK let's tentatively go forward with that?

        Show
        Michael McCandless added a comment - So Mike - does that mean I can change nextDoc() behavior in OBSI and document it? OK let's tentatively go forward with that?
        Hide
        Yonik Seeley added a comment -

        I'm warming to some of the simplifications that a MAX_VAL sentinel can bring.

        On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate.
        if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations?

        Show
        Yonik Seeley added a comment - I'm warming to some of the simplifications that a MAX_VAL sentinel can bring. On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate. if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations?
        Hide
        Shai Erera added a comment -

        I plan to open another issue for 3.0 to take advantage of MAX_VAL being returned from doc() also (we cannot rely on doc() returning MAX_VAL today when there are no more docs, hence why we need to wait with these changes until 3.0).

        You're proposing to add another contract to doc() - to return -1 before nextDoc() and advance(int) were called. I can do that, but we can use this contract only in 3.0.

        Unless the community decides to change back-compat policy starting with 2.9, which will give us the opportunity to take advantage of "latest & greatest" right away.

        Show
        Shai Erera added a comment - I plan to open another issue for 3.0 to take advantage of MAX_VAL being returned from doc() also (we cannot rely on doc() returning MAX_VAL today when there are no more docs, hence why we need to wait with these changes until 3.0). You're proposing to add another contract to doc() - to return -1 before nextDoc() and advance(int) were called. I can do that, but we can use this contract only in 3.0. Unless the community decides to change back-compat policy starting with 2.9, which will give us the opportunity to take advantage of "latest & greatest" right away.
        Hide
        Shai Erera added a comment -

        MAX_VAL as sentinel + the documentation changes + a new entry to CHANGES "back-compat change" on OBSI.nextDoc() and next(int) + tag fixes.

        All tests pass

        Show
        Shai Erera added a comment - MAX_VAL as sentinel + the documentation changes + a new entry to CHANGES "back-compat change" on OBSI.nextDoc() and next(int) + tag fixes. All tests pass
        Hide
        Michael McCandless added a comment -

        On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate.
        if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations?

        +1

        Since we're changing DISI's semantics, now seems like a great time to make this change to. Eliminating the "if (firstTime)" from next() would be great.

        But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers? (Though, really, we ought to do that sort based on more accurate criteria, eg add a DIS.approxCount() (the first docID of each sub-scorer is an approximation that could easily be very wrong). If we had that, then in the ConjunctionScorer's ctor we would do the ordering).

        Show
        Michael McCandless added a comment - On the other end of the scale... getting rid of "if (firstTime)" is another check I've long wanted to eliminate. if doc() produced -1 the first time, before any calls to next() or skipTo(), we could get rid of the if (firstTime) code in ConjunctionScorer and others I think. The question is, would this be a burden to any scorers or DISI implementations? +1 Since we're changing DISI's semantics, now seems like a great time to make this change to. Eliminating the "if (firstTime)" from next() would be great. But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers? (Though, really, we ought to do that sort based on more accurate criteria, eg add a DIS.approxCount() (the first docID of each sub-scorer is an approximation that could easily be very wrong). If we had that, then in the ConjunctionScorer's ctor we would do the ordering).
        Hide
        Michael McCandless added a comment -

        We could also consider adding DISI.start (we discussed this under another issue).

        And maybe likewise DISI.finish – there's a question on the user's list now "Do TermDocs and TermEnum need to be closed?" that notes that DISI never gives one a chance to close.

        Show
        Michael McCandless added a comment - We could also consider adding DISI.start (we discussed this under another issue). And maybe likewise DISI.finish – there's a question on the user's list now "Do TermDocs and TermEnum need to be closed?" that notes that DISI never gives one a chance to close.
        Hide
        Yonik Seeley added a comment -

        But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers?

        If they are all on -1 to start with, they are already all sorted.

        We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost?

        Show
        Yonik Seeley added a comment - But: wouldn't ConjunctionScorer still need an init() to sort its sub-scorers? If they are all on -1 to start with, they are already all sorted. We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost?
        Hide
        Michael McCandless added a comment -

        If they are all on -1 to start with, they are already all sorted.

        Right but that defeats the optimization. I'm talking about this code in ConjunctionScorer:

        Arrays.sort(scorers, new Comparator() {         // sort the array
            public int compare(Object o1, Object o2) {
              return ((Scorer)o1).doc() - ((Scorer)o2).doc();
            }
          });
        
        doNext();
        
        // If first-time skip distance is any predictor of
        // scorer sparseness, then we should always try to skip first on
        // those scorers.
        // Keep last scorer in it's last place (it will be the first
        // to be skipped on), but reverse all of the others so that
        // they will be skipped on in order of original high skip.
        int end=(scorers.length-1);
        for (int i=0; i<(end>>1); i++) {
          Scorer tmp = scorers[i];
          scorers[i] = scorers[end-i-1];
          scorers[end-i-1] = tmp;
        }
        

        Ie it sets things up so that "typically" the rarest sub-scorer drives the intersection. If they are all on -1 then this heuristic won't work.

        We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost?

        Right, we'd need to do something along these lines if we switch DISI to start with doc() = -1.

        Show
        Michael McCandless added a comment - If they are all on -1 to start with, they are already all sorted. Right but that defeats the optimization. I'm talking about this code in ConjunctionScorer: Arrays.sort(scorers, new Comparator() { // sort the array public int compare( Object o1, Object o2) { return ((Scorer)o1).doc() - ((Scorer)o2).doc(); } }); doNext(); // If first-time skip distance is any predictor of // scorer sparseness, then we should always try to skip first on // those scorers. // Keep last scorer in it's last place (it will be the first // to be skipped on), but reverse all of the others so that // they will be skipped on in order of original high skip. int end=(scorers.length-1); for ( int i=0; i<(end>>1); i++) { Scorer tmp = scorers[i]; scorers[i] = scorers[end-i-1]; scorers[end-i-1] = tmp; } Ie it sets things up so that "typically" the rarest sub-scorer drives the intersection. If they are all on -1 then this heuristic won't work. We could do some smart sorting in the constructor so that we skip in cheap and fast scorers first (TermScorers first, ordered by df, followed by simple conjunctions of terms, followed by other more expensive stuff like sloppy phrase queries and complex boolean queries. Perhaps in the future, even a method on Scorer that estimates it's cost? Right, we'd need to do something along these lines if we switch DISI to start with doc() = -1.
        Hide
        Michael McCandless added a comment -

        Oh, it turns out OBSI.nextDoc is new in 2.9! So we are free to change it...

        Show
        Michael McCandless added a comment - Oh, it turns out OBSI.nextDoc is new in 2.9! So we are free to change it...
        Hide
        Earwin Burrfoot added a comment -

        Oh, it turns out OBSI.nextDoc is new in 2.9!

        The phrase sounds all too familiar
        There's one absolutely cool javadoc tag, which I suggest we start using for all user-visible classes and their members. It's called - @Since. Suddenly, everything that's not yet released (and that's a big bunch), is clearly marked as free for changes and amendments.

        Show
        Earwin Burrfoot added a comment - Oh, it turns out OBSI.nextDoc is new in 2.9! The phrase sounds all too familiar There's one absolutely cool javadoc tag, which I suggest we start using for all user-visible classes and their members. It's called - @Since. Suddenly, everything that's not yet released (and that's a big bunch), is clearly marked as free for changes and amendments.
        Hide
        Shai Erera added a comment -

        Oh, it turns out OBSI.nextDoc is new in 2.9!

        Are you sure about it? If so, then why test-tag failed on it? Notice that there are two methods nextDoc() and next(int). Are both new in 2.9? If so, it means somebody added them to the tag, for some reason ...

        BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything.

        It's called - @Since

        That's absolutely a great idea !

        Show
        Shai Erera added a comment - Oh, it turns out OBSI.nextDoc is new in 2.9! Are you sure about it? If so, then why test-tag failed on it? Notice that there are two methods nextDoc() and next(int). Are both new in 2.9? If so, it means somebody added them to the tag, for some reason ... BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything. It's called - @Since That's absolutely a great idea !
        Hide
        Michael McCandless added a comment -

        Are you sure about it?

        Yes.

        If so, then why test-tag failed on it?

        Unfortunately, we came up with the idea of the back-compat branch after 2.4 was released, so we cut the branch at that point (in 2.9), so the back-compat branch does contain tests for early 2.9-only features.

        Are both new in 2.9?

        Yes.

        BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything.

        Agreed!

        Show
        Michael McCandless added a comment - Are you sure about it? Yes. If so, then why test-tag failed on it? Unfortunately, we came up with the idea of the back-compat branch after 2.4 was released, so we cut the branch at that point (in 2.9), so the back-compat branch does contain tests for early 2.9-only features. Are both new in 2.9? Yes. BTW, I'm going to open the follow-up issue to that, so we can discuss whatever improvements we want to make to the Scorers following the MAX_VAL sentinel there. Otherwise, they will get lost in this issue, and when we'll handle the follow-up one, we might not remember everything. Agreed!
        Hide
        Shai Erera added a comment -

        Are both new in 2.9?

        Yes.

        Oh that's great - and here I was deprecating next(int) in favor of the new advance. I'll just delete it then.

        I'm going to open the follow-up issue to that

        Opened LUCENE-1652 and copied what's relevant from this issue to there. If I missed something, please add it.

        Show
        Shai Erera added a comment - Are both new in 2.9? Yes. Oh that's great - and here I was deprecating next(int) in favor of the new advance. I'll just delete it then. I'm going to open the follow-up issue to that Opened LUCENE-1652 and copied what's relevant from this issue to there. If I missed something, please add it.
        Hide
        Shai Erera added a comment -

        Removed next(int) from OBSI and put @since 2.9 on the new methods in DISI.

        I think the patch is ready to be committed. We can add check() as appropriate in LUCENE-1652, when we utilize MAX_VAL.

        Show
        Shai Erera added a comment - Removed next(int) from OBSI and put @since 2.9 on the new methods in DISI. I think the patch is ready to be committed. We can add check() as appropriate in LUCENE-1652 , when we utilize MAX_VAL.
        Hide
        Shai Erera added a comment -

        A question regarding BS.nextDoc(). I paste the current code (that I'm working on, for reference).

          public int nextDoc() throws IOException {
            // TODO: can remove more?
        //    boolean more;
            do {
              while (bucketTable.first != null) {         // more queued
                current = bucketTable.first;
                bucketTable.first = current.next;         // pop the queue
        
                // check prohibited & required, and minNrShouldMatch
                if ((current.bits & prohibitedMask) == 0 &&
                    (current.bits & requiredMask) == requiredMask &&
                    current.coord >= minNrShouldMatch) {
                  return doc = current.doc;
                }
              }
        
              // refill the queue
        //      more = false;
              end += BucketTable.SIZE;
              for (SubScorer sub = scorers; sub != null; sub = sub.next) {
                Scorer scorer = sub.scorer;
                sub.collector.setScorer(scorer);
                int doc = scorer.docID();
                if (doc == -1) {
                  doc = scorer.nextDoc();
                }
                while (doc < end) {
                  sub.collector.collect(doc);
                  doc = scorer.nextDoc();
                }
        //        if (doc != NO_MORE_DOCS) {
        //          more = true;
        //        }
              }
            } while (bucketTable.first != null/* || more*/);
        
            return doc = NO_MORE_DOCS;
          }
        

        I wanted to get rid of 'more', following all the changes I'm doing to DISI. I did it and all tests pass, but I want to double-check my understanding, since I don't know for sure if there is a test that tests my change.

        As far as I see it, there are two code sections: (1) iterates on bucketTable until it is exhausted (i.e., calls to nextDoc() will first consume bucketTable) and (2) iterate on all sub scorers. The second iteration populates the bucket table from all sub-scorers and reiterates, consuming bucket table again, in calls to nextDoc(). Then, at some point, all sub scorers don't have anything more to collect, and bucketTable.first is null for the last time, at which point nextDoc() returns NO_MORE_DOCS.

        So it looks like I can indeed get rid of 'more'. But what puzzles me is why it was there in the first place. Leaving the code as-is means that someone thought of a case where a sub-scorer will have more documents to collect, but still after the 2nd code segment the bucketTable was empty. That's why I'm not sure I can remove 'more' safely - i.e., is it possible that a sub-scorer will have more documents, however all the docs that were collected by sub.collector will not affect the bucketTable? It doesn't look like in the code.

        Same question for BS.score(Collector collector, int max).

        Show
        Shai Erera added a comment - A question regarding BS.nextDoc(). I paste the current code (that I'm working on, for reference). public int nextDoc() throws IOException { // TODO: can remove more? // boolean more; do { while (bucketTable.first != null ) { // more queued current = bucketTable.first; bucketTable.first = current.next; // pop the queue // check prohibited & required, and minNrShouldMatch if ((current.bits & prohibitedMask) == 0 && (current.bits & requiredMask) == requiredMask && current.coord >= minNrShouldMatch) { return doc = current.doc; } } // refill the queue // more = false ; end += BucketTable.SIZE; for (SubScorer sub = scorers; sub != null ; sub = sub.next) { Scorer scorer = sub.scorer; sub.collector.setScorer(scorer); int doc = scorer.docID(); if (doc == -1) { doc = scorer.nextDoc(); } while (doc < end) { sub.collector.collect(doc); doc = scorer.nextDoc(); } // if (doc != NO_MORE_DOCS) { // more = true ; // } } } while (bucketTable.first != null /* || more*/); return doc = NO_MORE_DOCS; } I wanted to get rid of 'more', following all the changes I'm doing to DISI. I did it and all tests pass, but I want to double-check my understanding, since I don't know for sure if there is a test that tests my change. As far as I see it, there are two code sections: (1) iterates on bucketTable until it is exhausted (i.e., calls to nextDoc() will first consume bucketTable) and (2) iterate on all sub scorers. The second iteration populates the bucket table from all sub-scorers and reiterates, consuming bucket table again, in calls to nextDoc(). Then, at some point, all sub scorers don't have anything more to collect, and bucketTable.first is null for the last time, at which point nextDoc() returns NO_MORE_DOCS. So it looks like I can indeed get rid of 'more'. But what puzzles me is why it was there in the first place. Leaving the code as-is means that someone thought of a case where a sub-scorer will have more documents to collect, but still after the 2nd code segment the bucketTable was empty. That's why I'm not sure I can remove 'more' safely - i.e., is it possible that a sub-scorer will have more documents, however all the docs that were collected by sub.collector will not affect the bucketTable? It doesn't look like in the code. Same question for BS.score(Collector collector, int max).
        Hide
        Shai Erera added a comment -

        About ConjunctionScorer.doNext() (this also applies to FilteredQuery.advanceToCommon()). I've changed it, following Mike's proposal to this:

          private boolean doNext() throws IOException {
            int first = 0;
            lastDoc = scorers[scorers.length - 1].docID();
            Scorer firstScorer;
            while ((firstScorer = scorers[first]).docID() < lastDoc) {
              lastDoc = firstScorer.advance(lastDoc);
              first = first == scorers.length - 1 ? 0 : first + 1;
            }
            return lastDoc != NO_MORE_DOCS;
          }
        

        This indeed gets rid of 'more', the check for 'more' in the while condition and also the assignment to more. But now I think it may introduce a different inefficiency. Let's say that firstScorer.advance() returns NO_MORE_DOCS. The next scorer's docID is obviously smaller, and therefore the following call will be (first line in the 'while' body): lastDoc = firstScorer.advance(Integer.MAX_VALUE);. There are Scorers which cannot implement that efficiently.

        With 'more' this would not have happened, since the while condition would terminate before that.

        Are we sure that that's a worthwhile enhancement.

        BTW, the code for FilteredQuery looks like this:

                    while (scorerDoc != disiDoc) {
                      if (scorerDoc < disiDoc) {
                        if ((scorerDoc = scorer.advance(disiDoc)) == NO_MORE_DOCS) {
                          return NO_MORE_DOCS;
                        }
                      } else {
                        if ((disiDoc = docIdSetIterator.advance(scorerDoc)) == NO_MORE_DOCS) {
                          return NO_MORE_DOCS;
                        }
                      }
                    }
                    return scorerDoc;
        

        And I thought to change it to this:

        while (scorerDoc != disiDoc) {
          if (scorerDoc < disiDoc) {
            scorerDoc = scorer.advance(disiDoc);
          } else {
          disiDoc = docIdSetIterator.advance(scorerDoc);
          }
        }
        return scorerDoc;
        

        What do you think?

        Show
        Shai Erera added a comment - About ConjunctionScorer.doNext() (this also applies to FilteredQuery.advanceToCommon()). I've changed it, following Mike's proposal to this: private boolean doNext() throws IOException { int first = 0; lastDoc = scorers[scorers.length - 1].docID(); Scorer firstScorer; while ((firstScorer = scorers[first]).docID() < lastDoc) { lastDoc = firstScorer.advance(lastDoc); first = first == scorers.length - 1 ? 0 : first + 1; } return lastDoc != NO_MORE_DOCS; } This indeed gets rid of 'more', the check for 'more' in the while condition and also the assignment to more. But now I think it may introduce a different inefficiency. Let's say that firstScorer.advance() returns NO_MORE_DOCS. The next scorer's docID is obviously smaller, and therefore the following call will be (first line in the 'while' body): lastDoc = firstScorer.advance(Integer.MAX_VALUE); . There are Scorers which cannot implement that efficiently. With 'more' this would not have happened, since the while condition would terminate before that. Are we sure that that's a worthwhile enhancement. BTW, the code for FilteredQuery looks like this: while (scorerDoc != disiDoc) { if (scorerDoc < disiDoc) { if ((scorerDoc = scorer.advance(disiDoc)) == NO_MORE_DOCS) { return NO_MORE_DOCS; } } else { if ((disiDoc = docIdSetIterator.advance(scorerDoc)) == NO_MORE_DOCS) { return NO_MORE_DOCS; } } } return scorerDoc; And I thought to change it to this: while (scorerDoc != disiDoc) { if (scorerDoc < disiDoc) { scorerDoc = scorer.advance(disiDoc); } else { disiDoc = docIdSetIterator.advance(scorerDoc); } } return scorerDoc; What do you think?
        Hide
        Paul Elschot added a comment -

        Well, since you asked: could you try and put all conjunctions on doc ids in a single place?
        That would also prepare for things like LUCENE-1252 .

        The tests for boolean queries are quite comprehensive nowadays. So when all tests pass, don't worry too much.
        In case there is a bug left, it will be so much of corner case that it can be fixed at another issue.

        Show
        Paul Elschot added a comment - Well, since you asked: could you try and put all conjunctions on doc ids in a single place? That would also prepare for things like LUCENE-1252 . The tests for boolean queries are quite comprehensive nowadays. So when all tests pass, don't worry too much. In case there is a bug left, it will be so much of corner case that it can be fixed at another issue.
        Hide
        Michael McCandless added a comment -

        Leaving the code as-is means that someone thought of a case where a sub-scorer will have more documents to collect, but still after the 2nd code segment the bucketTable was empty.

        This would happen when none of the sub-scorers had docIDs in the
        current bucket range.

        EG say the first docID of all sub-scorers is 3000,
        and BucketTable.SIZE is 2048. On the first time through part (2), no
        sub-scorer will add to the bucket, so on finishing part (2),
        bucketTable.first will still be null but more will be true, and so the
        whole loop would (and, should) keep repeating.

        I think?

        So in fact your change should break that case.... can you add a
        breaking unit test showing this?

        Also, I wonder if this can be removed, by init'ing the sub-scorers up
        front:

        if (doc == -1) {
          doc = scorer.nextDoc();
        }
        
        Show
        Michael McCandless added a comment - Leaving the code as-is means that someone thought of a case where a sub-scorer will have more documents to collect, but still after the 2nd code segment the bucketTable was empty. This would happen when none of the sub-scorers had docIDs in the current bucket range. EG say the first docID of all sub-scorers is 3000, and BucketTable.SIZE is 2048. On the first time through part (2), no sub-scorer will add to the bucket, so on finishing part (2), bucketTable.first will still be null but more will be true, and so the whole loop would (and, should) keep repeating. I think? So in fact your change should break that case.... can you add a breaking unit test showing this? Also, I wonder if this can be removed, by init'ing the sub-scorers up front: if (doc == -1) { doc = scorer.nextDoc(); }
        Hide
        Michael McCandless added a comment -

        There are Scorers which cannot implement that efficiently.

        Which Scorers do this slowly?

        Show
        Michael McCandless added a comment - There are Scorers which cannot implement that efficiently. Which Scorers do this slowly?
        Hide
        Michael McCandless added a comment -

        The changes to FilteredQuery look good!

        Show
        Michael McCandless added a comment - The changes to FilteredQuery look good!
        Hide
        Shai Erera added a comment -

        Which Scorers do this slowly?

        TermScorer, ValueSourceQuery which uses TermDocs.skipTo ... just two examples. I'm not sure how efficient TermDocs is.

        We could state in the javadocs that Integer.MAX_VALUE may be passed and if you think you cannot impl advance(MAX_VAL) efficiently, always check if that's what was passed?

        The changes to FilteredQuery look good!

        If we agree that calling advance(MAX_VAL) is ok, I'll enable both fixes.

        Show
        Shai Erera added a comment - Which Scorers do this slowly? TermScorer, ValueSourceQuery which uses TermDocs.skipTo ... just two examples. I'm not sure how efficient TermDocs is. We could state in the javadocs that Integer.MAX_VALUE may be passed and if you think you cannot impl advance(MAX_VAL) efficiently, always check if that's what was passed? The changes to FilteredQuery look good! If we agree that calling advance(MAX_VAL) is ok, I'll enable both fixes.
        Hide
        Michael McCandless added a comment -

        I'm not sure how efficient TermDocs is.

        TermDocs.skipTo(Integer.MAX_VALUE) should be quite fast (it's using multi-level skipping)?

        We could state in the javadocs that Integer.MAX_VALUE may be passed and if you think you cannot impl advance(MAX_VAL) efficiently, always check if that's what was passed?

        Let's do that, but I don't think the scorers that use TermDocs.skipTo need the extra check.

        Show
        Michael McCandless added a comment - I'm not sure how efficient TermDocs is. TermDocs.skipTo(Integer.MAX_VALUE) should be quite fast (it's using multi-level skipping)? We could state in the javadocs that Integer.MAX_VALUE may be passed and if you think you cannot impl advance(MAX_VAL) efficiently, always check if that's what was passed? Let's do that, but I don't think the scorers that use TermDocs.skipTo need the extra check.
        Hide
        Shai Erera added a comment -

        Let's do that, but I don't think the scorers that use TermDocs.skipTo need the extra check.

        Ok will do. So it means that none of the current DISIs/Scorers should do this check? I'll double-check to make sure and post back if I find something suspicious.

        So in fact your change should break that case.... can you add a breaking unit test showing this?

        I will look into it.

        Also, I wonder if this can be removed, by init'ing the sub-scorers up front

        Will check this too.

        Show
        Shai Erera added a comment - Let's do that, but I don't think the scorers that use TermDocs.skipTo need the extra check. Ok will do. So it means that none of the current DISIs/Scorers should do this check? I'll double-check to make sure and post back if I find something suspicious. So in fact your change should break that case.... can you add a breaking unit test showing this? I will look into it. Also, I wonder if this can be removed, by init'ing the sub-scorers up front Will check this too.
        Hide
        Shai Erera added a comment -

        I added testEmptyBucketWithMoreDocs in TestBooleanScorer which creates a BooleanScorer that returns 3000 the first time nextDoc is called (or advance with target <= 3000) and then NO_MORE_DOCS. It asserts that calling nextDoc() once returns 3000 and the second time returns NO_MORE_DOCS.

        Indeed, with the change I've made above it fails. So it's good that this test is there now. It makes BS/BS2 one more testcase-proof. That means I cannot remove 'more' ...

        The tests for boolean queries are quite comprehensive nowadays. So when all tests pass, don't worry too much.

        Paul, I guess we should always worry . Anyway, now we can say that sentence with a bit more confidence .

        Also, I wonder if this can be removed, by init'ing the sub-scorers up front.

        How about if I call nextDoc() when addScorer is called? I tried it and all tests pass, which at least means there is no test case that fails (see above comment). If I do this, I can also check if the result is NO_MORE_DOCS and don't add it to the sub scorers list in the first place.

        What do you think?

        BTW, funny thing just happened - turns out we deprecated all of DISIs methods, but not DISI itself. This is actually good since it allows us to keep DISI and not deprecate Scorer and all the public DISIs that extend DISI. And it saves us the effort of finding alternative names ... Just a thought.

        Another thought of an optimization. Somewhere up this issue, we discussed adding start() to DISI, just to get rid of firstTime in DisjunctionMaxScorer and ConjunctionScorer. At least for DMS, I think I can remove add(Scorer) and add to its ctor an array of Scorer[]. DMS is used by DMQ, and the number of scorers is knows in advance.
        Also, DMS can be changed quite a bit, not using ArrayList but an array. ArrayList get/set methods do range checks every time you call them, and we're calling them quite a lot.

        Show
        Shai Erera added a comment - I added testEmptyBucketWithMoreDocs in TestBooleanScorer which creates a BooleanScorer that returns 3000 the first time nextDoc is called (or advance with target <= 3000) and then NO_MORE_DOCS. It asserts that calling nextDoc() once returns 3000 and the second time returns NO_MORE_DOCS. Indeed, with the change I've made above it fails. So it's good that this test is there now. It makes BS/BS2 one more testcase-proof. That means I cannot remove 'more' ... The tests for boolean queries are quite comprehensive nowadays. So when all tests pass, don't worry too much. Paul, I guess we should always worry . Anyway, now we can say that sentence with a bit more confidence . Also, I wonder if this can be removed, by init'ing the sub-scorers up front. How about if I call nextDoc() when addScorer is called? I tried it and all tests pass, which at least means there is no test case that fails (see above comment). If I do this, I can also check if the result is NO_MORE_DOCS and don't add it to the sub scorers list in the first place. What do you think? BTW, funny thing just happened - turns out we deprecated all of DISIs methods, but not DISI itself. This is actually good since it allows us to keep DISI and not deprecate Scorer and all the public DISIs that extend DISI. And it saves us the effort of finding alternative names ... Just a thought. Another thought of an optimization. Somewhere up this issue, we discussed adding start() to DISI, just to get rid of firstTime in DisjunctionMaxScorer and ConjunctionScorer. At least for DMS, I think I can remove add(Scorer) and add to its ctor an array of Scorer[]. DMS is used by DMQ, and the number of scorers is knows in advance. Also, DMS can be changed quite a bit, not using ArrayList but an array. ArrayList get/set methods do range checks every time you call them, and we're calling them quite a lot.
        Hide
        Shai Erera added a comment -

        Also, since BS.add() is called by BS2 only, and BS is package-private and instantiated by BS2 only, I can remove add, and pass to BS ctor two iterators (for prohibited and optional). That will allow us to compute coordFactor up front and remove the check from score() and score(Collector, int).

        What do you think?

        Show
        Shai Erera added a comment - Also, since BS.add() is called by BS2 only, and BS is package-private and instantiated by BS2 only, I can remove add, and pass to BS ctor two iterators (for prohibited and optional). That will allow us to compute coordFactor up front and remove the check from score() and score(Collector, int). What do you think?
        Hide
        Paul Elschot added a comment - - edited

        I'd also prefer to have constructors with for example a List argument for the subscorers, and use an array internally. This is easily done when the parser allows postorder query construction, and all reasonable parsers can do that.
        When all subscorers are given in the constructor, most (or even all) of the checks on firstTime private booleans in the various scorers can be removed.

        One minor disadvantage of removing such firstTime booleans is that during scorer construction the index may have to be accessed because some constructors will use the next() method on their subscorers.
        So, when BooleanScorer2 was added, I did not like the really the add() method and firstTime booleans either, but I left them in because in that way the next() method would not be called on subscorers at construction time.
        Iirc I used constructor List arguments in the surround package, and that's still my preference.

        Show
        Paul Elschot added a comment - - edited I'd also prefer to have constructors with for example a List argument for the subscorers, and use an array internally. This is easily done when the parser allows postorder query construction, and all reasonable parsers can do that. When all subscorers are given in the constructor, most (or even all) of the checks on firstTime private booleans in the various scorers can be removed. One minor disadvantage of removing such firstTime booleans is that during scorer construction the index may have to be accessed because some constructors will use the next() method on their subscorers. So, when BooleanScorer2 was added, I did not like the really the add() method and firstTime booleans either, but I left them in because in that way the next() method would not be called on subscorers at construction time. Iirc I used constructor List arguments in the surround package, and that's still my preference.
        Hide
        Shai Erera added a comment -

        One minor disadvantage of removing such firstTime booleans is that during scorer construction the index may have to be accessed because some constructors will use the next() method on their subscorers.

        I don't see any difference between instantiating BS and immediately after that calling add several times to instantiating BS with the scorers. Accessing the index, to me, seems to be identical.

        In fact, by removing add() and passing the scorers in the ctor we can also get rid of redundant code in add(). Currently, it checks if the scorer is required/prohibited/both, but both does not happen in reality and in each add only required or prohibited is set.

        In general I think we should review such scorers and make a decision on a case-by-case basis. For BS, I'd like to change it since I don't see any back-compat issue, and it should clear the code a bit. I'll go ahead and do it, and if I run into any obstacles I'll post back.

        Show
        Shai Erera added a comment - One minor disadvantage of removing such firstTime booleans is that during scorer construction the index may have to be accessed because some constructors will use the next() method on their subscorers. I don't see any difference between instantiating BS and immediately after that calling add several times to instantiating BS with the scorers. Accessing the index, to me, seems to be identical. In fact, by removing add() and passing the scorers in the ctor we can also get rid of redundant code in add(). Currently, it checks if the scorer is required/prohibited/both, but both does not happen in reality and in each add only required or prohibited is set. In general I think we should review such scorers and make a decision on a case-by-case basis. For BS, I'd like to change it since I don't see any back-compat issue, and it should clear the code a bit. I'll go ahead and do it, and if I run into any obstacles I'll post back.
        Hide
        Michael McCandless added a comment -

        I added testEmptyBucketWithMoreDocs in TestBooleanScorer which creates a BooleanScorer that returns 3000 the first time nextDoc is called

        Excellent!

        How about if I call nextDoc() when addScorer is called? I tried it and all tests pass, which at least means there is no test case that fails (see above comment). If I do this, I can also check if the result is NO_MORE_DOCS and don't add it to the sub scorers list in the first place.

        That sounds reasonable.

        Another thought of an optimization. Somewhere up this issue, we discussed adding start() to DISI, just to get rid of firstTime in DisjunctionMaxScorer and ConjunctionScorer.

        I think having doc() return -1 before next/advance have been called
        was also needed (or maybe just helpful?) for this.

        Also, thinking more on this, isn't DISI.start() redundant? (Since
        such init'ing could be done in Weight.scorer(), when the Scorer is
        created)? We're not allowed to reuse a DISI, so...

        Oh, I see: it's not redundant for scorers that do incremental
        construction (BS, BS2). But I think we should fix such cases to
        accept all sub-queries up front, then DISI.start() is redundant?

        Also, DMS can be changed quite a bit, not using ArrayList but an array.

        These sound like good optimizations too.

        Also, since BS.add() is called by BS2 only, and BS is package-private and instantiated by BS2 only, I can remove add, and pass to BS ctor two iterators (for prohibited and optional). That will allow us to compute coordFactor up front and remove the check from score() and score(Collector, int).

        This sounds good, as well as switching BS2 to take all its sub-queries
        up front.

        Show
        Michael McCandless added a comment - I added testEmptyBucketWithMoreDocs in TestBooleanScorer which creates a BooleanScorer that returns 3000 the first time nextDoc is called Excellent! How about if I call nextDoc() when addScorer is called? I tried it and all tests pass, which at least means there is no test case that fails (see above comment). If I do this, I can also check if the result is NO_MORE_DOCS and don't add it to the sub scorers list in the first place. That sounds reasonable. Another thought of an optimization. Somewhere up this issue, we discussed adding start() to DISI, just to get rid of firstTime in DisjunctionMaxScorer and ConjunctionScorer. I think having doc() return -1 before next/advance have been called was also needed (or maybe just helpful?) for this. Also, thinking more on this, isn't DISI.start() redundant? (Since such init'ing could be done in Weight.scorer(), when the Scorer is created)? We're not allowed to reuse a DISI, so... Oh, I see: it's not redundant for scorers that do incremental construction (BS, BS2). But I think we should fix such cases to accept all sub-queries up front, then DISI.start() is redundant? Also, DMS can be changed quite a bit, not using ArrayList but an array. These sound like good optimizations too. Also, since BS.add() is called by BS2 only, and BS is package-private and instantiated by BS2 only, I can remove add, and pass to BS ctor two iterators (for prohibited and optional). That will allow us to compute coordFactor up front and remove the check from score() and score(Collector, int). This sounds good, as well as switching BS2 to take all its sub-queries up front.
        Hide
        Shai Erera added a comment -

        I prefer to defer DISI.start/finish until it's actually needed. Maybe even not part of this issue. For now, I'll finish the optimizations we've discussed and post a patch.

        BTW Mike, docID()'s contract states that is should return -1 if nextDoc()/advance() weren't called.

        Also, I've done the changes to BS and the code looks much more simpler. Working on DMS now.

        Show
        Shai Erera added a comment - I prefer to defer DISI.start/finish until it's actually needed. Maybe even not part of this issue. For now, I'll finish the optimizations we've discussed and post a patch. BTW Mike, docID()'s contract states that is should return -1 if nextDoc()/advance() weren't called. Also, I've done the changes to BS and the code looks much more simpler. Working on DMS now.
        Hide
        Michael McCandless added a comment -

        I prefer to defer DISI.start/finish until it's actually needed.

        I agree. I'm thinking at this point that neither should be needed. start() should be done on creating the Scorer, and finish() should not be needed because Scorers ought to be lightweight enough to not need closeable resources. If one makes a custom Scorer one can always close it outside of Lucene.

        BTW Mike, docID()'s contract states that is should return -1 if nextDoc()/advance() weren't called.

        Right, so this should mean the "firstTime" logic should be removable from all core Scorers.

        Also, I've done the changes to BS and the code looks much more simpler. Working on DMS now.

        Fabulous Can't wait to see it!

        Show
        Michael McCandless added a comment - I prefer to defer DISI.start/finish until it's actually needed. I agree. I'm thinking at this point that neither should be needed. start() should be done on creating the Scorer, and finish() should not be needed because Scorers ought to be lightweight enough to not need closeable resources. If one makes a custom Scorer one can always close it outside of Lucene. BTW Mike, docID()'s contract states that is should return -1 if nextDoc()/advance() weren't called. Right, so this should mean the "firstTime" logic should be removable from all core Scorers. Also, I've done the changes to BS and the code looks much more simpler. Working on DMS now. Fabulous Can't wait to see it!
        Hide
        Shai Erera added a comment -

        Patch includes the new docID on DISI as well as a bunch of improvements to scorers, such as BooleanScorer, DisjunctionMaxScorer and ConjunctionScorer.

        All tests pass.

        BTW, I wasn't able to completely get rid of firstTime in ConjunctionScorer, even though I moved everything to the ctor. Maybe I'm missing something ...

        Show
        Shai Erera added a comment - Patch includes the new docID on DISI as well as a bunch of improvements to scorers, such as BooleanScorer, DisjunctionMaxScorer and ConjunctionScorer. All tests pass. BTW, I wasn't able to completely get rid of firstTime in ConjunctionScorer, even though I moved everything to the ctor. Maybe I'm missing something ...
        Hide
        Michael McCandless added a comment -

        I'm seeing some patch hunks failing – can you svn up to current trunk & repost the patch? Thanks.

        Show
        Michael McCandless added a comment - I'm seeing some patch hunks failing – can you svn up to current trunk & repost the patch? Thanks.
        Hide
        Shai Erera added a comment -

        Can you try now?

        Show
        Shai Erera added a comment - Can you try now?
        Hide
        Michael McCandless added a comment -

        Much better – I'll look at the patch. Thanks.

        Show
        Michael McCandless added a comment - Much better – I'll look at the patch. Thanks.
        Hide
        Michael McCandless added a comment -

        Shai can I make some small tweaks to the patch & post another iteration?

        Show
        Michael McCandless added a comment - Shai can I make some small tweaks to the patch & post another iteration?
        Hide
        Shai Erera added a comment -

        Did you seriously ask that?

        Of course !

        Show
        Shai Erera added a comment - Did you seriously ask that? Of course !
        Hide
        Michael McCandless added a comment -

        Well I don't want to have both of us working on it and then have to merge.... we need a better fast switching/assigned to mechanism...

        Show
        Michael McCandless added a comment - Well I don't want to have both of us working on it and then have to merge.... we need a better fast switching/assigned to mechanism...
        Hide
        Shai Erera added a comment -

        It's ok. I've been staring at this code for so long, it definitely could use fresh eyes.

        Besides, I've finished all what I planned to do. So I'll just wait for your updated patch, review it, and if I'll want to add anything I'll let you know.

        One thing I noticed from the other issue we've both posted patches to, is that the order of the classes in the patch don't match between us, so it's harder to compare your patch with mine. But if those are really tweaks, it should show up more easily.

        Show
        Shai Erera added a comment - It's ok. I've been staring at this code for so long, it definitely could use fresh eyes. Besides, I've finished all what I planned to do. So I'll just wait for your updated patch, review it, and if I'll want to add anything I'll let you know. One thing I noticed from the other issue we've both posted patches to, is that the order of the classes in the patch don't match between us, so it's harder to compare your patch with mine. But if those are really tweaks , it should show up more easily.
        Hide
        Michael McCandless added a comment -

        Patch is looking good!

        OK I attached new patch... note that patch applies to latest
        back-compat tag, so you need to run "ant download-tag" before applying
        it. Comments/questions:

        • Init'd doc = -1 in BooleanScorer
        • Can you add an "assert scorer.docID() == -1" in IndexSearcher
          right when it gets the scorer? Then we can tease out where we
          failed to init to -1.
        • Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS
          after the DISI is created? EG we do this in ConjunctionScorer.
          (Or, maybe, any docID that's < the first real docID...hmmm.)
        • Made DocsWriter.applyDeletes a bit more efficient – no need to
          check for the sentinel (it's "congruent" w/ the existing check).
          Though it did require an upcast to long, so if we do any add
          arithmetic w/ doc where doc could be NO_MORE_DOCS we must be
          careful to upcast.
        • I think we can remove the firstTime from ReqExclScorer, since
          docID() is defined to return -1 at start?
        • Made BS.coordFactors final
        • Fixed "Shai Erera vs. Mike McCandless" -> "Shai Erera
          via Mike McCandless"
        • The addition of this:
              if (docNr == -1) { // should be if nextDoc() was not called yet
                docNr = nextDoc();
              }
          

          in BooleanScorer2.score(Collector, int) makes me a bit nervous –
          I think one is supposed to have called nextDoc prior to calling
          this? Also, this was added to the same method in Scorer.

        • BS2.nextDoc is still needing to check if it's supposed to call
          initCountingSumScorer? Can we do this in ctor?
        • Actually, you did get rid of firstTime from ConjunctionScorer?
          Maybe you meant BS2? (Oh, or maybe you meant the addition of
          "} else if (lastDoc == -1) {" in ConjunctionScorer? So... if we
          had a way to do that sorting (in the ctor) without invoking
          nextDoc on each sub-scorer, then you could eliminate that extra
          check right?
        • Put some "nocommit" questions in the patch
        Show
        Michael McCandless added a comment - Patch is looking good! OK I attached new patch... note that patch applies to latest back-compat tag, so you need to run "ant download-tag" before applying it. Comments/questions: Init'd doc = -1 in BooleanScorer Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer? Then we can tease out where we failed to init to -1. Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS after the DISI is created? EG we do this in ConjunctionScorer. (Or, maybe, any docID that's < the first real docID...hmmm.) Made DocsWriter.applyDeletes a bit more efficient – no need to check for the sentinel (it's "congruent" w/ the existing check). Though it did require an upcast to long, so if we do any add arithmetic w/ doc where doc could be NO_MORE_DOCS we must be careful to upcast. I think we can remove the firstTime from ReqExclScorer, since docID() is defined to return -1 at start? Made BS.coordFactors final Fixed "Shai Erera vs. Mike McCandless" -> "Shai Erera via Mike McCandless" The addition of this: if (docNr == -1) { // should be if nextDoc() was not called yet docNr = nextDoc(); } in BooleanScorer2.score(Collector, int) makes me a bit nervous – I think one is supposed to have called nextDoc prior to calling this? Also, this was added to the same method in Scorer. BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor? Actually, you did get rid of firstTime from ConjunctionScorer? Maybe you meant BS2? (Oh, or maybe you meant the addition of "} else if (lastDoc == -1) {" in ConjunctionScorer? So... if we had a way to do that sorting (in the ctor) without invoking nextDoc on each sub-scorer, then you could eliminate that extra check right? Put some "nocommit" questions in the patch
        Hide
        Shai Erera added a comment -

        Fixed "Shai Erera vs. Mike McCandless" -> "Shai Erera via Mike McCandless"

        ha ha ha . Sorry ...

        I'll take a look at the comments a bit later. Regarding ConjunctionScorer, you're right. If we had a way to sort on the scorers w/o invoking nextDoc(), we could remove that check from nextDoc().

        Show
        Shai Erera added a comment - Fixed "Shai Erera vs. Mike McCandless" -> "Shai Erera via Mike McCandless" ha ha ha . Sorry ... I'll take a look at the comments a bit later. Regarding ConjunctionScorer, you're right. If we had a way to sort on the scorers w/o invoking nextDoc(), we could remove that check from nextDoc().
        Hide
        Michael McCandless added a comment -

        It's OK.

        Regarding ConjunctionScorer, you're right. If we had a way to sort on the scorers w/o invoking nextDoc(), we could remove that check from nextDoc()

        OK we may need to hold off on that until we add something like approxCount() to Scorer, or something the measures approximate cost.

        Show
        Michael McCandless added a comment - It's OK. Regarding ConjunctionScorer, you're right. If we had a way to sort on the scorers w/o invoking nextDoc(), we could remove that check from nextDoc() OK we may need to hold off on that until we add something like approxCount() to Scorer, or something the measures approximate cost.
        Hide
        Shai Erera added a comment -

        Regarding the nocommit in ConjunctionScorer's ctor - I think the problem with moving doNext() to the end of the ctor is the order of the scorers that will be iterated on. I.e., currently the scorers are first advanced to their first doc, then sorted based on their docID, then advanced to the doc ID that all agree on, in the order of the sort.

        If you move doNext() to the end of the method, then the scorers are sorted based on their docID, and then immediately re-sorted (based on their sparseness, which is a heuristic applied based on their first docID).

        If I understand correctly, the problem with calling doNext() after the re-sort is this: assume you have 5 scorers, whose first docs are 1, 2, 3, 5, 5 respectively. Sorting leaves the array as is (or changes it to be in that order). If you call doNext() after array.sort, then you advance all the first scorers to 5 (or a larger doc ID they all agree on). However, if you do the re-sort, the order will be 5, 3, 2, 1, 5 and then if you call doNext(), it will stop immediately, since the first scorer's docs equals the last one. So you break an invariant, that after calling doNext() all scorers are on the same doc ID.

        That's at least how I understand it. I hope I didn't write too much to explain the obvious (to others).

        So I'll delete the nocommit tag.

        Regarding nocommit in ReqExclScorer, you're right. I'll remove firstTime entirely.

        Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer?

        Done.

        BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor?

        It can, if I delete add() and do all the work in the ctor. I'll need to pass 3 arrays (Scorer[], boolean[] /required/, boolean[] /prohibited/) though, or pass the list of Weights, Clauses and IndexReader. I'm not too fan of either. I check which is the lesser evil.

        BooleanScorer2.score(Collector, int) makes me a bit nervous

        I didn't think it's such a big deal since it's one 'if' before scoring starts. It's not like it's called for every score(). Do you think we should resolve it?

        Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS after the DISI is created? EG we do this in ConjunctionScorer. (Or, maybe, any docID that's < the first real docID...hmmm.)

        I do not fully follow you. By allowing to return -1, we already allow any DISI to return a doc ID that's < the first real doc ID. And allowing to return NO_MORE_DOCS looks strange to me .. I mean imagine a code which creates a DISI and calls doc() and gets back NO_MORE_DOCS .. that'd be strange, won't it?

        Show
        Shai Erera added a comment - Regarding the nocommit in ConjunctionScorer's ctor - I think the problem with moving doNext() to the end of the ctor is the order of the scorers that will be iterated on. I.e., currently the scorers are first advanced to their first doc, then sorted based on their docID, then advanced to the doc ID that all agree on, in the order of the sort. If you move doNext() to the end of the method, then the scorers are sorted based on their docID, and then immediately re-sorted (based on their sparseness, which is a heuristic applied based on their first docID). If I understand correctly, the problem with calling doNext() after the re-sort is this: assume you have 5 scorers, whose first docs are 1, 2, 3, 5, 5 respectively. Sorting leaves the array as is (or changes it to be in that order). If you call doNext() after array.sort, then you advance all the first scorers to 5 (or a larger doc ID they all agree on). However, if you do the re-sort, the order will be 5, 3, 2, 1, 5 and then if you call doNext(), it will stop immediately, since the first scorer's docs equals the last one. So you break an invariant, that after calling doNext() all scorers are on the same doc ID. That's at least how I understand it. I hope I didn't write too much to explain the obvious (to others). So I'll delete the nocommit tag. Regarding nocommit in ReqExclScorer, you're right. I'll remove firstTime entirely. Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer? Done. BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor? It can, if I delete add() and do all the work in the ctor. I'll need to pass 3 arrays (Scorer[], boolean[] / required /, boolean[] / prohibited /) though, or pass the list of Weights, Clauses and IndexReader. I'm not too fan of either. I check which is the lesser evil. BooleanScorer2.score(Collector, int) makes me a bit nervous I didn't think it's such a big deal since it's one 'if' before scoring starts. It's not like it's called for every score(). Do you think we should resolve it? Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS after the DISI is created? EG we do this in ConjunctionScorer. (Or, maybe, any docID that's < the first real docID...hmmm.) I do not fully follow you. By allowing to return -1, we already allow any DISI to return a doc ID that's < the first real doc ID. And allowing to return NO_MORE_DOCS looks strange to me .. I mean imagine a code which creates a DISI and calls doc() and gets back NO_MORE_DOCS .. that'd be strange, won't it?
        Hide
        Michael McCandless added a comment -

        Regarding the nocommit in ConjunctionScorer's ctor - I think the problem with moving doNext() to the end of the ctor is the order of the scorers that will be iterated on. I.e., currently the scorers are first advanced to their first doc, then sorted based on their docID, then advanced to the doc ID that all agree on, in the order of the sort.
        If you move doNext() to the end of the method, then the scorers are sorted based on their docID, and then immediately re-sorted (based on their sparseness, which is a heuristic applied based on their first docID).

        If I understand correctly, the problem with calling doNext() after the re-sort is this: assume you have 5 scorers, whose first docs are 1, 2, 3, 5, 5 respectively. Sorting leaves the array as is (or changes it to be in that order). If you call doNext() after array.sort, then you advance all the first scorers to 5 (or a larger doc ID they all agree on). However, if you do the re-sort, the order will be 5, 3, 2, 1, 5 and then if you call doNext(), it will stop immediately, since the first scorer's docs equals the last one. So you break an invariant, that after calling doNext() all scorers are on the same doc ID.

        That's at least how I understand it. I hope I didn't write too much to explain the obvious (to others).

        So I'll delete the nocommit tag.

        Ahh – good explanation! Phew. This must be what leads to the two
        test failures. So leave it where it is (maybe add a comment
        explaining it cannot be moved down, lest anyone else gets tempted).

        > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer?

        Done.

        OK, except we may delete it depending on the relaxing (below)...

        > BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor?

        It can, if I delete add() and do all the work in the ctor.

        I think we should (lacking a DISI.start).

        I'll need to pass 3 arrays (Scorer[], boolean[] /required/, boolean[] /prohibited/) though, or pass the list of Weights, Clauses and IndexReader. I'm not too fan of either. I check which is the lesser evil.

        Hmm – I don't have a stronger lesser-of-evils preference. I suppose,
        instead, we could add a "start" only to BS2, which BQ would call once
        it's done adding? Or, simply call initCountingSumScorer from BQ and
        don't do the check per nextDoc/advance.

        > BooleanScorer2.score(Collector, int) makes me a bit nervous

        I didn't think it's such a big deal since it's one 'if' before scoring starts. It's not like it's called for every score(). Do you think we should resolve it?

        What specifically made me nervous that it was even necessary to add
        this (having to conditionally next wasn't needed before) in the first
        place. If you remove it, does something actually break? Like what
        caused it to be added? (Because I want to go explore/understand
        that cause).

        > Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS after the DISI is created? EG we do this in ConjunctionScorer. (Or, maybe, any docID that's < the first real docID...hmmm.)

        I do not fully follow you. By allowing to return -1, we already allow any DISI to return a doc ID that's < the first real doc ID. And allowing to return NO_MORE_DOCS looks strange to me .. I mean imagine a code which creates a DISI and calls doc() and gets back NO_MORE_DOCS .. that'd be strange, won't it?

        I'm thinking it's OK to call docID before next, and that all we
        require is that call return a docID less than its first real docID.

        And, if you know up-front you have no docs, returning NO_MORE_DOCS up
        front is also OK.

        Ie this is a relaxation over "you must return -1 before nextDoc has
        been called".

        (EG I think the last patch would return NO_MORE_DOCS from docID() in
        ConjunctionScorer if it determines in ctor that no docs match).

        Show
        Michael McCandless added a comment - Regarding the nocommit in ConjunctionScorer's ctor - I think the problem with moving doNext() to the end of the ctor is the order of the scorers that will be iterated on. I.e., currently the scorers are first advanced to their first doc, then sorted based on their docID, then advanced to the doc ID that all agree on, in the order of the sort. If you move doNext() to the end of the method, then the scorers are sorted based on their docID, and then immediately re-sorted (based on their sparseness, which is a heuristic applied based on their first docID). If I understand correctly, the problem with calling doNext() after the re-sort is this: assume you have 5 scorers, whose first docs are 1, 2, 3, 5, 5 respectively. Sorting leaves the array as is (or changes it to be in that order). If you call doNext() after array.sort, then you advance all the first scorers to 5 (or a larger doc ID they all agree on). However, if you do the re-sort, the order will be 5, 3, 2, 1, 5 and then if you call doNext(), it will stop immediately, since the first scorer's docs equals the last one. So you break an invariant, that after calling doNext() all scorers are on the same doc ID. That's at least how I understand it. I hope I didn't write too much to explain the obvious (to others). So I'll delete the nocommit tag. Ahh – good explanation! Phew. This must be what leads to the two test failures. So leave it where it is (maybe add a comment explaining it cannot be moved down, lest anyone else gets tempted). > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer? Done. OK, except we may delete it depending on the relaxing (below)... > BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor? It can, if I delete add() and do all the work in the ctor. I think we should (lacking a DISI.start). I'll need to pass 3 arrays (Scorer[], boolean[] /required/, boolean[] /prohibited/) though, or pass the list of Weights, Clauses and IndexReader. I'm not too fan of either. I check which is the lesser evil. Hmm – I don't have a stronger lesser-of-evils preference. I suppose, instead, we could add a "start" only to BS2, which BQ would call once it's done adding? Or, simply call initCountingSumScorer from BQ and don't do the check per nextDoc/advance. > BooleanScorer2.score(Collector, int) makes me a bit nervous I didn't think it's such a big deal since it's one 'if' before scoring starts. It's not like it's called for every score(). Do you think we should resolve it? What specifically made me nervous that it was even necessary to add this (having to conditionally next wasn't needed before) in the first place. If you remove it, does something actually break? Like what caused it to be added? (Because I want to go explore/understand that cause). > Hmm... maybe we permit DISI.docID() to also return NO_MORE_DOCS after the DISI is created? EG we do this in ConjunctionScorer. (Or, maybe, any docID that's < the first real docID...hmmm.) I do not fully follow you. By allowing to return -1, we already allow any DISI to return a doc ID that's < the first real doc ID. And allowing to return NO_MORE_DOCS looks strange to me .. I mean imagine a code which creates a DISI and calls doc() and gets back NO_MORE_DOCS .. that'd be strange, won't it? I'm thinking it's OK to call docID before next, and that all we require is that call return a docID less than its first real docID. And, if you know up-front you have no docs, returning NO_MORE_DOCS up front is also OK. Ie this is a relaxation over "you must return -1 before nextDoc has been called". (EG I think the last patch would return NO_MORE_DOCS from docID() in ConjunctionScorer if it determines in ctor that no docs match).
        Hide
        Shai Erera added a comment -

        BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor?

        After I made add() private, created a ScorerClauseWrapper (in BQ) and passed to BS2 a list of SCW (to pass in one call Scorer, req, prohib), I couldn't still call initCountingSumScorer in the ctor. Then it reminded me that we've had this discussion before - it's related to being able to ask for topScorer() or not. Reason is, if I call initCount...() in the ctor, it advances the sub scorers. If the scorer is then used as a topScorer, they may be advanced again, if BS is used.

        So I thought, let's not call nextDoc() in BS's ctor, but that leads to other problems, since the scorers passed may have called nextDoc() themselves (DisjunctionSumScorer) or may not (ReqExclScorer).

        The decision back then was to leave it as-is, and handle it in LUCENE-1630, when we'll be able to ask for topScorer. What do you think?

        Like what caused it to be added?

        Two lines below it, where it will be sent to Collector for collection, since it's < max. This hits AIOOBE for some Collectors.

        I think the last patch would return NO_MORE_DOCS from docID() in ConjunctionScorer if it determines in ctor that no docs match

        Actually, now that you write it I do notice I broke that invariant in CS. If there is even one scorer that doesn't have any docs, lastDoc is already set to NO_MORE_DOCS and calling docID before nextDoc will return NO_MORE_DOCS and not -1.

        So I think it's a fair relaxation. Just to be clear - this is just a relaxation you're talking about right? This shouldn't affect any of the existing scorers.

        So if we're on the same page, I'll document that relaxation and remove the assert call I've added to IndexSearcher. But I don't think this should actually change scorers.

        Show
        Shai Erera added a comment - BS2.nextDoc is still needing to check if it's supposed to call initCountingSumScorer? Can we do this in ctor? After I made add() private, created a ScorerClauseWrapper (in BQ) and passed to BS2 a list of SCW (to pass in one call Scorer, req, prohib), I couldn't still call initCountingSumScorer in the ctor. Then it reminded me that we've had this discussion before - it's related to being able to ask for topScorer() or not. Reason is, if I call initCount...() in the ctor, it advances the sub scorers. If the scorer is then used as a topScorer, they may be advanced again, if BS is used. So I thought, let's not call nextDoc() in BS's ctor, but that leads to other problems, since the scorers passed may have called nextDoc() themselves (DisjunctionSumScorer) or may not (ReqExclScorer). The decision back then was to leave it as-is, and handle it in LUCENE-1630 , when we'll be able to ask for topScorer. What do you think? Like what caused it to be added? Two lines below it, where it will be sent to Collector for collection, since it's < max. This hits AIOOBE for some Collectors. I think the last patch would return NO_MORE_DOCS from docID() in ConjunctionScorer if it determines in ctor that no docs match Actually, now that you write it I do notice I broke that invariant in CS. If there is even one scorer that doesn't have any docs, lastDoc is already set to NO_MORE_DOCS and calling docID before nextDoc will return NO_MORE_DOCS and not -1. So I think it's a fair relaxation. Just to be clear - this is just a relaxation you're talking about right? This shouldn't affect any of the existing scorers. So if we're on the same page, I'll document that relaxation and remove the assert call I've added to IndexSearcher. But I don't think this should actually change scorers.
        Hide
        Shai Erera added a comment -

        Deprecated SpanScorer.firstTime and removed the use of it. (in general, I think that SpanScorer should be reviewed, since it's a public class that's documented that it's for inheritance), but I'm not sure all methods should be open for inheritance.
        Also, we may want to add nextDoc and advance() to Spans. But that's a different issue.

        I noticed I accidentally changed the signature of setFreqCurrentDoc, which is protected. So I reverted the change and added JustCompileSpanScorer which overrides that method, so we'll discover that safely in the future (test-tag still passes).

        I also changed the documentation of docID, stating the relaxed policy (in DISI and CHANGES).

        Besides, I've fixed all of your other comments, except for BS2 initCountingSumScorer which I don't think we can do yet.

        In general, can we do any other optimization in a different issue? This one already contains a lot of changes, and was opened primarily for introducing the new methods. We have LUCENE-1652 for further optimizations, even though we've covered most (or all) of them here, so perhaps we should cancel it?

        Show
        Shai Erera added a comment - Deprecated SpanScorer.firstTime and removed the use of it. (in general, I think that SpanScorer should be reviewed, since it's a public class that's documented that it's for inheritance), but I'm not sure all methods should be open for inheritance. Also, we may want to add nextDoc and advance() to Spans. But that's a different issue. I noticed I accidentally changed the signature of setFreqCurrentDoc, which is protected. So I reverted the change and added JustCompileSpanScorer which overrides that method, so we'll discover that safely in the future (test-tag still passes). I also changed the documentation of docID, stating the relaxed policy (in DISI and CHANGES). Besides, I've fixed all of your other comments, except for BS2 initCountingSumScorer which I don't think we can do yet. In general, can we do any other optimization in a different issue? This one already contains a lot of changes, and was opened primarily for introducing the new methods. We have LUCENE-1652 for further optimizations, even though we've covered most (or all) of them here, so perhaps we should cancel it?
        Hide
        Michael McCandless added a comment -

        After I made add() private, created a ScorerClauseWrapper (in BQ) and passed to BS2 a list of SCW (to pass in one call Scorer, req, prohib), I couldn't still call initCountingSumScorer in the ctor. Then it reminded me that we've had this discussion before - it's related to being able to ask for topScorer() or not. Reason is, if I call initCount...() in the ctor, it advances the sub scorers. If the scorer is then used as a topScorer, they may be advanced again, if BS is used.

        So I thought, let's not call nextDoc() in BS's ctor, but that leads to other problems, since the scorers passed may have called nextDoc() themselves (DisjunctionSumScorer) or may not (ReqExclScorer).

        The decision back then was to leave it as-is, and handle it in LUCENE-1630, when we'll be able to ask for topScorer. What do you think?

        Sigh. I wonder if you could record that next had been called, in the ScorerClauseWrapper, and then BS wouldn't re-do the next in its ctor? That added if would only be in the ctor.

        But, yeah I agree: let's wind down this issue and push this (and other) optimizations into LUCENE-1652.

        I noticed I accidentally changed the signature of setFreqCurrentDoc, which is protected. So I reverted the change and added JustCompileSpanScorer which overrides that method, so we'll discover that safely in the future (test-tag still passes).

        Woops, OK good.

        I'll have a look through the current patch. I think we are close!

        Show
        Michael McCandless added a comment - After I made add() private, created a ScorerClauseWrapper (in BQ) and passed to BS2 a list of SCW (to pass in one call Scorer, req, prohib), I couldn't still call initCountingSumScorer in the ctor. Then it reminded me that we've had this discussion before - it's related to being able to ask for topScorer() or not. Reason is, if I call initCount...() in the ctor, it advances the sub scorers. If the scorer is then used as a topScorer, they may be advanced again, if BS is used. So I thought, let's not call nextDoc() in BS's ctor, but that leads to other problems, since the scorers passed may have called nextDoc() themselves (DisjunctionSumScorer) or may not (ReqExclScorer). The decision back then was to leave it as-is, and handle it in LUCENE-1630 , when we'll be able to ask for topScorer. What do you think? Sigh. I wonder if you could record that next had been called, in the ScorerClauseWrapper, and then BS wouldn't re-do the next in its ctor? That added if would only be in the ctor. But, yeah I agree: let's wind down this issue and push this (and other) optimizations into LUCENE-1652 . I noticed I accidentally changed the signature of setFreqCurrentDoc, which is protected. So I reverted the change and added JustCompileSpanScorer which overrides that method, so we'll discover that safely in the future (test-tag still passes). Woops, OK good. I'll have a look through the current patch. I think we are close!
        Hide
        Shai Erera added a comment -

        Mike, I think we may have relaxed the policy of docID() a bit too much. IMO, it should be clear what this method returns, and saying that it should return any doc ID which is smaller than the first available, or NO_MORE_DOCS if it knows there are not docs in the set, is problematic.

        Consider the code in BS2 which checks if it should call nextDoc() before the while loop (the score(Collector, int) method). If a DISI decided to return '2' simply because the first doc available is '3', then we may incorrectly collect '2'.

        So I say we relax the policy in this way: "an implementation can return -1 if it knows the iterator is not exhausted, or NO_MORE_DOCS otherwise".

        Show
        Shai Erera added a comment - Mike, I think we may have relaxed the policy of docID() a bit too much. IMO, it should be clear what this method returns, and saying that it should return any doc ID which is smaller than the first available, or NO_MORE_DOCS if it knows there are not docs in the set, is problematic. Consider the code in BS2 which checks if it should call nextDoc() before the while loop (the score(Collector, int) method). If a DISI decided to return '2' simply because the first doc available is '3', then we may incorrectly collect '2'. So I say we relax the policy in this way: "an implementation can return -1 if it knows the iterator is not exhausted, or NO_MORE_DOCS otherwise".
        Hide
        Michael McCandless added a comment -

        Consider the code in BS2 which checks if it should call nextDoc() before the while loop (the score(Collector, int) method). If a DISI decided to return '2' simply because the first doc available is '3', then we may incorrectly collect '2'.

        Actually, I'm not liking that particular added code anyway

        Ie the contract of that method (I believe) is that nextDoc must already have been called. I don't like adding ambiguity into the API, and so instead of the check I'd rather switch it to an "assert docID != -1", ie, assert that nextDoc was in fact called. And then if that assert trips up, we need to go understand the root cause for that.

        So I say we relax the policy in this way: "an implementation can return -1 if it knows the iterator is not exhausted, or NO_MORE_DOCS otherwise".

        OK, let's do that. Though: it need not return NO_MORE_DOCS up front, right? Ie, if it is able to determine up front it has no docs to iterate, great, else, it can return -1 until nextDoc() is called at which point it returns NO_MORE_DOCS?

        Show
        Michael McCandless added a comment - Consider the code in BS2 which checks if it should call nextDoc() before the while loop (the score(Collector, int) method). If a DISI decided to return '2' simply because the first doc available is '3', then we may incorrectly collect '2'. Actually, I'm not liking that particular added code anyway Ie the contract of that method (I believe) is that nextDoc must already have been called. I don't like adding ambiguity into the API, and so instead of the check I'd rather switch it to an "assert docID != -1", ie, assert that nextDoc was in fact called. And then if that assert trips up, we need to go understand the root cause for that. So I say we relax the policy in this way: "an implementation can return -1 if it knows the iterator is not exhausted, or NO_MORE_DOCS otherwise". OK, let's do that. Though: it need not return NO_MORE_DOCS up front, right? Ie, if it is able to determine up front it has no docs to iterate, great, else, it can return -1 until nextDoc() is called at which point it returns NO_MORE_DOCS?
        Hide
        Shai Erera added a comment -

        I don't like that code also. But if we allow to return any value, except -1 or NO_MORE_DOCS before the iteration started, someone will have very hard time trying to write such code (to determine if the iterator has started).

        The current contract already specifies what this method should return when nextDoc or advance were not called. We just need to make it more explicit:

        • Return -1 if the iteration did not start, and there are documents to return
        • Return NO_MORE_DOCS if there are no more docs to return (whether the iteration started or not).
        • Return the doc ID on any other case..

        Note that I also wrote that this method should not throw any exception, but I think of relaxing that either, and say "it is better if the implementation does not throw any exception in case there are no more documents to return". The reason is, we cannot force "don't throw exception" in the code ... What do you think?

        I will update the patch if you agree to these changes.

        Show
        Shai Erera added a comment - I don't like that code also. But if we allow to return any value, except -1 or NO_MORE_DOCS before the iteration started, someone will have very hard time trying to write such code (to determine if the iterator has started). The current contract already specifies what this method should return when nextDoc or advance were not called. We just need to make it more explicit: Return -1 if the iteration did not start, and there are documents to return Return NO_MORE_DOCS if there are no more docs to return (whether the iteration started or not). Return the doc ID on any other case.. Note that I also wrote that this method should not throw any exception, but I think of relaxing that either, and say "it is better if the implementation does not throw any exception in case there are no more documents to return". The reason is, we cannot force "don't throw exception" in the code ... What do you think? I will update the patch if you agree to these changes.
        Hide
        Michael McCandless added a comment -

        Let's decouple the two things we are talking about...

        First off, Scorer.score(Collector, int) (and also BooleanScorer2)
        require that nextDoc has already been called. So we should not add
        the "if" into those methods; if anything, we should do the opposite
        (add an assert).

        Second, the semantics of calling docID() before nextDoc/advance have
        been called... on this I think your proposal is good, except: in the
        event that the Scorer will not match any docs, it's also fine if it
        returns -1. Ie we do not require all Scorers to up-front determine
        whether they will match docs or not. Said another way, if a Scorer
        will not match any docs, docID() may return either -1 or NO_MORE_DOCS
        before nextDoc/advance have been called.

        Note that I also wrote that this method should not throw any exception, but I think of relaxing that either, and say "it is better if the implementation does not throw any exception in case there are no more documents to return". The reason is, we cannot force "don't throw exception" in the code ... What do you think?

        I don't think we need to allow for docID() to throw an exception?
        (doc() doesn't today and I haven't heard complaints, I think).

        If there are no more documents to return, docID() should return
        NO_MORE_DOCS.

        Show
        Michael McCandless added a comment - Let's decouple the two things we are talking about... First off, Scorer.score(Collector, int) (and also BooleanScorer2) require that nextDoc has already been called. So we should not add the "if" into those methods; if anything, we should do the opposite (add an assert). Second, the semantics of calling docID() before nextDoc/advance have been called... on this I think your proposal is good, except: in the event that the Scorer will not match any docs, it's also fine if it returns -1. Ie we do not require all Scorers to up-front determine whether they will match docs or not. Said another way, if a Scorer will not match any docs, docID() may return either -1 or NO_MORE_DOCS before nextDoc/advance have been called. Note that I also wrote that this method should not throw any exception, but I think of relaxing that either, and say "it is better if the implementation does not throw any exception in case there are no more documents to return". The reason is, we cannot force "don't throw exception" in the code ... What do you think? I don't think we need to allow for docID() to throw an exception? (doc() doesn't today and I haven't heard complaints, I think). If there are no more documents to return, docID() should return NO_MORE_DOCS.
        Hide
        Shai Erera added a comment -

        First off, Scorer.score(Collector, int) (and also BooleanScorer2) require that nextDoc has already been called.

        Agreed. I'll take a look at the code and check if that's possible. I thought to do it, but since Scorer is public, and score(Collector, int) is protected, I didn't feel that making this assumption is a good thing and might break back-compat (i.e., if someone overrode Scorer and call this method w/o calling nextDoc() first).

        I'll change the semantics as you propose, saying that an impl is free to return -1 or NO_MORE_DOCS in case it knows up front there are no more docs, before nextDoc or advance were called.

        Regarding the exceptions, I meant runtime exceptions. The method does not declare to throw anything. But I've seen cases, like ReqExclScorer where NPE may be thrown, or AIOOBE. That's what I would like to document - that impl should try to avoid it. Since I don't have a way in Java to restrict that, I'd like that to appear in the javadocs.

        Show
        Shai Erera added a comment - First off, Scorer.score(Collector, int) (and also BooleanScorer2) require that nextDoc has already been called. Agreed. I'll take a look at the code and check if that's possible. I thought to do it, but since Scorer is public, and score(Collector, int) is protected, I didn't feel that making this assumption is a good thing and might break back-compat (i.e., if someone overrode Scorer and call this method w/o calling nextDoc() first). I'll change the semantics as you propose, saying that an impl is free to return -1 or NO_MORE_DOCS in case it knows up front there are no more docs, before nextDoc or advance were called. Regarding the exceptions, I meant runtime exceptions. The method does not declare to throw anything. But I've seen cases, like ReqExclScorer where NPE may be thrown, or AIOOBE. That's what I would like to document - that impl should try to avoid it. Since I don't have a way in Java to restrict that, I'd like that to appear in the javadocs.
        Hide
        Michael McCandless added a comment -

        Agreed. I'll take a look at the code and check if that's possible. I thought to do it, but since Scorer is public, and score(Collector, int) is protected, I didn't feel that making this assumption is a good thing and might break back-compat (i.e., if someone overrode Scorer and call this method w/o calling nextDoc() first).

        Well the javadoc clearly states the requirement ("next must be called"). And if external Scorers are violating this, it'll cause problems if their used as sub-scorers by BooleanScorer.

        But I've seen cases, like ReqExclScorer where NPE may be thrown, or AIOOBE. That's what I would like to document - that impl should try to avoid it. Since I don't have a way in Java to restrict that, I'd like that to appear in the javadocs.

        Under what cases are NPE & AIOOBE thrown by ReqExclScorer? (That doesn't seem right).

        I don't think we need to document that you shouldn't throw RuntimeExceptions – that's pretty much par for the course, right? Or maybe I'm missing something... does something go horribly wrong in particular if an exception is thrown by docID()?

        Show
        Michael McCandless added a comment - Agreed. I'll take a look at the code and check if that's possible. I thought to do it, but since Scorer is public, and score(Collector, int) is protected, I didn't feel that making this assumption is a good thing and might break back-compat (i.e., if someone overrode Scorer and call this method w/o calling nextDoc() first). Well the javadoc clearly states the requirement ("next must be called"). And if external Scorers are violating this, it'll cause problems if their used as sub-scorers by BooleanScorer. But I've seen cases, like ReqExclScorer where NPE may be thrown, or AIOOBE. That's what I would like to document - that impl should try to avoid it. Since I don't have a way in Java to restrict that, I'd like that to appear in the javadocs. Under what cases are NPE & AIOOBE thrown by ReqExclScorer? (That doesn't seem right). I don't think we need to document that you shouldn't throw RuntimeExceptions – that's pretty much par for the course, right? Or maybe I'm missing something... does something go horribly wrong in particular if an exception is thrown by docID()?
        Hide
        Shai Erera added a comment -

        Under what cases are NPE & AIOOBE thrown by ReqExclScorer?

        Previously, ReqExclScorer had this code:

        public int doc() {
          return reqScorer.doc(); // reqScorer may be null when next() or skipTo() already return false
        }
        

        The current patch already fixes it to not rely on reqScorer to be or not be null (we've also discussed this issue somewhere up this thread).

        I don't think we need to document that you shouldn't throw RuntimeExceptions

        Yeah ... I guess you're right. When I read the documentation it does sound kind of silly (as if we're begging "please don't throw RuntimeException"). I'll remove that. I just wanted to encourage implementers to not rely on an object which may or may not be null (or index going out of range).

        Well the javadoc clearly states the requirement ("next must be called")

        You're right Mike, I didn't read the javadocs (shame on me ). I'll fix the code then.

        Show
        Shai Erera added a comment - Under what cases are NPE & AIOOBE thrown by ReqExclScorer? Previously, ReqExclScorer had this code: public int doc() { return reqScorer.doc(); // reqScorer may be null when next() or skipTo() already return false } The current patch already fixes it to not rely on reqScorer to be or not be null (we've also discussed this issue somewhere up this thread). I don't think we need to document that you shouldn't throw RuntimeExceptions Yeah ... I guess you're right. When I read the documentation it does sound kind of silly (as if we're begging "please don't throw RuntimeException"). I'll remove that. I just wanted to encourage implementers to not rely on an object which may or may not be null (or index going out of range). Well the javadoc clearly states the requirement ("next must be called") You're right Mike, I didn't read the javadocs (shame on me ). I'll fix the code then.
        Hide
        Shai Erera added a comment -

        DISI.docID documentation (and CHANGES) changed. Also the code in Scorer and BS2 score(Collector, int) does not check for -1.

        Show
        Shai Erera added a comment - DISI.docID documentation (and CHANGES) changed. Also the code in Scorer and BS2 score(Collector, int) does not check for -1.
        Hide
        Shai Erera added a comment -

        I was wondering if instead of adding an assert to score(Collector, int) on the doc ID not being -1, we create another method score(Collector, int, int /* firstDocID */) and deprecate the current one.
        It will behave exactly as score(Collector, int), only I think it makes it more explicit that we require nextDoc() to be called before. The documentation can state that the method expects the first doc ID and not a negative number.

        Besides, the code today does: (1) nextDoc() and (2) score(Collector, int) which then calls docID(). If we allow to pass the first doc ID, we can use the value that's returned from nextDoc() (e.g., score(Collector, int, nextDoc())).

        What do you think?

        Show
        Shai Erera added a comment - I was wondering if instead of adding an assert to score(Collector, int) on the doc ID not being -1, we create another method score(Collector, int, int /* firstDocID */) and deprecate the current one. It will behave exactly as score(Collector, int), only I think it makes it more explicit that we require nextDoc() to be called before. The documentation can state that the method expects the first doc ID and not a negative number. Besides, the code today does: (1) nextDoc() and (2) score(Collector, int) which then calls docID(). If we allow to pass the first doc ID, we can use the value that's returned from nextDoc() (e.g., score(Collector, int, nextDoc())). What do you think?
        Hide
        Michael McCandless added a comment -

        we create another method score(Collector, int, int /* firstDocID */) and deprecate the current one.

        I think that makes sense!

        Show
        Michael McCandless added a comment - we create another method score(Collector, int, int /* firstDocID */) and deprecate the current one. I think that makes sense!
        Hide
        Michael McCandless added a comment -

        It seems like we lost this at some point (I couldn't find it in the latest patch)?

        > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer?

        Done.

        When I add assert scorer.docID() == -1 || scorer.docID() == Scorer.NO_MORE_DOCS; in doSearch, after we get the scorer, all tests pass. Was there something that went wrong when you added it?

        Show
        Michael McCandless added a comment - It seems like we lost this at some point (I couldn't find it in the latest patch)? > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer? Done. When I add assert scorer.docID() == -1 || scorer.docID() == Scorer.NO_MORE_DOCS; in doSearch, after we get the scorer, all tests pass. Was there something that went wrong when you added it?
        Hide
        Shai Erera added a comment -

        Ok I'll add this method.

        Regarding that assert, you wrote the following somewhere up this thread:

        > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer?

        Done.

        OK, except we may delete it depending on the relaxing (below)...

        So I removed it . Anyway I'll add it back.

        Show
        Shai Erera added a comment - Ok I'll add this method. Regarding that assert, you wrote the following somewhere up this thread: > Can you add an "assert scorer.docID() == -1" in IndexSearcher right when it gets the scorer? Done. OK, except we may delete it depending on the relaxing (below)... So I removed it . Anyway I'll add it back.
        Hide
        Michael McCandless added a comment -

        So I removed it . Anyway I'll add it back.

        Ahh, OK But we didn't end up relaxing the semantics, ie we are back to "docID() must return -1 or NO_MORE_DOCS before nextDoc/advance have been called"... so let's just add it back.

        Show
        Michael McCandless added a comment - So I removed it . Anyway I'll add it back. Ahh, OK But we didn't end up relaxing the semantics, ie we are back to "docID() must return -1 or NO_MORE_DOCS before nextDoc/advance have been called"... so let's just add it back.
        Hide
        Shai Erera added a comment -

        Added Scorer.score(Collector, int, int) and deprecated the other one, with an impl:

        return score(collector, int, docID());
        

        I also added the assert to IndexSearcher.doSearch.

        Do you think we should add 'assert firstDocID >= 0' to the new score(Collector, int, int) implementations? It seems redundant to me ...

        Show
        Shai Erera added a comment - Added Scorer.score(Collector, int, int) and deprecated the other one, with an impl: return score(collector, int , docID()); I also added the assert to IndexSearcher.doSearch. Do you think we should add 'assert firstDocID >= 0' to the new score(Collector, int, int) implementations? It seems redundant to me ...
        Hide
        Michael McCandless added a comment -

        Do you think we should add 'assert firstDocID >= 0' to the new score(Collector, int, int) implementations? It seems redundant to me ...

        Agreed, I think it's redundant. Let's skip it.

        Show
        Michael McCandless added a comment - Do you think we should add 'assert firstDocID >= 0' to the new score(Collector, int, int) implementations? It seems redundant to me ... Agreed, I think it's redundant. Let's skip it.
        Hide
        Shai Erera added a comment -

        Ok then I guess we're ready for the final (hopefully ) iteration.

        Show
        Shai Erera added a comment - Ok then I guess we're ready for the final (hopefully ) iteration.
        Hide
        Michael McCandless added a comment -

        Yes... I'm reviewing & I'll post any changes...

        Show
        Michael McCandless added a comment - Yes... I'm reviewing & I'll post any changes...
        Hide
        Michael McCandless added a comment -

        OK I think we're getting close... I attached patch w/ these
        changes/questions:

        • I think we should require that nextDoc/advance not be called again
          once NO_MORE_DOCS has already been returned? This would save
          checks, eg in ConjunctionScorer & DisjunctionMaxScorer &
          ReqExclScorer at least, for "did I already return NO_MORE_DOCS" on
          every nextDoc call. Some things already seem to require this (eg
          DocIdBitSet's DISI).
        • I think the new code for applying a filter during searching (~line
          282 of IndexSearcher) can be made more efficient in the "if
          (scorerDoc == filterDoc)" case. Inside there, we ask
          filterDocIter to do nextDoc(), at which point we know scorerDoc is
          < filterDoc, and so it'd be better to call scorer.advance, there,
          instead of wrapping around the loop and having to do the if at the
          top?
        • Fixed bug in how "more" is computed in BooleanScorer (we were
          directly assigning to more instead of or'ing in the new value)
        • Various cosmetic changes – removed unused vars, imports; made
          some attrs final.
        • [As a separate issue, it seems like NonMatchingScorer should be
          removed, and we should fix BooleanQuery.scorer to instead return
          null in such cases]
        Show
        Michael McCandless added a comment - OK I think we're getting close... I attached patch w/ these changes/questions: I think we should require that nextDoc/advance not be called again once NO_MORE_DOCS has already been returned? This would save checks, eg in ConjunctionScorer & DisjunctionMaxScorer & ReqExclScorer at least, for "did I already return NO_MORE_DOCS" on every nextDoc call. Some things already seem to require this (eg DocIdBitSet's DISI). I think the new code for applying a filter during searching (~line 282 of IndexSearcher) can be made more efficient in the "if (scorerDoc == filterDoc)" case. Inside there, we ask filterDocIter to do nextDoc(), at which point we know scorerDoc is < filterDoc, and so it'd be better to call scorer.advance, there, instead of wrapping around the loop and having to do the if at the top? Fixed bug in how "more" is computed in BooleanScorer (we were directly assigning to more instead of or'ing in the new value) Various cosmetic changes – removed unused vars, imports; made some attrs final. [As a separate issue, it seems like NonMatchingScorer should be removed, and we should fix BooleanQuery.scorer to instead return null in such cases]
        Hide
        Shai Erera added a comment -

        I think we should require that nextDoc/advance not be called again once NO_MORE_DOCS has already been returned?

        So you mean add something like this to the javadocs "after NO_MORE_DOCS was returned, you should not call this method again, or it may result in unpredicted behavior"?

        Some things already seem to require this (eg DocIdBitSet's DISI).

        You mean that this will remove the d == -1 check?

        I think the new code for applying a filter during searching (~line 282 of IndexSearcher) ...

        So you mean to change the last line here:

        if (scorerDoc == filterDoc) { // permitted by filter
                collector.collect(scorerDoc);
                if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) break;
        

        to

        if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) {
          break;
        } else if ((scorerDoc = scorer.advance(filterDoc)) == DocIdSetIterator.NO_MORE_DOCS) {
          break;
        }
        

        I don't see what it will give us since after the loop wraps around, we check anyway if filterDoc > scoreDoc.

        But .. I think the while loop can be changed to:

        int filterDoc = filterDocIdIterator.nextDoc();
        int scorerDoc = scorer.advance(filterDoc);
        if (filterDoc == DocIdSetIterator.NO_MORE_DOCS
            || scorerDoc == DocIdSetIterator.NO_MORE_DOCS) {
          return;
        }
        
        collector.setScorer(scorer);
        while (true) {
          // scorerDoc >= filterDoc
          if (scorerDoc == filterDoc) { // permitted by filter
            collector.collect(scorerDoc);
            if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) break;
          } else if ((filterDoc = filterDocIdIterator.advance(scorerDoc)) == DocIdSetIterator.NO_MORE_DOCS) break;
        
          // The above code may have moved filterDoc beyond scorerDoc, so advance scorerDoc
          if ((scorerDoc = scorer.advance(filterDoc)) == DocIdSetIterator.NO_MORE_DOCS) break;
        }
        

        What do you think?

        Also Mike - the patch you posted is 152KB, while my last patch is 201KB. It's hard to compare our patches since the order of the classes is different, so until I apply the patch and check it, I wanted to make sure you included all the changes in the patch. (comparing our previous 2 patches from May 27 - they were of same size, so the last one is a bit fishy).

        Show
        Shai Erera added a comment - I think we should require that nextDoc/advance not be called again once NO_MORE_DOCS has already been returned? So you mean add something like this to the javadocs "after NO_MORE_DOCS was returned, you should not call this method again, or it may result in unpredicted behavior"? Some things already seem to require this (eg DocIdBitSet's DISI). You mean that this will remove the d == -1 check? I think the new code for applying a filter during searching (~line 282 of IndexSearcher) ... So you mean to change the last line here: if (scorerDoc == filterDoc) { // permitted by filter collector.collect(scorerDoc); if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) break ; to if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) { break ; } else if ((scorerDoc = scorer.advance(filterDoc)) == DocIdSetIterator.NO_MORE_DOCS) { break ; } I don't see what it will give us since after the loop wraps around, we check anyway if filterDoc > scoreDoc. But .. I think the while loop can be changed to: int filterDoc = filterDocIdIterator.nextDoc(); int scorerDoc = scorer.advance(filterDoc); if (filterDoc == DocIdSetIterator.NO_MORE_DOCS || scorerDoc == DocIdSetIterator.NO_MORE_DOCS) { return ; } collector.setScorer(scorer); while ( true ) { // scorerDoc >= filterDoc if (scorerDoc == filterDoc) { // permitted by filter collector.collect(scorerDoc); if ((filterDoc = filterDocIdIterator.nextDoc()) == DocIdSetIterator.NO_MORE_DOCS) break ; } else if ((filterDoc = filterDocIdIterator.advance(scorerDoc)) == DocIdSetIterator.NO_MORE_DOCS) break ; // The above code may have moved filterDoc beyond scorerDoc, so advance scorerDoc if ((scorerDoc = scorer.advance(filterDoc)) == DocIdSetIterator.NO_MORE_DOCS) break ; } What do you think? Also Mike - the patch you posted is 152KB, while my last patch is 201KB. It's hard to compare our patches since the order of the classes is different, so until I apply the patch and check it, I wanted to make sure you included all the changes in the patch. (comparing our previous 2 patches from May 27 - they were of same size, so the last one is a bit fishy).
        Hide
        Michael McCandless added a comment -

        So you mean add something like this to the javadocs "after NO_MORE_DOCS was returned, you should not call this method again, or it may result in unpredicted behavior"?

        Right. And, optimizing the core Scorers based on this (ie, to stop
        checking if they had already returned NO_MORE_DOCS, in
        nextDoc/advance).

        You mean that this will remove the d == -1 check?

        No, I think that check must remain, even once we switch to the new
        semantics. What I meant was, currently, if I call nextDoc() on
        DocIdBitSet's DISI after nextDoc had already returned NO_MORE_DOCS, it
        looks like I'll get into trouble because that will then call
        BitSet.nextSetBit(NO_MORE_DOCS+1), ie,
        BitSet.nextSetBit(Integer.MIN_VALUE), which'll throw an exception.
        Ie, this DISI already requires that you don't call nextDoc() again
        after it's returned NO_MORE_DOCS.

        So you mean to change the last line here:

        Actually, I meant in the case where filterDoc == scorerDoc, you ought
        to advance both filter & scorer, not just filter (you know at that
        point that both must advance).

        I think the new while loop is buggy: if scorerDoc is not == filterDoc,
        and you ask filterDocIdIter to advance to scorerDoc, it may advance to
        become ==, yet you then illegally advance scorerDoc to filterDoc?

        Also Mike - the patch you posted is 152KB, while my last patch is 201KB. It's hard to compare our patches since the order of the classes is different, so until I apply the patch and check it, I wanted to make sure you included all the changes in the patch.

        I'm pretty sure this is OK: it's because I changed the eol-style to
        native (in my local checkout), which then caused "svn diff" to produce
        more reasonable diffs. In your patch, there are several files where
        the entire file is deleted, and then the entire new file is added,
        because the eol-style wasn't set. If you "grep Index: patch.txt | wc"
        of yours & mine, they should the same.

        Show
        Michael McCandless added a comment - So you mean add something like this to the javadocs "after NO_MORE_DOCS was returned, you should not call this method again, or it may result in unpredicted behavior"? Right. And, optimizing the core Scorers based on this (ie, to stop checking if they had already returned NO_MORE_DOCS, in nextDoc/advance). You mean that this will remove the d == -1 check? No, I think that check must remain, even once we switch to the new semantics. What I meant was, currently, if I call nextDoc() on DocIdBitSet's DISI after nextDoc had already returned NO_MORE_DOCS, it looks like I'll get into trouble because that will then call BitSet.nextSetBit(NO_MORE_DOCS+1), ie, BitSet.nextSetBit(Integer.MIN_VALUE), which'll throw an exception. Ie, this DISI already requires that you don't call nextDoc() again after it's returned NO_MORE_DOCS. So you mean to change the last line here: Actually, I meant in the case where filterDoc == scorerDoc, you ought to advance both filter & scorer, not just filter (you know at that point that both must advance). I think the new while loop is buggy: if scorerDoc is not == filterDoc, and you ask filterDocIdIter to advance to scorerDoc, it may advance to become ==, yet you then illegally advance scorerDoc to filterDoc? Also Mike - the patch you posted is 152KB, while my last patch is 201KB. It's hard to compare our patches since the order of the classes is different, so until I apply the patch and check it, I wanted to make sure you included all the changes in the patch. I'm pretty sure this is OK: it's because I changed the eol-style to native (in my local checkout), which then caused "svn diff" to produce more reasonable diffs. In your patch, there are several files where the entire file is deleted, and then the entire new file is added, because the eol-style wasn't set. If you "grep Index: patch.txt | wc" of yours & mine, they should the same.
        Hide
        Shai Erera added a comment -

        I think the new while loop is buggy:

        You're right. But then, I don't see the benefit of adding that advance() call, since in the while loop you'd still need to check whether filterDoc > scorerDoc. So why do it? Is it just a matter of 'not relying on the above instruction, since at that point I know both should be advanced'?

        I changed the eol-style to native

        Can you explain me what that means? Maybe I should also change it in my eclipse?

        Show
        Shai Erera added a comment - I think the new while loop is buggy: You're right. But then, I don't see the benefit of adding that advance() call, since in the while loop you'd still need to check whether filterDoc > scorerDoc. So why do it? Is it just a matter of 'not relying on the above instruction, since at that point I know both should be advanced'? I changed the eol-style to native Can you explain me what that means? Maybe I should also change it in my eclipse?
        Hide
        Shai Erera added a comment -

        Actually, Mike, removing those check from ConjunctionScorer is a bit problematic. After I changed CS to initialize in the ctor, it may realize one of the Scorers does not have any more documents, and so set lastDoc to be NO_MORE_DOCS.
        Then, nextDoc() is called by BS2.score(Collector), which yields down the stack to TermScorer attempting to score Integer.MAX_VALUE and AIOOBE is thrown, or for some other query to ReqExclScorer to throw NPE.

        So I think the solution is to make sure Scorers which do not have a nextDoc() should not be passed to CS in the first place. And remove this check from CS ctor.

        Show
        Shai Erera added a comment - Actually, Mike, removing those check from ConjunctionScorer is a bit problematic. After I changed CS to initialize in the ctor, it may realize one of the Scorers does not have any more documents, and so set lastDoc to be NO_MORE_DOCS. Then, nextDoc() is called by BS2.score(Collector), which yields down the stack to TermScorer attempting to score Integer.MAX_VALUE and AIOOBE is thrown, or for some other query to ReqExclScorer to throw NPE. So I think the solution is to make sure Scorers which do not have a nextDoc() should not be passed to CS in the first place. And remove this check from CS ctor.
        Hide
        Shai Erera added a comment -

        So I think the solution is to make sure Scorers which do not have a nextDoc() should not be passed to CS in the first place. And remove this check from CS ctor.

        I'm afraid that's impossible too, since CS advances all Scorers in its ctor to the first agreed-upon doc ID, which may result in NO_MORE_DOCS right upfront.

        What if we document that you shouldn't call nextDoc() after it return NO_MORE_DOCS for now, and then maybe we do proceed (in another issue) with the DISI.start() method. Then CS can implement start() and return false if there are no more docs?

        I also tried changing BS2.score(Collector) to first check the docID of countingSumScorer, but that is not good either, since a Scorer is allowed to return -1 or NO_MORE_DOCS before nextDoc() was called ...

        What do you think?

        Show
        Shai Erera added a comment - So I think the solution is to make sure Scorers which do not have a nextDoc() should not be passed to CS in the first place. And remove this check from CS ctor. I'm afraid that's impossible too, since CS advances all Scorers in its ctor to the first agreed-upon doc ID, which may result in NO_MORE_DOCS right upfront. What if we document that you shouldn't call nextDoc() after it return NO_MORE_DOCS for now, and then maybe we do proceed (in another issue) with the DISI.start() method. Then CS can implement start() and return false if there are no more docs? I also tried changing BS2.score(Collector) to first check the docID of countingSumScorer, but that is not good either, since a Scorer is allowed to return -1 or NO_MORE_DOCS before nextDoc() was called ... What do you think?
        Hide
        Michael McCandless added a comment -

        But then, I don't see the benefit of adding that advance() call, since in the while loop you'd still need to check whether filterDoc > scorerDoc. So why do it? Is it just a matter of 'not relying on the above instruction, since at that point I know both should be advanced'?

        Since you know both should be advanced, why loop back and do an
        unecessary if check? Ie, I the CPU will do more work (computing a
        redundant if) if we don't advance both.

        However, my solution (also advancing the scorer) is still not right,
        because on cycling back you know scorerDoc >= filterDoc, so we've also
        wasted an if check there... hmmm.

        So actually let's just stick w/ your current approach, since it's a
        straight conversion of what trunk currently does, except can we stop
        checking NO_MORE_DOCS on each advance/nextDoc, and only check it when
        we're about to collect a doc?

        And, anyway, maybe we shouldn't fret so much about it; we know this
        filter loop needs to change for 2.9 anyway. I think we want something
        like this:

        • If filter is random access, it should be pushed all the way down
          and applied just like deleted docs
        • If filter is "relatively" sparse compared to query, then we should
          to use the (to-be-added) DISI.check() method
        • Else, add filter as clause on BQ. If incoming Query is already a
          BQ, clone that and tack filter on as clause; else, make a BQ and
          add both.

        Can you explain me what that means? Maybe I should also change it in my eclipse?

        We're supposed to do this (tell svn to use native EOL – "svn propset
        svn:eol-style native") for all our sources, to avoid issues like
        this.

        What if we document that you shouldn't call nextDoc() after it return NO_MORE_DOCS for now

        OK, let's do that. We can leave "taking advantage of this" (start
        method) to another issue. But let's be crystal clear on the semantics
        of DISI for this issue.

        Show
        Michael McCandless added a comment - But then, I don't see the benefit of adding that advance() call, since in the while loop you'd still need to check whether filterDoc > scorerDoc. So why do it? Is it just a matter of 'not relying on the above instruction, since at that point I know both should be advanced'? Since you know both should be advanced, why loop back and do an unecessary if check? Ie, I the CPU will do more work (computing a redundant if) if we don't advance both. However, my solution (also advancing the scorer) is still not right, because on cycling back you know scorerDoc >= filterDoc, so we've also wasted an if check there... hmmm. So actually let's just stick w/ your current approach, since it's a straight conversion of what trunk currently does, except can we stop checking NO_MORE_DOCS on each advance/nextDoc, and only check it when we're about to collect a doc? And, anyway, maybe we shouldn't fret so much about it; we know this filter loop needs to change for 2.9 anyway. I think we want something like this: If filter is random access, it should be pushed all the way down and applied just like deleted docs If filter is "relatively" sparse compared to query, then we should to use the (to-be-added) DISI.check() method Else, add filter as clause on BQ. If incoming Query is already a BQ, clone that and tack filter on as clause; else, make a BQ and add both. Can you explain me what that means? Maybe I should also change it in my eclipse? We're supposed to do this (tell svn to use native EOL – "svn propset svn:eol-style native") for all our sources, to avoid issues like this. What if we document that you shouldn't call nextDoc() after it return NO_MORE_DOCS for now OK, let's do that. We can leave "taking advantage of this" (start method) to another issue. But let's be crystal clear on the semantics of DISI for this issue.
        Hide
        Shai Erera added a comment -

        But let's be crystal clear on the semantics of DISI for this issue.

        I'll change the Javadocs. Actually nextDoc() was already documented that way, I'll add it to advance also.

        Since you know both should be advanced, why loop back and do an unecessary if check?

        You do the 'if' check anyway because of scorer.advance() before the loop starts. If we remove scorer.advance() before the loop starts, we can remove the 'if' check also.
        Anyway, I'll try to optimize that code segment.

        Show
        Shai Erera added a comment - But let's be crystal clear on the semantics of DISI for this issue. I'll change the Javadocs. Actually nextDoc() was already documented that way, I'll add it to advance also. Since you know both should be advanced, why loop back and do an unecessary if check? You do the 'if' check anyway because of scorer.advance() before the loop starts. If we remove scorer.advance() before the loop starts, we can remove the 'if' check also. Anyway, I'll try to optimize that code segment.
        Hide
        Shai Erera added a comment -

        Changed IndexSearcher.doSearch a bit. I think it's more efficient now, since it does not check for filterDoc > scorerDoc at the beginning of the iteration, but rather is handled at the 'else' part. (since when we enter the loop scorerDoc >= filterDoc, that check should not have happened right at the beginning).

        I also changed DISI's nextDoc and advance javadocs, to advise against calling them after the iterator has exhausted.

        Show
        Shai Erera added a comment - Changed IndexSearcher.doSearch a bit. I think it's more efficient now, since it does not check for filterDoc > scorerDoc at the beginning of the iteration, but rather is handled at the 'else' part. (since when we enter the loop scorerDoc >= filterDoc, that check should not have happened right at the beginning). I also changed DISI's nextDoc and advance javadocs, to advise against calling them after the iterator has exhausted.
        Hide
        Michael McCandless added a comment -

        Patch looks good! I plan to commit in a day or two...

        Show
        Michael McCandless added a comment - Patch looks good! I plan to commit in a day or two...
        Hide
        Uwe Schindler added a comment -

        Quickly looking over the code:

        Javadocs for DocIdSetIterator still sometimes contain -1 instead of NO_MORE_DOCS

        +    Scorer.score(Collector, int) was deprecated in favor of 
        +    Scorer.score(Collector, int, int) which take an extra 'firstDocID' 
        +    parameter. This is meant to ensure nextDoc() is called before this method.
        

        Collector is a new class in 2.9, so no need to deprecate (in the code it seems correct).

        Will look into it more detailed tomorrow.

        Show
        Uwe Schindler added a comment - Quickly looking over the code: Javadocs for DocIdSetIterator still sometimes contain -1 instead of NO_MORE_DOCS + Scorer.score(Collector, int ) was deprecated in favor of + Scorer.score(Collector, int , int ) which take an extra 'firstDocID' + parameter. This is meant to ensure nextDoc() is called before this method. Collector is a new class in 2.9, so no need to deprecate (in the code it seems correct). Will look into it more detailed tomorrow.
        Hide
        Shai Erera added a comment -

        Updated to 20090601a tag.
        I also removed the comment on Scorer from CHANGES. Thanks Uwe - this deprecation stuff really got over me . I deleted the method from Scorer as well.

        Uwe, regarding the -1 in DISI - -1 is a valid value for docID() in case nextDoc/advance were not called yet. That's why you see it in the javadocs. I did however fix advance's sample code in the javadocs to return NO_MORE_DOCS instead of -1. If that's what you meant, then thanks for spotting that.

        Show
        Shai Erera added a comment - Updated to 20090601a tag. I also removed the comment on Scorer from CHANGES. Thanks Uwe - this deprecation stuff really got over me . I deleted the method from Scorer as well. Uwe, regarding the -1 in DISI - -1 is a valid value for docID() in case nextDoc/advance were not called yet. That's why you see it in the javadocs. I did however fix advance's sample code in the javadocs to return NO_MORE_DOCS instead of -1. If that's what you meant, then thanks for spotting that.
        Hide
        Uwe Schindler added a comment -

        Uwe, regarding the -1 in DISI - -1 is a valid value for docID() in case nextDoc/advance were not called yet. That's why you see it in the javadocs. I did however fix advance's sample code in the javadocs to return NO_MORE_DOCS instead of -1. If that's what you meant, then thanks for spotting that.

        I meant the sample code that explains how it is internally implemented for backwards compatibility (if not overwritten). So you fixed the right one, I think.

        There is one other thing with the deprecation: When we deprectated the Filters getBits() method we un-abstracted this method and added the new one with the note, that it will be made abstract and the old bits() removed.

        You kept the old methods skipTo() and next() abstract, but deprecated. One must now always implement next() and skipTo() but gets a deprecated warning because of that. So I would do it in the same way like bits() for filters. Un-abstract it and implement it with throwing an UnsupportedOperationException(). The new advance() and others are calling this method (as it is now). If somebody overwrites this abstract class, he can leave out the deprecated one, will get no warning. As the method is not called anymore by Lucene he will never see a UOE. In 3.0 it can easily removed without problems for any users.

        Show
        Uwe Schindler added a comment - Uwe, regarding the -1 in DISI - -1 is a valid value for docID() in case nextDoc/advance were not called yet. That's why you see it in the javadocs. I did however fix advance's sample code in the javadocs to return NO_MORE_DOCS instead of -1. If that's what you meant, then thanks for spotting that. I meant the sample code that explains how it is internally implemented for backwards compatibility (if not overwritten). So you fixed the right one, I think. There is one other thing with the deprecation: When we deprectated the Filters getBits() method we un-abstracted this method and added the new one with the note, that it will be made abstract and the old bits() removed. You kept the old methods skipTo() and next() abstract, but deprecated. One must now always implement next() and skipTo() but gets a deprecated warning because of that. So I would do it in the same way like bits() for filters. Un-abstract it and implement it with throwing an UnsupportedOperationException(). The new advance() and others are calling this method (as it is now). If somebody overwrites this abstract class, he can leave out the deprecated one, will get no warning. As the method is not called anymore by Lucene he will never see a UOE. In 3.0 it can easily removed without problems for any users.
        Hide
        Shai Erera added a comment -

        Thanks Uwe - that's a good idea. I've implemented all 3 to throw UOE as well as document that in their javadocs. I kept nextDoc() as is, i.e., calling doc() instead of docID(), since that's required for back-compat. Since all DISIs should have their doc() implemented, that shouldn't be a problem.

        Show
        Shai Erera added a comment - Thanks Uwe - that's a good idea. I've implemented all 3 to throw UOE as well as document that in their javadocs. I kept nextDoc() as is, i.e., calling doc() instead of docID(), since that's required for back-compat. Since all DISIs should have their doc() implemented, that shouldn't be a problem.
        Hide
        Uwe Schindler added a comment - - edited

        I kept nextDoc() as is, i.e., calling doc() instead of docID(), since that's required for back-compat.

        But doc() is also deprecated, why should they have implemented it? This makes no sense for me. Non-deprecated-code should always call non-deprectaed methods, so docID().

        EDIT: Ah sorry, you are right. new iterators would override this method and not call doc() and next().

        Since all DISIs should have their doc() implemented, that shouldn't be a problem.

        This should be "Since all DISIs should have their docID() implemented, that shouldn't be a problem.", then it makes sense to me.

        Show
        Uwe Schindler added a comment - - edited I kept nextDoc() as is, i.e., calling doc() instead of docID(), since that's required for back-compat. But doc() is also deprecated, why should they have implemented it? This makes no sense for me. Non-deprecated-code should always call non-deprectaed methods, so docID(). EDIT: Ah sorry, you are right. new iterators would override this method and not call doc() and next(). Since all DISIs should have their doc() implemented, that shouldn't be a problem. This should be "Since all DISIs should have their docID() implemented, that shouldn't be a problem.", then it makes sense to me.
        Hide
        Shai Erera added a comment -

        This is all to support jar drop-in ability. So say you wrote your own DISI with doc(), next() and skipTo() and now pass it to 2.9. The code in 2.9 will call nextDoc(), but you have implemented docID(), just doc(). Therefore nextDoc() cannot call docID(), same as it calls next() which is also deprecated.

        We cannot implement docID() by calling doc() either, since docID() defines new semantics, which existing doc() implementations may not adhere to.

        Of course, if someone chooses to change his/her code, he/she should override all 3 and provide his/her own implementation, especially since all 3 of them will become abstract in 3.0.

        I actually meant to say "Since all DISIs should have their doc() implemented ..." - that's the jar drop-in ability side of the story. Not all DISIs will have their docID() implemented. In fact none of them will, unless someone upgrades his code.

        Right?

        Show
        Shai Erera added a comment - This is all to support jar drop-in ability. So say you wrote your own DISI with doc(), next() and skipTo() and now pass it to 2.9. The code in 2.9 will call nextDoc(), but you have implemented docID(), just doc(). Therefore nextDoc() cannot call docID(), same as it calls next() which is also deprecated. We cannot implement docID() by calling doc() either, since docID() defines new semantics, which existing doc() implementations may not adhere to. Of course, if someone chooses to change his/her code, he/she should override all 3 and provide his/her own implementation, especially since all 3 of them will become abstract in 3.0. I actually meant to say "Since all DISIs should have their doc() implemented ..." - that's the jar drop-in ability side of the story. Not all DISIs will have their docID() implemented. In fact none of them will, unless someone upgrades his code. Right?
        Hide
        Uwe Schindler added a comment -

        Yes! Sorry for the false alarm, I did not think to the end.

        For completely new impls of DISI, one can leave the deprecated methods as they are and is also fine (and will get no deprecations), because current code will not call the old methods anymore. If he would drop this new class into an old Lucene it will throw UOE, but this is not what he should do.

        Show
        Uwe Schindler added a comment - Yes! Sorry for the false alarm, I did not think to the end. For completely new impls of DISI, one can leave the deprecated methods as they are and is also fine (and will get no deprecations), because current code will not call the old methods anymore. If he would drop this new class into an old Lucene it will throw UOE, but this is not what he should do.
        Hide
        Michael McCandless added a comment -

        Latest patch looks good. I plan to commit in a day or two. Phew! Thank Shai.

        Show
        Michael McCandless added a comment - Latest patch looks good. I plan to commit in a day or two. Phew! Thank Shai.
        Hide
        Michael McCandless added a comment -

        Thanks Shai!

        Show
        Michael McCandless added a comment - Thanks Shai!
        Hide
        Uwe Schindler added a comment -

        As Yonik noted on java-user, the backwards pattern is wrongly implemented for DocIdSetIterator.advance(). It should simply call skipTo, so that iterators only implementing skipTo() work correct when updated to Lucene 2.9.

        The method advance should look similar to nextDoc(), only next() replaced by skipTo()

        Uwe

        Show
        Uwe Schindler added a comment - As Yonik noted on java-user, the backwards pattern is wrongly implemented for DocIdSetIterator.advance(). It should simply call skipTo, so that iterators only implementing skipTo() work correct when updated to Lucene 2.9. The method advance should look similar to nextDoc(), only next() replaced by skipTo() Uwe
        Hide
        Uwe Schindler added a comment -

        This patch fixes this. All tests, especially test-tag pass.

        Show
        Uwe Schindler added a comment - This patch fixes this. All tests, especially test-tag pass.
        Hide
        Uwe Schindler added a comment -

        Here is a second possibility, maybe better. The javadocs of advance note, that it may be called with NO_MORE_DOCS as parameter. skipTo may not be able to handle this. So I added an extra check to the BW-compatiility method. Is this better, or is the first patch ok?

        I see no difference. Maybe yonik could try this out with Solr?

        Show
        Uwe Schindler added a comment - Here is a second possibility, maybe better. The javadocs of advance note, that it may be called with NO_MORE_DOCS as parameter. skipTo may not be able to handle this. So I added an extra check to the BW-compatiility method. Is this better, or is the first patch ok? I see no difference. Maybe yonik could try this out with Solr?
        Hide
        Michael McCandless added a comment -

        Whoa I agree: advance should fallback to skipTo. What a doozie. Crazy we didn't catch this the first time around.

        I think the 2nd patch is safest. There may be DISIs out there for which skipTo(Integer.MAX_VALUE) is costly and/or throws an errant exception. I'll commit soon.

        Show
        Michael McCandless added a comment - Whoa I agree: advance should fallback to skipTo. What a doozie. Crazy we didn't catch this the first time around. I think the 2nd patch is safest. There may be DISIs out there for which skipTo(Integer.MAX_VALUE) is costly and/or throws an errant exception. I'll commit soon.

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Shai Erera
          • Votes:
            1 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development