Lucene - Core
LUCENE-1593

Optimizations to TopScoreDocCollector and TopFieldCollector

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is:

      1. Ensure that IndexSearcher returns segments in increasing doc Id order, instead of sorted by numDocs().
      2. Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete.
      3. Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null.
      4. Also move to use "changing top" and then call adjustTop(), in case we update the queue.
      5. Some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the last SortField. But doing so should not be necessary (since we already break ties by docID), and it is in fact less efficient (once the above optimization is in).
      6. Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without "arranging" it, just store the objects in the array (this can be used to pre-populate sentinel values)?
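      Steps 1-3 can be illustrated with a toy top-K collector (hypothetical names, not the actual HitQueue/TSDC code): pre-filling the queue with Float.NEGATIVE_INFINITY sentinels removes the "queue not yet full" check from the hot loop, and because docs arrive in increasing id order, a hit that merely ties the top can never displace it.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Toy sketch of the sentinel optimization; names are illustrative only.
public class SentinelQueueDemo {
    // Returns the top numHits scores, least competitive first, using a
    // queue pre-filled with sentinel entries so the hot loop never needs
    // an "is the queue full yet?" null/size check.
    static float[] topScores(float[] scores, int numHits) {
        PriorityQueue<Float> pq = new PriorityQueue<>(); // least score at head
        for (int i = 0; i < numHits; i++) {
            pq.add(Float.NEGATIVE_INFINITY); // sentinel: loses to any real hit
        }
        for (float score : scores) {
            // Docs come in increasing id order (step 1), so a tie with the
            // current top cannot compete (step 2) and strict > suffices.
            if (score > pq.peek()) {
                pq.poll();
                pq.add(score);
            }
        }
        float[] out = new float[numHits];
        for (int i = 0; i < numHits; i++) out[i] = pq.poll();
        return out;
    }

    public static void main(String[] args) {
        float[] top = topScores(new float[] {0.5f, 2.0f, 0.1f, 3.0f, 1.5f}, 3);
        System.out.println(Arrays.toString(top)); // [1.5, 2.0, 3.0]
    }
}
```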

      I will post a patch as well as some perf measurements as soon as I have them.

      1. PerfTest.java
        4 kB
        Shai Erera
      2. LUCENE-1593.patch
        61 kB
        Shai Erera
      3. LUCENE-1593.patch
        109 kB
        Shai Erera
      4. LUCENE-1593.patch
        101 kB
        Shai Erera
      5. LUCENE-1593.patch
        101 kB
        Shai Erera
      6. LUCENE-1593.patch
        92 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Thanks Shai!

        Shai Erera added a comment -

        MultiSearcher.search was creating too big an array of ScoreDocs, and was incorrectly (because sentinels were not used) avoiding HitQueue.size().

        Oh right ... I forgot to roll that back since HitQueue is initialized in those cases to not pre-populate with sentinel values.

        Michael McCandless added a comment -

        OK I made some tiny fixes:

        • Added CHANGES entry explaining that Sort no longer tacks on
          SortField.FIELD_DOC since that tie break is already handled
          internally
        • MultiSearcher.search was creating too big an array of ScoreDocs,
          and was incorrectly (because sentinels were not used) avoiding
          HitQueue.size().
        • Renamed IndexSearcher.sortedStarts -> docStarts (they are no
          longer sorted)
        • Made BS2.initCountingSumScorer private again
        • Small whitespace fixes

        I think it's ready to commit! I'll wait a day or two.

        Michael McCandless added a comment -

        I don't think we should do anything in TFC for now. It will only save one 'if' and adding sentinel values is not so trivial. Maybe leave it for a specializer code?

        OK I agree, let's not do this one for now.

        New patch looks good – I'll review it some more and then wait a few days and commit. Thanks Shai!

        Shai Erera added a comment -
        • Removed the leftover references to Query#scoresDocInOrder.
        • Removed the text from CHANGES
        • I don't think we should do anything in TFC for now. It will only save one 'if' and adding sentinel values is not so trivial. Maybe leave it for a specializer code?
        Michael McCandless added a comment -

        Patch looks good! I can confirm that all tests pass. (Though the back-compat tag has moved forward – I just carried that part of the patch forward). Thanks Shai.

        Some small comments:

        • No need to call out the API change to HitQueue in CHANGES.txt
          (it's package private)
        • Can't the out-of-order TFC classes do the "pre-populate pqueue w/
          sentinel" optimization?
        • There are some leftover javadocs references to
          Query#scoresDocsInOrder (in at least TSDC)
        • I don't think we should do this yet, but...: it'd conceivably be
          possible at indexing time to record if a given field ever uses the
          sentinel value for that field's type. If it doesn't, we can also
          safely pre-fill the queue w/ sentinels even for in-order scoring.
          (There are many barriers to doing this optimization today in
          Lucene).
        Shai Erera added a comment -

        Patch includes all discussed changes, and defaults TSDC and TFC to out-of-order collection. It also covers the changes to the tag.

        Note that currently BS and BS2 check if they should init in next()/skipTo/score - I will fix it in the other issue after I separate between them (i.e., not having BS2 instantiate BS), via a topScorer() or something.

        All tests pass

        Michael McCandless added a comment -

        I assume the possibility to guarantee always in-order scorer will remain after all these changes?

        Correct.

        Earwin Burrfoot added a comment -

        So for this issue: create the "always in-order" optimization in TSDC/TFC, but leave it "dark" (IndexSearcher always asks for "out of order" collector). In the 2nd issue, make the switch from interface to abstract base class, and add methods so we can track in/out of order Scorer, and finally hook the two up (use an in-order Collector when the returned Scorer is always in-order).

        I assume the possibility to guarantee always in-order scorer will remain after all these changes?

        Michael McCandless added a comment -

        I don't mind doing that. I also volunteer to open the next issue and take care of it. Is that what you had in mind?

        OK that sounds great!

        So for this issue: create the "always in-order" optimization in TSDC/TFC, but leave it "dark" (IndexSearcher always asks for "out of order" collector). In the 2nd issue, make the switch from interface to abstract base class, and add methods so we can track in/out of order Scorer, and finally hook the two up (use an in-order Collector when the returned Scorer is always in-order).

        So the current patch on this issue needs to be redone, right? (Eg to remove Query.scoresDocsInOrder, etc.).

        Shai Erera added a comment -

        shouldn't we do the interface -> abstract class migration (of Weight & Searchable) under a separate issue?

        I think I asked it already but haven't received a concrete answer.

        This issue moved TSDC and TFC to always assume "in-order" collection of documents. If BS is used, they will break for the documents that compare equally to the top of the queue. Therefore we wanted to be able to create the right variant of TFC/TSDC depending on the Scorer we're going to get.

        I offered somewhere above (or at least intended to) that we keep the changes I've done to TSDC and TFC (allowing to create in/out-of order variant) and in IndexSearcher always ask for out-of-order, then in a separate issue make all these changes and really take advantage of the in-order variants unless BS is used (or any other future Scorer for that matter).

        I don't mind doing that. I also volunteer to open the next issue and take care of it. Is that what you had in mind?

        Michael McCandless added a comment -

        Shai your summary of what needs to be done looks right. But:
        shouldn't we do the interface -> abstract class migration (of Weight &
        Searchable) under a separate issue? Ie, under that issue no real
        functional change to Lucene is happening. Then in this issue we can
        make the optimizations?

        I just ran a quick perf test (best of 5 runs, Linux, JDK 1.6) of the
        query 1 OR 2 on a large Wikipedia index. Using BS instead of BS2
        gives a 27% speedup (2.2 QPS -> 2.8). I'd really like for Lucene to
        be able to use BS automatically when it can. In fact, I think we should
        move more scorers to out-of-order, if we can get these kinds of
        performance gains.

        These changes go beyond that, though, and also enable a separate
        optimization whereby the Collector knows it doesn't have to break ties
        by docID. TFC would then use that to gain performance for all in-order
        scorers.

        Some of these changes were discussed elsewhere already, e.g. deprecating Weight and Searchable and making them abstract classes for easier such changes in the future.

        In fact I think most of the work above is for this (and not the
        optimizations), and I think migration from interfaces -> abstract
        classes is important (for 2.9).

        Yonik Seeley added a comment -

        Or ... we can stop using BS at all and say that all scorers must work in-order.

        Well, BS is the odd man out... It's not currently used unless the user specifically sets it up and it doesn't implement skipTo() (although the latter could be fixed presumably).

        >> Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.
        > where would you set it?

        Haven't seriously thought about it, but it could be set inside the IndexSearcher.search() method and checked in BooleanWeight.scorer().
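        The thread-local hack floated here (never the approach ultimately committed; all names below are purely illustrative) might look roughly like this: the searcher sets a per-thread hint before scoring, and the weight consults it when choosing between the two boolean scorers.

```java
// Illustrative sketch of the ThreadLocal idea; hypothetical names,
// not Lucene's actual API. IndexSearcher.search() would call set(...)
// and BooleanWeight.scorer() would call get() to pick BS vs. BS2.
public class OutOfOrderHint {
    private static final ThreadLocal<Boolean> ALLOW_OUT_OF_ORDER =
        ThreadLocal.withInitial(() -> Boolean.FALSE);

    public static void set(boolean allowed) { ALLOW_OUT_OF_ORDER.set(allowed); }

    public static boolean get() { return ALLOW_OUT_OF_ORDER.get(); }

    // Stand-in for BooleanWeight.scorer(): out-of-order collection
    // permits the faster BooleanScorer, otherwise BooleanScorer2.
    public static String chooseScorer() {
        return get() ? "BooleanScorer" : "BooleanScorer2";
    }

    public static void main(String[] args) {
        System.out.println(chooseScorer()); // BooleanScorer2
        set(true);
        System.out.println(chooseScorer()); // BooleanScorer
    }
}
```

        The hack is "more localized" in the sense that no method signatures change, at the cost of invisible cross-method state.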

        Shai Erera added a comment -

        Whew - that's a lot of change just to sometimes allow BooleanScorer instead of BooleanScorer2

        Some of these changes were discussed elsewhere already, e.g. deprecating Weight and Searchable and make them abstract classes for easier such changes in the future.

        Also, it's not just about creating BS2 or BS sometimes ... it's about the changes in this issue, which moved to assume in-order document collection at first, and thus did not break ties on doc Ids at the Collector level. In order to allow this to work with the current BS, we need to have a way to determine which scorer will be used. Or ... we can stop using BS at all and say that all scorers must work in-order.

        Also, it's not that large a change, just lots of text (my fault). In the end we'll achieve some refactoring, and a few more deprecated methods.

        Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.

        I'm not sure I understand this - where would you set it? On IndexSearcher ctor? search methods (which means changes to the interfaces)? On Scorer (which is most local I can think of)?

        Yonik Seeley added a comment -

        Whew - that's a lot of change just to sometimes allow BooleanScorer instead of BooleanScorer2!
        Another option to consider is use of a thread local to pass this info. A bit of a hack, but it would be more localized.

        Shai Erera added a comment -

        Ok will do that. I also would like to summarize the latest posts here:

        1. Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
          • Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
          • Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes.
        2. Add to Scorer isOutOfOrder with a default to false, and override in BS to true.
        3. Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
        4. Add to Collector an abstract acceptsDocsOutOfOrder which returns true/false.
          • Use it in IndexSearcher.search methods, that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
          • Provide a static create method to TFC and TSDC which accept this as an argument and creates the proper instance.
          • Wherever we create a Collector (TSDC or TFC), always ask for out-of-order Scorer and check on the resulting Scorer isOutOfOrder(), so that we can create the optimized Collector instance.
        5. Modify IndexSearcher to use all of the above logic.

        The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:

        • Deprecate Searchable in favor of Searcher.
        • Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0.
        • Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something other than UnicastRemoteObject, like Activatable. I think there is a very small chance this has actually happened, but I would like to confirm with you guys first.
        • Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
        • Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
        • Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.

        I really hope I covered everything in this summary.
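        Point 4 of the summary — the Collector advertising whether it tolerates out-of-order docs, so the searcher can request a matching Scorer — can be sketched as follows (a simplified illustration with hypothetical signatures, not the committed API; strings stand in for the real scorer classes):

```java
// Simplified wiring of Collector.acceptsDocsOutOfOrder() to scorer choice.
// All names are illustrative stand-ins for the proposed Lucene API.
public class SearcherSketch {
    abstract static class Collector {
        abstract void collect(int doc);
        // Point 4: the collector declares its tolerance for out-of-order docs.
        abstract boolean acceptsDocsOutOfOrder();
    }

    abstract static class QueryWeight {
        // scoreDocsInOrder=false permits a faster out-of-order scorer.
        abstract String scorer(boolean scoreDocsInOrder);
    }

    static class BooleanQueryWeight extends QueryWeight {
        @Override String scorer(boolean scoreDocsInOrder) {
            return scoreDocsInOrder ? "BooleanScorer2" : "BooleanScorer";
        }
    }

    // Stand-in for IndexSearcher.search(): in-order delivery is only
    // required when the collector cannot handle out-of-order docs.
    static String pickScorer(QueryWeight weight, Collector collector) {
        return weight.scorer(!collector.acceptsDocsOutOfOrder());
    }

    static Collector collector(boolean outOfOrderOk) {
        return new Collector() {
            @Override void collect(int doc) {}
            @Override boolean acceptsDocsOutOfOrder() { return outOfOrderOk; }
        };
    }

    public static void main(String[] args) {
        System.out.println(pickScorer(new BooleanQueryWeight(), collector(true)));  // BooleanScorer
        System.out.println(pickScorer(new BooleanQueryWeight(), collector(false))); // BooleanScorer2
    }
}
```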

        Michael McCandless added a comment -

        Actually, if you request to "sort-by-score" and ask for a scoring Collector, the score() method will be hit twice

        OK I see – I think we should make a ScoreCachingWrapperScorer when FIELD_SCORE is one of the SortFields.

        I think this may come up fairly often... eg if one sorts by field X and then score, as the tie breaker.

        Also, we can always create a ScoringNoMaxScore collector in such cases, since if we're going to compute the score, why not save it?

        I don't think we should do this? (Ie, the API shouldn't do "magic" for you under-the-hood).

        Shai Erera added a comment -

        Actually, if you request to "sort-by-score" and ask for a scoring Collector, the score() method will be hit twice - once in tfc.collect(), which does not use a caching scorer, and a 2nd time in RelComp.copy()/compareTo(), which does use a caching scorer. If we want to handle it, although it is somewhat of an edge case, I suggest that we check in TFC.create() whether any of the sort fields is of type SortField.FIELD_SCORE and, if so, wrap the scorer given to setScorer with ScoreCachingWrapperScorer, and remove that wrapping from RelevanceComparator. That way, both Collector and Comparator will use the same caching scorer.

        Also, we can always create a ScoringNoMaxScore collector in such cases, since if we're going to compute the score, why not save it? I'm not sure about it since it will violate the API, i.e. you asked for a non-scoring collector and get a scoring one just because one of your sort fields was of type "sort-by-score". But then again, it is really an edge case, and I'm not sure why someone would want to do it.
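        The caching wrapper under discussion is straightforward: it remembers the docID of the last computed score, so that collect() and the relevance comparator can both call score() without recomputing. A minimal sketch (simplified; not Lucene's actual class, and the score-call counter exists only to demonstrate the caching):

```java
// Minimal sketch of a score-caching wrapper: repeated score() calls for
// the same doc hit the cache instead of recomputing. Illustrative only.
public class ScoreCacheDemo {
    abstract static class Scorer {
        abstract int docID();
        abstract float score();
    }

    static class ScoreCachingWrapperScorer extends Scorer {
        private final Scorer in;
        private int cachedDoc = -1; // docID of the cached score, -1 = none
        private float cachedScore;

        ScoreCachingWrapperScorer(Scorer in) { this.in = in; }

        @Override int docID() { return in.docID(); }

        @Override float score() {
            int doc = in.docID();
            if (doc != cachedDoc) {       // first call for this doc: compute
                cachedScore = in.score();
                cachedDoc = doc;
            }
            return cachedScore;           // later calls return the cached value
        }
    }

    // Test scorer that counts how often score() is actually computed.
    static class FixedScorer extends Scorer {
        int doc;
        int scoreCalls = 0;
        @Override int docID() { return doc; }
        @Override float score() { scoreCalls++; return doc * 0.5f; }
    }

    public static void main(String[] args) {
        FixedScorer in = new FixedScorer();
        ScoreCachingWrapperScorer s = new ScoreCachingWrapperScorer(in);
        in.doc = 7;
        s.score(); // computed once, e.g. in collect()
        s.score(); // cache hit, e.g. in the relevance comparator
        System.out.println(in.scoreCalls); // 1
    }
}
```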

        Michael McCandless added a comment -

        But I think it's strange that you'll ask to sort by SCORE, and call scorer.score() twice, incurring the overhead of each call.

        So this indeed must be what Marvin was referring to? But the 2nd .score() call hits the cache and returns quickly? I wouldn't worry about that.

        Making all other field comparators pay the price of passing around unused Float.NEG_INF scores is also wasteful.

        Shai Erera added a comment -

        I don't think we should add score back into the method signatures? Most comparators don't need the score.

        It's more for efficiency than if they need it or not. We've added setScorer to FieldComparator with a default impl which does nothing. So in a sense we've already introduced Scorer, although currently the Comps don't know about it. But I think it's strange that you'll ask to sort by SCORE, and call scorer.score() twice, incurring the overhead of each call.

        Michael McCandless added a comment -

        maybe we can change TFC.create() to check the doc fields and if one of them is RELEVANCE return a ScoringNoMaxScore collector version, and then we should be safe with adding score back to those methods signature?

        I don't think we should add score back into the method signatures? Most comparators don't need the score.

        Michael McCandless added a comment -

        Do you propose it for back-compat reasons or simply because it makes sense. Collector is not released yet so we can define that method abstract.

        Woops, it was for back-compat (I forgot Collector isn't released). So let's simply make it an abstract method?

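        The abstract-method approach agreed on here might look like the following simplified sketch (not the real Collector class):

```java
// Simplified sketch (not Lucene's actual Collector): making the method
// abstract forces every Collector subclass to declare its ordering needs.
abstract class SketchCollector {
    abstract void collect(int doc);
    abstract boolean acceptsDocsOutOfOrder();

    // An in-order top-docs collector relies on increasing docIDs (and can
    // therefore skip the docID tie-break), so it reports false.
    static class InOrder extends SketchCollector {
        int lastDoc = -1;
        @Override void collect(int doc) { lastDoc = doc; }
        @Override boolean acceptsDocsOutOfOrder() { return false; }
    }
}
```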
        Shai Erera added a comment -

        And default Collector.acceptsDocsOutOfOrder should return false.

        Do you propose it for back-compat reasons or simply because it makes sense. Collector is not released yet so we can define that method abstract.

        Thoughts on collapsing down all of these classes to 1 or 2 for Lucy in a post to the Lucy dev list entitled "SortCollector".

        I read it, but I'm not sure I agree with everything that you write there. I need to re-read it more carefully though before I can comment on it. One thing that caught my eye is that you write "I found one additional inefficiency in the Lucene implementation: score() is called twice for "competitive" docs". Where exactly did you see it? I checked TFC's code again and score() is never called twice. RelevanceComparator wraps the given Scorer with a ScoreCachingWrapperScorer, so the score() calls return almost immediately, without computing any scores.

        This was a tradeoff we've made because of the TFC instances that don't compute document scores, and so we removed the score parameter from FieldComparator.copy() and compareBottom(). We could have added it back and passed Float.NEG_INF in the non-scoring versions, but that will not work well, since we should really compute the document's score if one of the SortFields is RELEVANCE ... hmm - maybe we can change TFC.create() to check the doc fields and if one of them is RELEVANCE return a ScoringNoMaxScore collector version, and then we should be safe with adding score back to those methods' signatures?

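        The TFC.create() dispatch proposed above could be sketched roughly like this; the enum values and collector names are illustrative, not Lucene's exact API:

```java
// Sketch of the proposed create() dispatch: if any sort field is
// RELEVANCE, pick a variant that computes scores; otherwise use a
// cheaper non-scoring variant. (Names are illustrative stand-ins.)
class CollectorFactory {
    enum SortFieldType { RELEVANCE, DOC, STRING, INT }

    static String create(SortFieldType... fields) {
        for (SortFieldType f : fields) {
            if (f == SortFieldType.RELEVANCE) {
                return "ScoringNoMaxScoreCollector"; // must compute scores
            }
        }
        return "NonScoringCollector"; // no score needed: cheaper collect()
    }
}
```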
        Marvin Humphrey added a comment -

        > I think I'd lean towards the 12 impls now.

        Thoughts on collapsing down all of these classes to 1 or 2 for Lucy in a post
        to the Lucy dev list entitled "SortCollector".

        Michael McCandless added a comment -

        > Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

        Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.

        Actually BS can accept out-of-order sub-scorers? They just have to implement the Scorer.score(Collector, int maxDoc)? So, yes, they have to stay w/in the requested bracket, but inside there they can do things out of order – the collector is an instance of BolleanScorerCollector (hmm – mispelled – I'll fix) which happily accepts out of order but within bracket docs.

        But it's good to know that out-of-order scorers are not generally supported even if Lucene uses one internally for better BooleanQuery (OR) performance.

        Michael McCandless added a comment -

        Unless we assume that OUT_OF_ORDER covers DONT_CARE as well

        I think this is the case? Ie a boolean suffices.

        For Collector that boolean means "can accept docs out of order".
        For the Scorer it means "might deliver docs out of order".

        Where Collector is given as argument, ask it about orderness and create the appropriate Scorer.

        Good. And default Collector.acceptsDocsOutOfOrder should return
        false.

        Where we create our own Collector (i.e. TFC and TSDC) decide on our own what is better. Maybe always ask out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order collector. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine.

        I think this is good, though we should 1) ask the Scorer for an
        out-of-order Scorer, but then once we get the resulting scorer back we
        should 2) ask that instance if in fact it will ever return
        out-of-order (all except BS will not), and then 3) pick a collector
        that's optimized for in-order collection if the scorer always
        returns in-order docs.

        The big problem is the fact that we get Scorers per segment, but
        Collector once. Actually it may not be a problem: maybe for the first
        segment we do the logic above, but then for subsequent segments we
        explicitly ask for an in-order Scorer if the first one was in-order?
        Ie we can enforce homogeneity ourselves?

        This would require deferring creating the Collector until we've seen
        the Scorer for the first segment.

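        The deferred-creation idea above can be sketched as follows; all types here are hypothetical stand-ins for the real search loop:

```java
// Sketch of deferral plus homogeneity: inspect the first segment's
// scorer, pick a matching collector, then explicitly request an
// in-order scorer for the remaining segments so all segments agree.
class SearchSketch {
    interface Scorer { boolean outOfOrder(); }
    interface Weight { Scorer scorer(boolean allowOutOfOrder); }

    static String search(Weight weight, int numSegments) {
        Scorer first = weight.scorer(true);        // ask for the fastest scorer
        boolean outOfOrder = first.outOfOrder();   // then check what we actually got
        String collector = outOfOrder ? "OutOfOrderCollector" : "InOrderCollector";
        for (int seg = 1; seg < numSegments; seg++) {
            // enforce homogeneity: later segments must match the first
            weight.scorer(outOfOrder);
        }
        return collector;
    }
}
```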
        Shai Erera added a comment -

        add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer.

        Maybe instead add docsOrderSupportedMode() which returns IN_ORDER, OUT_OF_ORDER, DONT_CARE? I.e., instead of a boolean allow a Collector to say "I don't really care" (like Mike has pointed out, I think, somewhere above) and let the Scorer creation code decide which one to create in case it knows any better. I.e., if we know that BS performs better than BS2, and we get a Collector saying DONT_CARE, we can always return BS.
        Unless we assume that OUT_OF_ORDER covers DONT_CARE as well, in which case we can leave it as returning boolean and document that if a Collector can support OUT_OF_ORDER, it should always say so, giving the Scorer creator code a chance to decide what is the best Scorer to return.

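        The three-valued alternative just described might look like this sketch; the names are hypothetical, and the DONT_CARE branch assumes, per the discussion, that the out-of-order BooleanScorer is the faster choice:

```java
// Sketch of a three-valued ordering mode: the collector reports a mode,
// and the scorer-creation code resolves DONT_CARE to whichever scorer
// is known to be faster (mirroring the BS-vs-BS2 discussion above).
class ScorerChooser {
    enum DocsOrderMode { IN_ORDER, OUT_OF_ORDER, DONT_CARE }

    static String choose(DocsOrderMode mode) {
        switch (mode) {
            case IN_ORDER:
                return "BooleanScorer2"; // must deliver docs in order
            default:
                return "BooleanScorer";  // faster when order is free
        }
    }
}
```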
        In IndexSearcher we can then:

        1. Where Collector is given as argument, ask it about orderness and create the appropriate Scorer.
        2. Where we create our own Collector (i.e. TFC and TSDC) decide on our own what is better. Maybe always ask out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order collector. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine.

        I like the approach of deprecating Weight and creating an abstract class, though that requires deprecating Searchable and creating an AbstractSearchable as well. Weight can be wrapped with an AbstractWeightWrapper and passed to the AbstractWeight methods (much like we do with AbstractHitCollector from LUCENE-1575), defaulting its scorer(inOrder) method to call scorer()?

        This I guess should be done in the scope of that issue, or I revert the changes done to Query (adding scoresDocsInOrder()), but keep those done to TFC and TSDC, and make that optimization in a different issue, which will handle Weight/Searchable and the rest of the changes proposed here?

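        The wrapper idea can be sketched as below; AbstractWeight and the method names are hypothetical, mirroring the AbstractHitCollector approach mentioned above:

```java
// Sketch of the proposed back-compat shim: wrap a legacy Weight (which
// only knows scorer()) in an abstract-class adapter whose new
// scorer(boolean) overload simply ignores the order flag.
class WeightShim {
    interface LegacyWeight { String scorer(); }

    static abstract class AbstractWeight {
        abstract String scorer(boolean allowDocsOutOfOrder);
    }

    static AbstractWeight wrap(LegacyWeight legacy) {
        return new AbstractWeight() {
            @Override String scorer(boolean allowDocsOutOfOrder) {
                return legacy.scorer(); // legacy weights are order-agnostic
            }
        };
    }
}
```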
        Yonik Seeley added a comment -

        Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

        Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.

        Marvin Humphrey added a comment -

        I made Weight a subclass of Query and all of a sudden Searcher method
        signatures got easier to manage.

        PS: Is this a good place to discuss why having rambling conversations in the bug tracker is a bad idea, or should I open a new issue?

        Michael McCandless added a comment -

        BooleanScorer could be automatically used when appropriate

        If we do this (and I think we should – good perf gains, though I haven't tested just how good, recently), then we should deprecate setAllowDocsOutOfOrder in favor of Weight.scorer(boolean allowDocsOutOfOrder). And make it clear that internally Lucene may ask for either scorer, depending on the collector.

        Michael McCandless added a comment -

        Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?

        Michael McCandless added a comment -

        IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false.

        In early iterations on LUCENE-1483, we allowed Collector.setNextReader
        to return a new Collector on the possibility that a new segment might
        require different collector.

        We could consider going back to that... and allowing the builtin
        collectors to receive a Scorer on creation, which they could interact
        with to figure out in/out of order types of issues. We could then
        also enrich setNextReader a bit to also receive a Scorer, so that if
        somehow the Scorer for the next segment switched to be in-order vs
        out-of-order, the Collector could properly "respond".

        Or we could require "homogeneity" for Scorer across all segments
        (which'd be quite a bit simpler).

        Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

        I tentatively like this approach, ie add an API to Collector for it to
        declare if it can handle out-of-order collection, and then ask for the
        right Scorer.

        But still internal creation of Collectors could go both ways, and so
        we should retain the freedom to optimize (the BooleanScorer example
        above).

        Michael McCandless added a comment -

        This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

        Yonik would you suggest we migrate Weight to be an abstract class
        instead? (This is also being discussed in a separate thread on
        java-dev, if you want to respond there...).

        Michael McCandless added a comment -

        One further optimization can be enabled if we can properly "mate"
        out-of-orderness between Scorer & Collector: BooleanScorer could be
        automatically used when appropriate.

        Today, one must call "BooleanQuery.setAllowDocsOutOfOrder" which is
        rather silly (it's very much an "under the hood" detail of how the
        Scorer interacts w/ the Collector). The vast majority of time it's
        Lucene that creates the collector, and so now that we can create
        Collectors that either do or do not care if docs arrive out of order,
        we should allow BooleanScorer when we can.

        Though that means we have two ways to score a BooleanQuery:

        • Use BooleanScorer2 w/ a Collector that doesn't fallback to docID
          to break ties
        • Use BooleanScorer w/ a Collector that does fallback

        We'd need to test which is most performant (I'm guessing the 2nd
        one).

        So maybe we should in fact add a "acceptsDocsOutOfOrder" to
        Collector.

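        The tie-break difference between the two variants can be sketched like this (a simplified model, not the actual collector code):

```java
// Sketch of the tie-break tradeoff: with out-of-order delivery, equal
// scores must fall back to docID (smaller doc wins); with in-order
// delivery, a later doc can never beat an equal earlier one, so the
// extra comparison can be dropped from the hot collect() path.
class TieBreak {
    // true if (score, doc) beats the current weakest entry (topScore, topDoc)
    static boolean competitiveOutOfOrder(float score, int doc,
                                         float topScore, int topDoc) {
        return score > topScore || (score == topScore && doc < topDoc);
    }

    static boolean competitiveInOrder(float score, float topScore) {
        return score > topScore; // docIDs only increase: ties never compete
    }
}
```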
        Michael McCandless added a comment -

        What will TermQuery do?

        Oh: it's fine to return an in-order scorer, always. It's just that if
        a Query wants to use an out-of-order scorer, it should also implement
        an in-order one. Ie, there'd be a "mating process" to match the
        scorer to the collector.

        That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true.

        And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs at random, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case?

        Well... we have to be careful. EG say we had some great optimization
        for iterating over matches to PhraseQuery, but it returned docs out of
        order. In that case, I think we'd preserve the in-order Scorer as
        well?

        But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

        One case might be a random access filter AND'd w/ a BooleanQuery. In
        that case I could ask for a BooleanScorer to return a DISI whose next
        is allowed to return docs out of order, because 1) my filter doesn't
        care and 2) my collector doesn't care.

        Though, we are thinking about pushing random access filters all the
        way down to the TermScorer, so this example isn't realistic in that
        future... but it still feels like "out of order iteration" and "I'm
        top scorer or not" are orthogonal concepts.

        Yonik Seeley added a comment -

        Query objects are relatively abstract. Weights are created only with respect to a Searcher, and Scorers are created only from within that context with respect to an IndexReader. It really seems like we should maintain this separation and avoid putting implementation details into the Query object (or the Weight object for that matter).

        A user might want to know what Collector implementation to create before calling search(Query, Collector)

        Having to create a certain type of collector sounds error prone.
        Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait.

        Shai Erera added a comment -

        docsInOrder() would be an implementation detail ... should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?

        There are two problems with that:

        1. IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false.
        2. A user might want to know what Collector implementation to create before calling search(Query, Collector), and I don't think we should ask the users to call query.weight.scorer() just to obtain that information.

        So I understand why at the end it's a Scorer attribute, but Scorers really belong to Queries and so this can be viewed also as a Query attribute.

        Yonik Seeley added a comment -

        docsInOrder() would be an implementation detail (and could actually vary per reader or per segment) and should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?

        Shai Erera added a comment -

        BTW, I wonder if instead of "Query.scoresDocsInOrder" we should allow one to ask the Query for either/or?

        I'm afraid this might mean a larger change. What will TermQuery do? Today it returns true, and does not have any implementation that can return docs out-of-order. So what should TQ do when outOfOrderScorer is called? Just return what inOrderScorer returns, or throw an exception?

        That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true.

        And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs at random, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case?

        But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

        I agree. I think that iteration on Scorer is dictated to be in order because it extends DISI with next() and skipTo() methods which don't imply in any way they can return something out of order (besides next() maybe, but it will be hard to use such next() with a skipTo()).

        Michael McCandless added a comment -

        BTW, I wonder if instead of "Query.scoresDocsInOrder" we should allow one to ask the Query for either/or?

        Ie, a BooleanQuery can produce a scorer that scores docs in order; it's just lower performance.

        Sure, our top doc collectors can accept "in order" or "out of order" collection, but perhaps one has a collector out there that must get the docs in order, so shouldn't we be able to ask the Query to give us docs "always in order" or "doesn't have to be in order"?

        Also: I wonder if we would ever want to allow for non-top-scorer usage that does not return docs in order? Ie, next() would be allowed to yield docs out of order. Obviously this is not allowed today... but we are now mixing "top vs not-top" with "out-of-order vs in-order", where maybe they should be independent? But I'm not sure in practice when one would want to use an out-of-order non-top iterator.

        Michael McCandless added a comment -

        Not in this issue though, right?

        Right: I'm back into the mode of throwing out all future improvements I know of, to help guide us in picking the right next step. These would all be done in separate issues, and many of them would not be done "today" but still we should try not to preclude them for "tomorrow".

        I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore.

        Exactly, and I think/hope this'd be achievable.

        Shai Erera added a comment -

        Another "things we should improve about the Scorer API".

        Not in this issue though, right?

        I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore.

        Shai Erera added a comment -

        Patch includes:

        1. New scoresDocsInOrder to Query
          • Default to false
          • Override in extensions to return true, except in BQ which still returns false until we resolve how BQ is used explicitly (top-score vs. not). In some queries that delegate the work, I used the delegatee's value, or return true only if all sub-queries return true.
        2. Changed TopFieldCollector and TopScoreDocCollector to take a docsScoredInOrder parameter and create the appropriate instance (breaking ties by doc Id or not).
        3. Added TestTopScoreDocCollector and a test case to TestSort which test out-of-order collection (they trigger the use of BooleanScorer, though whether document collection happens truly out of order I cannot tell).
        4. Updates to CHANGES

        All tests pass, including test-tag. BTW, the patch also includes the fix to TestSort in tag, but without the fix for MultiSearcher and ParallelMultiSearcher on tag as I'm not sure if we should back-port the fix as well.
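        The tie-breaking difference in point 2 can be illustrated with a self-contained toy collector. This is not the real TopScoreDocCollector — just a minimal top-1 sketch with invented names showing why in-order collection needs no docID comparison:

```java
// Illustrative top-1 collector, not Lucene's TopScoreDocCollector.
class Top1Collector {
    final boolean docsScoredInOrder;
    float topScore = Float.NEGATIVE_INFINITY;
    int topDoc = -1;

    Top1Collector(boolean docsScoredInOrder) {
        this.docsScoredInOrder = docsScoredInOrder;
    }

    void collect(int doc, float score) {
        if (docsScoredInOrder) {
            // New docs always have larger ids, so on a tied score the incumbent
            // (smaller docID) wins automatically: a plain '>' suffices and the
            // docID tie-break comparison disappears from the hot loop.
            if (score > topScore) {
                topScore = score;
                topDoc = doc;
            }
        } else {
            // Out-of-order collection must break score ties by docID explicitly.
            if (score > topScore || (score == topScore && doc < topDoc)) {
                topScore = score;
                topDoc = doc;
            }
        }
    }
}
```

        Both variants return the same top document; the in-order variant simply saves a comparison per collected doc.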

        Michael McCandless added a comment -

        Another "things we should improve about the Scorer API":

        Enrich Scorer API to optionally provide more details on positions that
        caused a match to occur.

        This would improve highlighting (LUCENE-1522) since we'd know exactly
        why a match occurred (single source) rather than trying to
        reverse-engineer the match.

        It'd also address a number of requests over time by users on "how can
        I get details on why this doc matched?".

        I think if we did this, the *SpanQuery would be able to share much
        more w/ their "normal" counterparts; this was discussed @
        http://www.nabble.com/Re%3A-Make-TermScorer-non-final-p22577575.html.
        Ie we would have a single TermQuery, just as efficient as the one
        today, but it would expose a "getMatches" (say) that enumerates all
        positions that matched.

        Then, if one wanted these details for every hit in the topN, we
        could make an IndexReader impl that wraps TermVectors for the docs in
        the topN (since TermVectors are basically a single-doc inverted
        index), run the query on it, and request the match details per doc.
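        A toy version of the term-vector idea, with invented names: a term vector is treated as a single-doc map from term to positions, and a hypothetical getMatches enumerates the positions behind a match.

```java
import java.util.*;

// Toy sketch: a term vector is a single-doc inverted index (term -> positions),
// so re-running the query's terms against it yields the exact positions that
// caused the match. 'getMatches' and the class name are invented for this sketch.
class SingleDocTermVector {
    private final Map<String, int[]> positionsByTerm = new HashMap<>();

    void add(String term, int... positions) {
        positionsByTerm.put(term, positions);
    }

    // All positions at which any of the query's terms occur in this doc.
    SortedSet<Integer> getMatches(List<String> queryTerms) {
        SortedSet<Integer> matches = new TreeSet<>();
        for (String term : queryTerms) {
            for (int pos : positionsByTerm.getOrDefault(term, new int[0])) {
                matches.add(pos);
            }
        }
        return matches;
    }
}
```

        A real implementation would of course run the actual Query (including phrase and span constraints) against the term-vector-backed reader, rather than a flat term list.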

        Michael McCandless added a comment -

        I think we should have an issue handling interfaces deprecation in general for 2.9, since just deprecating Weight does not solve it. You'd have to deprecate Searchable.search* methods which accept Weight, but Searchable is an interface, so you might want to deprecate it entirely and create an AbstractSearchable? That I think also deserves its own thread, don't you think?

        Yes, and this presumably depends on the outcome of the first "how much can change in 3.0" thread.

        I thought that perhaps we can make the following change

        Once again I'm lacking clarity.... there are many related possible
        improvements to searching:

        • This "top" vs "not-top" scorer difference being more explicit
        • Merging Query/Filter (LUCENE-1518), allowing Filter as a clause to
          BooleanQuery (LUCENE-1345): it still feels like Query should be a
          subclass of Filter, since Query "simply" adds scoring to a
          Filter.
        • Pushing random-access filters down to the TermScorers, and
          pre-multiplying in deletes when possible (LUCENE-1536)
        • Similarly pushing "bottomValue" down to TermScorers for
          field-sorted searching
        • Have a single query make a "cheap" and "expensive" scorer so that
          all "cheap" scorers are checked first and only if they pass are
          expensive ones checked (LUCENE-1252)
        • The possible "Scorer.check" (LUCENE-1614) to test if a doc passes
          w/o next'ing
        • For AND scoring, picking carefully in what order to test the
          iterators, maybe also choosing when to use "check" instead of
          "advance" for some.
        • "Multiplying out" compound queries. EG +X (A OR B) makes a nested
          BooleanQuery; multiplying it out and then somehow sharing a single
          iterator for X's TermScorer, should give better performance.
          Other "structural" optimizations could apply.
        • Far-out, and not really affecting APIs, but still related: source
          code specialization (LUCENE-1594) to get speedups

        I'm not yet sure what steps to take now (and how) vs later...

        Shai Erera added a comment -

        I'm not sure we can make such a change even in 3.0. Ie, all that's "special" about 3.0 is we get to remove deprecated APIs, and begin using Java 1.5 language features.

        I'd like to discuss that in a separate thread, where it will have more visibility ... I'm a bit puzzled by what 3.0 means, but it should be discussed outside the scope of this issue.

        So maybe we can make a new abstract class called AbstractWeight ...

        I think we should have an issue handling interfaces deprecation in general for 2.9, since just deprecating Weight does not solve it. You'd have to deprecate Searchable.search* methods which accept Weight, but Searchable is an interface, so you might want to deprecate it entirely and create an AbstractSearchable? That I think also deserves its own thread, don't you think?

        When I thought about the ambiguity that we have in BS2 between score(Collector) and next()/skipTo() and the proposal to have topScorer() and scorer(), I thought that perhaps we can make the following change (we'd have to solve the Weight-interface problem first):

        1. Define on Weight a score(IndexReader, Collector) API which will be called instead of the topScorer() proposal.
        2. Keep the scorer(IndexReader) API - this should be used for iterating over the Scorer.
        3. Make Scorer.score(Collector) package-private so that it can still be used by Weight.score(IndexReader, Collector), but not by anyone else. That will effectively remove that API from Scorer, but still keep the impl there so we make the least amount of changes to the current Scorers and Weights.
          • We should document that it should not be used, even from inside Lucene's code unless there's a really good reason. Everyone, including Lucene should use the Weight.score(IndexReader, Collector) API.

        That should present a clean and clear API, i.e. topScorer() and scorer() might not be understood well, and we'd need to document their usage clearly, and we don't have a way to enforce that once topScorer() is called, score(Collector) will be the only method that's used and not next()/skipTo().
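        A stand-alone sketch of this proposal, with all names invented (not the actual Weight/Scorer API): Weight owns the collect-everything loop, while scorer() remains the iteration entry point.

```java
import java.util.*;
import java.util.function.IntConsumer;

// Invented stand-ins for the proposal; not the real Lucene interfaces.
interface SketchScorer {
    boolean next();
    int doc();
}

abstract class SketchWeight {
    // The iteration entry point stays public...
    abstract SketchScorer scorer();

    // ...while "score everything into a collector" becomes Weight's job,
    // standing in for the now-hidden Scorer.score(Collector) loop.
    final void score(IntConsumer collector) {
        SketchScorer s = scorer();
        while (s.next()) {
            collector.accept(s.doc());
        }
    }
}

// Minimal concrete weight whose scorer emits docs 0..max-1.
class RangeWeight extends SketchWeight {
    private final int max;
    RangeWeight(int max) { this.max = max; }

    @Override
    SketchScorer scorer() {
        return new SketchScorer() {
            private int d = -1;
            public boolean next() { return ++d < max; }
            public int doc() { return d; }
        };
    }
}
```

        Because score(...) is final on the abstract base, callers cannot mix the "collect everything" path with raw next()/skipTo() iteration — which is exactly the ambiguity the proposal wants to remove.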

        Michael McCandless added a comment -

        So I'm now convinced this breaks back-compat.

        Woops, yes it does. Grr.

        The thing is... I'm not sure we can make such a change even in 3.0. Ie, all that's "special" about 3.0 is we get to remove deprecated APIs, and begin using Java 1.5 language features. I'm not sure if a sudden change in runtime behavior ("you must call Scorer.init() before calling next or skipTo") is allowed.

        Maybe we could make a Weight.initializableScorer, that returns a Scorer that requires init() be first called. But since Weight is an interface, we can't change it. So maybe we can make a new abstract class called AbstractWeight (for lack of a better name), implementing Weight. We would deprecate Weight (and remove it at 3.0). We can make a new "get me a Scorer" API in AbstractWeight, eg, require that Scorers returned from there must have "init" called first, pass in an "isTopScorer" boolean, etc. Query would have a "abstractWeight()" method, emulated by wrapping the "weight()" method. Could something crazy like this work...? Maybe we should break out the two goals: this [new] goal is simply to migrate away from Weight as interface to AbstractWeight as abstract class, then step 2 is to make the optimizations we are discussing here.

        This is like running in a potato sack race!

        Shai Erera added a comment -

        Ok - so after I posted the last comment I took my dog out and thought about this some more. At first, I thought that I was wrong because BS and BS2 are package-private and so we can still add start() to DISI and take advantage of it in BS and BS2 only, under the assumption that they cannot be instantiated. In 3.0 we'll do the more wider change in DISIs and Scorers.

        But then I realized that someone can do this today:

        BooleanQuery bq = new BooleanQuery();
        // add some queries/clauses ...

        // 'searcher' is an existing Searcher
        Scorer s = bq.weight(searcher).scorer(searcher.getIndexReader());
        // s is of type BS2
        while (s.next()) {
          // something ...
        }
        

        So I'm now convinced this breaks back-compat.

        Right?

        Shai Erera added a comment -

        Hmmm .. I think DISI.start() breaks back-compat, since if we optimize the scorers to not check if they're initialized in next() and skipTo(), then you'll get NPE (or something else will happen). Even if we fix IndexSearcher to call start(), someone may still iterate on a Scorer privately, or in custom code (I know I do).

        I think this change should go into 3.0 as well, as it's a wider change than I thought initially. It affects more than just BS2, but all of its internal classes, as well as some other Scorers. Also, I see in several scorers different TODOs to get rid of that init() check in next() and skipTo(), and so this smells like a wider change.

        Since it breaks back-compat and the change will affect not just BS/BS2, I prefer to leave that optimization out of them for now, and fix it all in 3.0, including the other scorers.

        So we have two issues for 3.0:

        1. Introduce start() in DISI and change all the classes that extend DISI to take advantage of it, as well as all the code that uses DISI to call start().
        2. Introduce topScorer() to Weight, and take advantage of it where it makes sense (currently we know of BW), and change all the code that calls scorer.score(Collector) to request a topScorer() from Weight.

        Since Scorer extends DISI these often look to be the same usage, but I think they are different, with different use cases. What do you think?
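        The DISI.start() idea in point 1 can be sketched in isolation. The names below are hypothetical, not the real DocIdSetIterator API:

```java
// Sketch of the proposed two-phase DISI: start() does one-time setup so
// nextDoc() needs no lazy-initialization check. All names are invented.
abstract class SketchDISI {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Called once by the searcher before iteration begins
    // (e.g. BS2 would build its CountingSumScorer here).
    void start() {}

    // May assume start() has already been called.
    abstract int nextDoc();
}

class PrecomputedDISI extends SketchDISI {
    private int[] docs;
    private int upto = -1;

    @Override
    void start() {
        // Expensive setup happens exactly once, instead of an
        // 'if (notInitialized) init();' branch inside every nextDoc() call.
        docs = new int[] {2, 5, 9};
    }

    @Override
    int nextDoc() {
        return ++upto < docs.length ? docs[upto] : NO_MORE_DOCS;
    }
}
```

        The back-compat concern above is visible here too: nextDoc() throws NPE if start() was never called, which is exactly why the change only works once all callers are guaranteed to call start().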

        Michael McCandless added a comment -

        The way I understand it IndexSearcher will call weight.getQuery().scoresDocInOrder() in the search methods that create a Collector, in order to know whether to create an "in-order" Collector or "out-of-order" Collector. At this point it does not know whether it will use the scorer as a top-level or not. Unless we duplicate the logic of doSearch into those methods (i.e. if there is a filter know it'll be used as a top-level Collector), but I really don't like to do that.

        Yeah you're right, it is in two separate places today.

        Though since we are reworking how filters are applied, at that point
        it may very well be in one place.

        Allowing IS as well as any Collector-creating code to create the right Collector instance - in/out-of order. That is achievable by adding scoresDocsInOrder() to Query, defaulting to false (for back-compat) and override in all Query implementations, where it makes sense. For BQ I think it should remain false, with a TODO to change in 3.0 (see second bullet).

        OK let's tentatively move forwards with Query.scoresDocsInOrder.

        Clearly separate between BS and BS2, i.e. have BW create one of them explicitly without wrapping or anything. That is achievable, I think, by adding topScorer() to Weight and call it from IS. Then in BW we do whatever BS2.score(Collector) does today, hopefully we can inline it in BW. But that can happen only in 3.0. We then change scoresDocsInOrder to return false only if BQ was set to return docs out of order as well as there are 0 required scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector)).

        OK let's slate this for 3.0, then.

        BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if we add start() to DISI then in BS2 we can override it to initialize CSS, and calling start() from IS.doSearch before it starts iterating. In score(Collector) it will check if it's initialized only once, so it should be ok?

        OK let's move forwards with this too?

        Phew!

        Phew!
        Shai Erera added a comment -

        But actually: the thing calling scoresDocsInOrder will in fact only be calling that method if it intends to use the scorer as a toplevel scorer

        Are you sure? The way I understand it IndexSearcher will call weight.getQuery().scoresDocInOrder() in the search methods that create a Collector, in order to know whether to create an "in-order" Collector or "out-of-order" Collector. At this point it does not know whether it will use the scorer as a top-level or not. Unless we duplicate the logic of doSearch into those methods (i.e. if there is a filter know it'll be used as a top-level Collector), but I really don't like to do that.

        I still think there are two issues here that need to be addressed separately:

        1. Allowing IS as well as any Collector-creating code to create the right Collector instance - in/out-of order. That is achievable by adding scoresDocsInOrder() to Query, defaulting to false (for back-compat) and override in all Query implementations, where it makes sense. For BQ I think it should remain false, with a TODO to change in 3.0 (see second bullet).
        2. Clearly separate between BS and BS2, i.e. have BW create one of them explicitly without wrapping or anything. That is achievable, I think, by adding topScorer() to Weight and call it from IS. Then in BW we do whatever BS2.score(Collector) does today, hopefully we can inline it in BW. But that can happen only in 3.0. We then change scoresDocsInOrder to return false only if BQ was set to return docs out of order as well as there are 0 required scorers and < 32 prohibited scorers (the same logic as in BS2.score(Collector)).

        BTW, #2 above does not mean we cannot optimize initCountingSumScorer - if we add start() to DISI then in BS2 we can override it to initialize CSS, and calling start() from IS.doSearch before it starts iterating. In score(Collector) it will check if it's initialized only once, so it should be ok?

        What do you think?
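        Bullet 1 above could be sketched roughly like this (a hypothetical shape, not actual Lucene code — the class bodies are illustrative stand-ins):

```java
// Sketch of bullet 1: scoresDocsInOrder() on Query, defaulting to false
// for back-compat; queries that know their Scorer walks docs in increasing
// docID order override it. BQ stays false until 3.0.
abstract class Query {
    /** Default: make no promise about doc order (back-compat). */
    public boolean scoresDocsInOrder() {
        return false;
    }
}

class TermQuery extends Query {
    /** A single postings list is always walked in docID order. */
    @Override
    public boolean scoresDocsInOrder() {
        return true;
    }
}

class BooleanQuery extends Query {
    // Inherits false: BooleanScorer may visit docs out of order,
    // so callers must fall back to an out-of-order Collector.
}
```

        IndexSearcher (and any Collector-creating code) could then pick the collector variant up front via weight.getQuery().scoresDocsInOrder(), without needing to know whether the scorer ends up being used top-level.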

        Michael McCandless added a comment -

        I actually prefer to add a boolean prePopulate to HitQueue's ctor.

        +1

        Shai Erera added a comment -

        Good point - can you update HitQueue's javadocs stating this new behavior?

        I actually prefer to add a boolean prePopulate to HitQueue's ctor. Since it's package-private it should be ok, and I noticed it is used in several classes like MultiSearcher, ParallelMultiSearcher which rely on its size() value. They just collect the results from the different searchers. I don't think they should be optimized to pre-populate the queue, so instead of changing their code to not rely on size(), I prefer them to init HQ with prePopulate=false.

        If changing the ctor is not accepted, then perhaps add another ctor which accepts that argument?

        Regarding the rest, I need to read them carefully before I'll comment.
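        A simplified, self-contained sketch of the proposed ctor flag (not Lucene's actual HitQueue — ScoreDoc is inlined here for illustration):

```java
// Sketch: prePopulate=true fills the queue with sentinel ScoreDocs up
// front, so TopScoreDocCollector can skip its reusableSD == null check;
// callers like MultiSearcher that rely on size() pass prePopulate=false.
class HitQueueSketch {
    static final class ScoreDoc {
        int doc;
        float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final ScoreDoc[] heap;
    private int size;

    HitQueueSketch(int maxSize, boolean prePopulate) {
        heap = new ScoreDoc[maxSize];
        if (prePopulate) {
            // Sentinels lose every comparison: -inf score, huge docID.
            for (int i = 0; i < maxSize; i++) {
                heap[i] = new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY);
            }
            size = maxSize; // note: size() == maxSize from the very start
        }
    }

    int size() { return size; }
}
```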

        Michael McCandless added a comment -

        Ok sleeping did help.

        OK...good morning (afternoon)!

        BTW, we should be aware that this means anyone using HitQueue needs to know that upon initialization it's filled with sentinel objects, and that its size() will be maxSize etc. Since HQ is package private I don't have a problem with it.

        Good point – can you update HitQueue's javadocs stating this new
        behavior?

        BTW, IndexSearcher.doSearch creates the Scorer, but already receives the Collector as argument, therefore at this point it's too late to make any decisions regarding orderness of docs, no?

        Urgh yeah right. So docsInOrder() should be added to Weight or Query
        I guess. Maybe name it "scoresDocsInOrder()"?

        Add docsInOrder to Weight (it's an interface, therefore just in 3.0)

        Aside: maybe for 3.0 we should do a hard cutover of any remaining
        interfaces, like Weight, Fieldable (if we don't replace it), etc. to
        abstract classes?

        Remember that it may be used by IndexSearcher in two modes: (1) without a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo().

        I'd really love to find a way to make this more explicit. EG you ask
        the Weight for a topScorer() vs a scorer(), or something. Clearly the
        caller of .scorer() knows full well how the instance will be used (top
        or not)... we keep struggling with BS/2 because this information is
        not explicit now.

        This would also enable BQ.scorer() to directly return a BS vs BS2,
        rather than the awkward BS2 wrapping a BS internally in its
        score(Collector) method.

        So how about adding a "topScorer()" method, that defaults to scorer()?
        (Ugh, we can't do that until 3.0 since Weight is an interface).
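        The idea might look roughly like this once Weight becomes an abstract class (hypothetical sketch; DummyWeight exists only to show the default in action):

```java
// Sketch of the proposed topScorer() hook: callers that intend to drive
// score(Collector) ask for topScorer(); secondary scorers come from
// scorer(). The default keeps back-compat; only BooleanWeight would
// override topScorer() to hand back a BooleanScorer directly instead of
// a BooleanScorer2 wrapping one internally.
interface Collector {
    void collect(int doc);
}

abstract class Scorer {
    public abstract void score(Collector collector);
}

abstract class Weight {
    /** Scorer used as a sub-scorer via next()/skipTo(). */
    public abstract Scorer scorer();

    /** Scorer used top-level via score(Collector); defaults to scorer(). */
    public Scorer topScorer() {
        return scorer();
    }
}

class DummyWeight extends Weight {
    private final Scorer s = new Scorer() {
        @Override
        public void score(Collector collector) { /* no-op for the sketch */ }
    };

    @Override
    public Scorer scorer() {
        return s;
    }
}
```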

        But actually: the thing calling scoresDocsInOrder will in fact only be
        calling that method if it intends to use the scorer as a toplevel
        scorer (ie, it will call scorer.score(Collector), not
        scorer.next/skipTo)? So BQ.scoresDocsInOrder would first check if
        it's gonna defer to BS (it's a simple OR query) and then check BS's
        static setting.

        In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create out-of-order collectors. That might impact performance if there are no BQ clauses, but I assume it is not used much? And this doesn't break back-compat since that's the only way to instantiate an out-of-order Scorer today (besides creating your own).

        This back compat break worries me a bit. EG maybe Solr has some
        scorers that run out-of-order?

        Also this is not really a clean solution: sure, it's only BQ today
        that does out-of-order scoring, but I can see others doing it in the
        future. I can also see making out-of-order-scoring more common and in
        fact the default (again) for BQ, since it does give good performance
        gains. Maybe other Query scorers should use it too.

        So I think I'd prefer the "add scoresDocsOutOfOrder" to Query. Note
        that this must be called on the rewritten Query.

        And since they can always create the Query and only then create the Collector, if we add that info to Query they should have enough information at hand to create the proper Collector instance.

        Right, adding the method to Query gives expert users the tools needed
        to make their own efficient collectors.

        If we do add it to Query, then I'd like to deprecate BQ's static setter and getter of that attribute and provide a docsInOrder() impl, but we need to resolve how it will know whether it will use BS or BS2.

        OK, you mean make this a non-static setter/getter? I would actually
        prefer to default it to "true", as well. (It's better performance).
        But that should wait for 3.0 (be sure to open a "for 3.0" followon
        issue for this one, too!).

        BTW, even though BS does score docs out of order, it still visits docs
        in order "by bucket". Meaning when visiting docs, you know the docIDs
        are always greater than the last bucket's docIDs.

        This gives us a source of optimization: if we hold onto the "bottom
        value in the pqueue as of the last bucket", and then we can first
        compare that bottom value (with no tie breaking by docID needed) and
        if that competes, then we do the "true, current bottomValue with
        breaking tie by docID" comparison to see if it makes it into the
        queue. But I'm not sure how we'd cleanly model this "docIDs are in
        order by bucket" case of the scorer, and take advantage of that during
        collection. I think it'd require extending the FieldComparator API
        somehow, eg "get me a BottomValueComparator instance as of right
        now".
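        Very loosely, the score-only version of that two-stage check might look like this (a pure sketch — the sort-by-field variant would need the FieldComparator API extension described above):

```java
// Sketch of the "bottom as of last bucket" idea: BooleanScorer's buckets
// arrive in increasing docID ranges, so a doc that can't beat the bottom
// score frozen at the last bucket boundary can be rejected cheaply,
// without any docID tie-break.
class BucketBottomSketch {
    float queueBottomScore = Float.NEGATIVE_INFINITY; // true current bottom
    float lastBucketBottom = Float.NEGATIVE_INFINITY; // frozen per bucket

    /** Called when the scorer crosses a bucket boundary. */
    void onBucketBoundary() {
        lastBucketBottom = queueBottomScore;
    }

    /** Stage one: cheap reject; conservative, so ties fall through. */
    boolean maybeCompetitive(float score) {
        if (score < lastBucketBottom) {
            return false; // cannot compete, no tie-break needed
        }
        // Stage two (not shown): the full "compare to the current bottom,
        // break ties by docID" check before inserting into the queue.
        return true;
    }
}
```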

        This can also be the basis for a separate strong optimization, which
        is down in the TermScorer for a BooleanScorer/2, skip a docID if its
        field value is not competitive. This is a bigger change (way outside
        the scope of this issue, and in fact more related to LUCENE-1536).

        Shai Erera added a comment -

        I think I'd lean towards the 12 impls now. They are tiny classes.

        If we resolve everything else, that should not hold us back. I'll do it.

        We can mull it over some more... sleep on it.

        Ok sleeping did help. Originally I thought that the difference between our thinking is that you think that PQ should know how to construct a sentinel object, while I thought the code which uses PQ should know that. Now I realize both are true - the code which uses PQ, or at least instantiates PQ, already knows how to create those sentinel objects, since it determines which PQ impl to instantiate. I forgot for a moment that PQ is not a concrete class, and anyone using it should create his own specialized PQ, or reuse an existing one, but anyway that specialized PQ should know how to create the sentinel objects and compare them to real objects.

        So I'm ok with it - I'll make the following changes, as you suggest:

        1. Add a protected getSentinelObject() which returns null. Use it in PQ.init() to fill the queue if it's not null.
        2. Make the necessary changes to HitQueue.
        3. Remove the addSentinelObjects from PQ and the code from TSDC.

        BTW, we should be aware that this means anyone using HitQueue needs to know that upon initialization it's filled with sentinel objects, and that its size() will be maxSize etc. Since HQ is package private I don't have a problem with it. Generally speaking, the code which instantiates a PQ and the code that uses it must be in sync ... i.e., if I instantiate a PQ and pass it to some other code which just receives a PQ and adds elements to it, that code should not rely on size() being smaller or anything. I don't feel it complicates things ... and anyway someone can always create a PQ impl which receives a boolean that determines whether sentinel objects should be created or not and if not return null in its getSentinelObject().
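        The three steps above might look roughly like this (a simplified stand-in for PriorityQueue, not the real class; SentinelQueue plays the role HitQueue would):

```java
// Sketch of the plan: getSentinelObject() defaults to null; when a
// subclass supplies sentinels, initialize() fills the heap up front and
// size() reports maxSize immediately (callers must be aware of this).
abstract class PriorityQueueSketch<T> {
    private Object[] heap;
    private int size;

    /** Return a value that loses every comparison, or null to opt out. */
    protected T getSentinelObject() {
        return null;
    }

    /** Would drive upHeap/downHeap in the real class. */
    protected abstract boolean lessThan(T a, T b);

    protected final void initialize(int maxSize) {
        heap = new Object[maxSize + 1]; // 1-based heap, heap[0] unused
        if (getSentinelObject() != null) {
            for (int i = 1; i <= maxSize; i++) {
                heap[i] = getSentinelObject(); // distinct instances each time
            }
            size = maxSize;
        }
    }

    public final int size() {
        return size;
    }
}

/** Opts in with a -infinity sentinel, as HitQueue would. */
class SentinelQueue extends PriorityQueueSketch<Float> {
    SentinelQueue(int maxSize) {
        initialize(maxSize);
    }

    @Override
    protected Float getSentinelObject() {
        return Float.NEGATIVE_INFINITY;
    }

    @Override
    protected boolean lessThan(Float a, Float b) {
        return a < b;
    }
}
```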

        Maybe we should add a "docsInOrder()" method to Scorer?

        I'm not sure that will solve it ... BS2 consults its allowDocsOutOfOrder only if score(Collector) is called, in which case it instantiates a BS and delegates score(Collector) to it. So suppose that BS.docsInOrder returns false, what will BS2 return? Remember that it may be used by IndexSearcher in two modes: (1) without a filter - BS2.score(Collector), (2) with filter - BS2.next() and skipTo(). So it cannot consult its own allowDocsOutOfOrder (even though it gets it as a parameter) since depending on how it will be used, the answer is different.
        BTW, IndexSearcher.doSearch creates the Scorer, but already receives the Collector as argument, therefore at this point it's too late to make any decisions regarding orderness of docs, no?

        There are few issues that we need to solve:

        1. A user can set BooleanQuery.setAllowDocsOutOfOrder today, which may trigger BS2.score(Collector) to instantiate BS, which may screw up the Collector's logic if it assumes in-order documents. IndexSearcher creates the Collector before it knows whether BQ is used or not, so it cannot do any intelligent checks. I see two possible solutions, of which only one may be implemented now and the other in 3.0:
          1. Add docsInOrder to Weight (it's an interface, therefore just in 3.0), since that seems to allow IS to check if the current query may visit documents out-of-order.
            • Actually, maybe we add it to Query, which is abstract, and in IS we do weight.getQuery().docsInOrder()?
          2. In IS we check BQ.getAllowDocsOutOfOrder() and if true we always create out-of-order collectors. That might impact performance if there are no BQ clauses, but I assume it is not used much? And this doesn't break back-compat since that's the only way to instantiate an out-of-order Scorer today (besides creating your own).
        2. Someone can create his own Collector, but will have no way to know if the docs will be sent in-order or not. Whatever we do, we have to allow people to correctly 'predict' the behavior of their Collectors, that's why I like the BQ static setting of that variant. The user is the one that sets it to true, so he/she should know that and create their appropriate Collector instance.
          • On the other hand, if we choose to add that information to Query, those Collectors may not have that information in hand when they are instantiated ...

        So I'm torn here. Adding that information to Query will solve it for those that use the convenient search methods (i.e., those that don't receive a Collector), but provide their own Query impl, since if we add a default impl to Query which returns false (i.e., out-of-order), it should not change the behavior for them. And if they always return docs in-order, they can override it to return true.

        About those that pass in Collector ... do we really have a problem? I mean, today they have no choice but to pass in a Collector that expects out-of-order docs, right? We did not make any promises in 2.4.1 regarding document order? So in the worst case their code will be slightly inefficient by perhaps unnecessarily attempting to insert docs that compare equal to the top of the queue.
        And since they can always create the Query and only then create the Collector, if we add that info to Query they should have enough information at hand to create the proper Collector instance.

        If we do add it to Query, then I'd like to deprecate BQ's static setter and getter of that attribute and provide a docsInOrder() impl, but we need to resolve how it will know whether it will use BS or BS2.

        I apologize for the long post, but I'd like to verify that I didn't miss anything regarding back-compat and possible usage, so that we make the right decision.

        Michael McCandless added a comment -

        the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo().

        Oh I see: the Scorer cannot know on creation if it's the "top" scorer (score(Collector) will be called), or a secondary one (next()/skipTo(...) will be called).

        Hmm yeah maybe back to DISI.start(). I think as long as the actual code that will next()/skipTo(...) through the iterator is the only one that calls start(), the BS/BS2 double-start problem won't happen?

        Really, somehow, it should be explicit when a Scorer will be topmost. IndexSearcher knows this when it creates it.

        Michael McCandless added a comment -

        Use FMPP?

        I think the first question we need to answer is whether we cutover to
        specialization for this. At this point I don't think we need to, yet
        (I think the 12 classes is tolerable, since they are tiny and
        private).

        The second question is, if we do switch to specialization at some
        point (which I think we should: the performance gains are sizable),
        how should we do the generation (Python, Java, FMPP, XSLT, etc.). I
        think it's a long time before we need to make that decision (many
        iterations remain on LUCENE-1594).

        Michael McCandless added a comment -

        Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the items stored in the queue. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example.
        I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ...

        Such extensions shouldn't use a sentinel?

        Various things spook me about the separate method: one could easily
        pass in bad sentinels, and then the queue is in an invalid state; the
        method can be called at any time (whereas the only time you should do
        this is on init); you could pass in wrong-sized array; the API is
        necessarily public (whereas with getSentinel() it'd be protected).

        We can mull it over some more... sleep on it

        Right ... it's too late for me

        I've been starting to wonder if you are a robot...

        hmm I wonder why this wasn't done so far?

        I don't know! Seems like a simple optimization. So we don't need
        start/end (now at least).

        It's just that maintaining that class becomes more and more problematic.

        I completely agree: this is the tradeoff we have to mull. But I like
        that all these classes are private (it hides the fact that there are
        12 concrete impls).

        I think I'd lean towards the 12 impls now. They are tiny classes.

        But I wonder from where would we control it

        Hmm.. yeah good point. The only known place in Lucene's core that
        visits hits out of order is BooleanScorer. But presumably an external
        Query somewhere may provide a Scorer that does things out of order
        (maybe Solr does?), and so technically making the core collectors not
        break ties by docID by default is a break in back-compat.

        Maybe we should add a "docsInOrder()" method to Scorer? By default it
        returns false, but we fix that to return true for all core Lucene
        queries? And then IndexSearcher consults that to decide whether it
        can do this?

        Shai Erera added a comment -

        EG BooleanWeight.scorer(...) should call BS2's initCountingSumScorer (and/or somehow forward to BS)?

        Ok, that "somehow forward to BS" is more problematic than I initially thought. BS2.score(Collector) determines whether to instantiate a new BS, add the scorers and call bs.score(Collector), or to execute the scoring itself. On the other hand, it uses the same scorers in next() and skipTo(). Therefore there's a kind of mutual exclusiveness here: either the scorers are used by BS or by BS2. They cannot be used by both, unless we clone() them. If we want to clone them, we need to:

        • Create a BS in init().
        • Clone all the Scorers and pass them to BS.
        • Initialize BS2's countingSumScorer.
        • In score(Collector) use the class member of BS.

        hmm I wonder why this wasn't done so far?

        I think I understand now ... the decision on which path to take can only be determined after score(Collector) is called, or next()/skipTo(). Before that, i.e., when BW returns BS2 it does not know how it will be used, right? The decision is made by IndexSearcher.doSearch depending on whether there's a filter (next()/skipTo() are used) or not (score(Collector)).

        So perhaps we should revert to having start() on DISI? IndexSearcher can call start() before iterating over the docs, but not if it uses scorer.score(Collector), which is delegated to the scorer. In that case, we should check whether the countingSumScorer was initialized and, if not, initialize it ourselves.

        Am I missing something?
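        To make the tradeoff concrete, here is a minimal sketch (assumed names and structure, not BS2's real code) of the lazy-init guard that a start() method would eliminate: without start(), every entry point has to check-and-initialize itself.

```java
// Minimal stand-in for the situation described above: the scorer only
// learns how it will be driven (score(Collector) vs. next()/skipTo())
// when the first call arrives, so each entry point guards the one-time
// init. All names are illustrative, not Lucene's actual fields.
class LazyInitScorer {
    private boolean initialized = false;
    int initCalls = 0; // exposed only so the sketch is observable

    private void initCountingSumScorer() {
        initCalls++;
        initialized = true;
    }

    public boolean next() {
        if (!initialized) initCountingSumScorer(); // this check runs on every call
        return false; // no docs in this sketch
    }

    public void score(/* Collector collector */) {
        if (!initialized) initCountingSumScorer();
        while (next()) { /* collect ... */ }
    }
}
```

        A start() method would move the guard out of the hot next() loop, at the cost of the lifecycle questions raised here (calling start() twice, aggregated scorers, and so on).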

        Earwin Burrfoot added a comment -

        Forgive my ignorance, but what is FMPP?

        Forgive my laziness, http://fmpp.sourceforge.net/ - "What is FMPP? FMPP is a general-purpose text file preprocessor tool that uses FreeMarker templates."

        And to which of the above is it related?

        to this

        It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ...

        Mike experimented with generated code for specialized search, I see no reasons not to use the same approach for cases where you're already hand-coding N almost-identical classes. You're generating query parser after all
        For official release FMPP is superior to Python as it can be bundled in a crossplatform manner.

        Shai Erera added a comment -

        Forgive my ignorance, but what is FMPP? And to which of the above is it related?

        Earwin Burrfoot added a comment -

        Use FMPP? It is pretty nice and integrates well into maven/ant builds. I'm using it for primitive-specialized fieldcaches.

        Shai Erera added a comment -

        I wonder if, instead, we could define an Object getSentinelObject(), returning null by default, and the pqueue on creation would call that and if it's non-null, fill the queue (by calling it maxSize times)?

        Some extensions of PQ may not know how to construct such a sentinel object. Consider ComparablePQ, which assumes all items are Comparable. Unlike HitQueue, it does not know what will be the items stored in the queue. But ... I guess it can return a Comparable which always prefers the other element ... so maybe that's not a good example.
        I just have the feeling that a setter method will give us more freedom, rather than having to extend PQ just for that ...
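        That "Comparable which always prefers the other element" could be as simple as the following sketch (ComparablePQ itself is hypothetical and not shown):

```java
// A sentinel that always sorts before any real element, so any insertion
// displaces it. Note it deliberately violates the usual compareTo
// symmetry contract, which is part of why this feels awkward.
class AlwaysLosesSentinel implements Comparable<Object> {
    @Override
    public int compareTo(Object other) {
        return -1; // the sentinel always "loses" to the other element
    }
}
```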

        Actually ... shouldn't Weight.scorer(...) in general be the place where such "pre-next() initialization" is done?

        Ok - so BS's add() is only called from BS2.score(Collector). Therefore BS can be initialized from BS2 directly. Since both are package-private, we should be safe.
        BS2's add() is called from BooleanWeight.scorer() (I'm sorry if I repeat what you wrote above, but that's just me learning the code), and so can be initialized from there ... hmm I wonder why this wasn't done so far?

        I'll give it a try.

        That basically undoes the "don't fallback to docID" optimization right?

        Right ... it's too late for me

        I guess since the 6 classes are hidden under the TopFieldCollector.create it's maybe not so bad?

        It's just that maintaining that class becomes more and more problematic. It already contains 6 inner classes, which duplicate the code to avoid 'if' statements. Meaning every time a bug is found, all 6 need to be checked and fixed. With that proposal, it means 12 ...

        But I wonder from where we would control it ... IndexSearcher no longer has a ctor which allows defining whether docs should be collected in order or not (the patch removes it). The only other place where it's defined is BooleanQuery's static setter (which affects all boolean queries). But BooleanWeight receives the Collector, and does not create it ...
        So, if we check in IndexSearcher's search() methods whether this parameter is set or not, we can control the creation of TSDC and TFC. And if anyone else instantiates them on his own, he should know whether he executes searches in-order or not. Back-compat-wise, TFC and TSDC are still in trunk and haven't been released, so we shouldn't have a problem right?

        Does that sound like a good approach? I still hate to duplicate the code in TFC, but I don't think there's any other choice. Maybe create completely separate classes for TFC and TSDC? although that will make code maintenance even harder.
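        The create() factory idea can be sketched like so (hypothetical names; the real TopFieldCollector.create has more flags and actual collector logic). Two boolean axes yield four private impls here; the real class would have more, but callers never see them:

```java
// Sketch of hiding near-identical collector variants behind a factory,
// as discussed above. All names are illustrative.
abstract class SketchTopCollector {
    abstract String variant(); // stand-in for the specialized collect() logic

    static SketchTopCollector create(boolean docsInOrder, boolean trackScores) {
        if (docsInOrder) {
            return trackScores ? new InOrderScoring() : new InOrderNonScoring();
        }
        return trackScores ? new OutOfOrderScoring() : new OutOfOrderNonScoring();
    }

    private static final class InOrderScoring extends SketchTopCollector {
        String variant() { return "in-order, scoring"; }
    }
    private static final class InOrderNonScoring extends SketchTopCollector {
        String variant() { return "in-order, non-scoring"; }
    }
    private static final class OutOfOrderScoring extends SketchTopCollector {
        String variant() { return "out-of-order, scoring"; }
    }
    private static final class OutOfOrderNonScoring extends SketchTopCollector {
        String variant() { return "out-of-order, non-scoring"; }
    }
}
```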

        Michael McCandless added a comment -

        I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC?

        Woops, sorry, you're right. I'm just losing my mind.

        I think the javadoc for PriorityQueue.addSentinelObjects should state
        that the Objects must all be logically "equal"? Ie we do a straight
        copy into the pqueue, so if they are not equal then the pqueue is in a
        messed up state.

        Actually that method is somewhat awkward. I wonder if, instead, we
        could define an Object getSentinelObject(), returning null by default,
        and the pqueue on creation would call that and if it's non-null, fill
        the queue (by calling it maxSize times)?
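        As a hedged sketch of that idea (a simplified stand-in for Lucene's PriorityQueue, with only the parts relevant here):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrates the getSentinelObject() proposal: null by default, and when
// a subclass supplies a sentinel, the queue pre-fills itself with maxSize
// copies at construction, so a collector never sees a partially full queue.
abstract class SketchPriorityQueue<T> {
    private final List<T> heap = new ArrayList<>();

    SketchPriorityQueue(int maxSize) {
        if (getSentinelObject() != null) {
            // The sentinels are copied in without re-heapifying, so they
            // must all be logically "equal" or the heap would be invalid.
            for (int i = 0; i < maxSize; i++) {
                heap.add(getSentinelObject());
            }
        }
    }

    /** Subclasses that know a "loses to everything" element override this. */
    protected T getSentinelObject() { return null; }

    int size() { return heap.size(); }
}

// HitQueue-style subclass: a score of Float.NEGATIVE_INFINITY loses every
// comparison, matching the TSDC pre-population discussed in this issue.
class SketchHitQueue extends SketchPriorityQueue<Float> {
    SketchHitQueue(int maxSize) { super(maxSize); }
    @Override
    protected Float getSentinelObject() { return Float.NEGATIVE_INFINITY; }
}
```

        With getSentinelObject() protected, the concerns about a public setter go away: there is no way to pass a wrong-sized array or call it after init, and a subclass that cannot build a sentinel simply keeps the null default.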

        Could be useful - but then we should probably do it on DocIdSetIterator with default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not add an end() too, allowing a DISI to release resources?

        Actually.... shouldn't Weight.scorer(...) in general be the place
        where such "pre-next() initialization" is done? EG
        BooleanWeight.scorer(...) should call BS2's initCountingSumScorer
        (and/or somehow forward to BS)?

        Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means an extra 'if'. If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (6 additional classes).

        Yeah, that (the * 2 splintering) is what I was fearing. At some
        point we should leave this splintering to source code
        specialization...it's getting somewhat crazy now.

        Perhaps we should still attempt to add to PQ if cmp == 0?

        That basically undoes the "don't fallback to docID" optimization
        right?

        Or did you have something else in mind?

        The 6 new classes is what I feared we'd need to do. Else, with the
        changes here (that never break ties by docID), TopFieldCollector can't
        be used with BooleanScorer (which breaks back compat).

        I guess since the 6 classes are hidden under the
        TopFieldCollector.create it's maybe not so bad?

        Shai Erera added a comment -

        The patch still has various logic to handle the sentinel values

        Are you talking about TSDC? I thought we agreed that initializing to Float.NEG_INF is reasonable for TSDC? If not, then I can remove it from there as well as the changes done to PQ.

        Maybe we should add a "start()" method to Scorer

        Could be useful - but then we should probably do it on DocIdSetIterator with default impl, and override where it makes sense (BS and BS2)? Also, if we do this, why not add an end() too, allowing a DISI to release resources?
        And if we document that calling next() or skipTo() without calling start() first may result in unspecified behavior, it will somewhat resemble TermPositions, where you have to call next() before anything else.

        However, this should be done with caution. BS2 calls initCountingSumScorer in two places: (1) next() and skipTo() and (2) score(Collector). Now, in the latter, it first checks allowDocsOutOfOrder and if so initializes BS, adding the Scorers that were added in add(). However those Scorers must not be initialized prior to creating BS, since they will be advanced.
        So now it gets tricky - upon call to start(), what should BS2 do? Check allowDocsOutOfOrder to determine if to initialize or not? And what if it is true but score(Collector) will not be called, and instead next() and skipTo()?
        We should also protect against calling start() more than once, and in Scorers that aggregate several scorers, we should make sure their start() is called after all Scorers were added ... gets a bit complicated. What do you think?

        Also, I fear we need to conditionalize the "don't need to break ties by docID", because BooleanScorer doesn't visit docs in order?

        Yes I kept BS and BS2 in mind ... if we conditionalize anything, it means an extra 'if'. If we want to avoid that 'if', we need to create a variant of the class, which might not be so bad in TSDC, but will look awful in TFC (6 additional classes).
        Perhaps we should still attempt to add to PQ if cmp == 0?
        Or did you have something else in mind?

        Michael McCandless added a comment -

        The patch still has various logic to handle the sentinel values, but we backed away from that optimization (it's not generally safe)?

        Also, I fear we need to conditionalize the "don't need to break ties by docID", because BooleanScorer doesn't visit docs in order?

        I chose to discard that optimization, which only affects next() and skipTo().

        Maybe we should add a "start()" method to Scorer, to handle initializations like this, so that next() doesn't have to check every time?

        Shai Erera added a comment -

        The patch implements all that has been suggested except:

        • pre-populating the queue in TopFieldCollector - as was noted here previously, this seems to remove the 'if (queueFull)' check but add another 'if' in FieldComparator (which may be executed several times per collect()).
        • Move initCountingSumScorer() to BS2's ctor and add(). That's because if more than one Scorer is added we create a DisjunctionSumScorer, which initializes its queue by calling next() on the passed-in Scorer. Therefore if we call initCountingSumScorer for every Scorer added, we advance that Scorer as well as all the previous ones. I chose to discard that optimization, which only affects next() and skipTo().

        The patch also includes the fix for TestSort in the 2.4 back_compat branch. I only fixed TestSort, and not MultiSearcher and ParallelMultiSearcher.

        All tests pass.

        I also ran some performance measurements (all on SRV 2003):

        JRE      sort   best time (trunk)   best time (patch)   diff (%)
        SUN 1.6  int    1017.59             1015.96             ~1%
        SUN 1.6  doc    767.49              763.20              ~1%
        IBM 1.5  int    1018.77             1017.39             ~1%
        IBM 1.5  doc    768.10              764.14              ~1%
        As you can see, there is a slight performance improvement, but nothing too dramatic.

        You are welcome to review the patch as well as run the PerfTest I attached. It accepts two arguments: <indexDir> and [sort]. 'sort' is optional and if not defined it sorts by doc. Otherwise, whatever you pass there, it sorts by int.

        Shai Erera added a comment -

        Nothing wrong with that. Was just trying to think of a more automated approach, that will remove some responsibilities from the committers.
        It's not that important. So if there's no easy way to do it, I'm fine if we leave it as is.

        Michael McCandless added a comment -

        So if we work against the branch and when you execute "ant test-tag" we somehow detect you don't have the latest revision and download it, it'll work ok? Maybe add a checksum file, or a revision file to the branch and have the test download it and compare? Or, when the test extracts the branch, it will create a file with the revision information, and when you'll run the test it will compare the branch's revision to your file?
        I'm sure there's some way to solve it, but unfortunately I'm not overly familiar with Ant's capabilities, so I apologize if I'm not proposing anything too intelligent.

        Well, if we did branch only (no tag), we can't auto-update your checkout of the branch, unless we also auto-update your main checkout (sometimes we change internal APIs). I don't think we should auto update your main checkout when you run "ant test-tag".

        Why not just keep using the current tag approach? It works correctly... and it's not so bad to retag whenever we have to change back compat tests.

        Shai Erera added a comment -

        Ok, I understand that now. So if you develop something, and your common-build.xml is set against tag X, and then we update it to tag X+1, then the next time you'll update from SVN and run "ant test-tag", it will detect the tag is missing and download it.

        So if we work against the branch and when you execute "ant test-tag" we somehow detect you don't have the latest revision and download it, it'll work ok? Maybe add a checksum file, or a revision file to the branch and have the test download it and compare? Or, when the test extracts the branch, it will create a file with the revision information, and when you'll run the test it will compare the branch's revision to your file?
        I'm sure there's some way to solve it, but unfortunately I'm not overly familiar with Ant's capabilities, so I apologize if I'm not proposing anything too intelligent.

        Michael McCandless added a comment -

        How should I do it? Just attach a separate patch for tag?

        That would be great! Hopefully I can succeed in committing it, retagging, etc., w/o breaking anything, this time.

        why can't we test the branch by default in test-tag, which means the latest tag?

        This will cause people's checkouts to suddenly (unexpectedly) start failing "ant test-tag". EG if I have a checkout from 2 weeks ago, where I'm developing something, then suddenly someone commits a fix to the back compat branch, if my checkout simply uses the branch, I'll suddenly start seeing the back compat tests [incorrectly] fail.

        Shai Erera added a comment -

        as long as that doesn't then percolate up to requiring adding throws IOException to a public API

        Like I said, I did the change and it was local to BS2. The rest of Lucene's code which uses it already has a throws IOE declaration. Good - then I'll proceed with that change.

        Shai Erera added a comment -

        I'm ready to attach the patch; I still need to run some perf measurements with it, but that's orthogonal. I ran "ant clean test" and all tests passed, except for TestSort in the tag. The reason is that the test is invalid (it asserts that a bug exists), so we'll need to change TestSort in the tag for it to pass when this issue is committed (and the bug in MultiSearcher is fixed).

        How should I do it? Just attach a separate patch for tag? This also warrants a change in common-build.xml, to upgrade to the new tag, which I don't know how it will be named yet. Which brings me back to a question I asked recently on LUCENE-1529 - why can't we test the branch by default in test-tag, which means the latest tag? We already today test against the latest tag, only we need to define it explicitly, which leaves room for errors in case we forget to tag it, or change common-build.xml accordingly. Instead, we can allow one to specify a tag name (as -D option) and if specified, we test against that tag name, or otherwise we test against the branch. Wouldn't that:

        • Still allow people to test trunk against a specific tag?
        • Be more robust in terms that trunk's test-tag always executes against the latest branch, and we won't be required to modify common-build.xml every time we change the tag?

        Perhaps the latter can even let us change 2.4's branch a couple of times during a release, but create a single tag that's compliant with that release at its end? E.g., we've already created one tag because of changes to the trunk (the future 2.9) and now we'll need another change. But from what I understand, when 2.9 comes out, we only need one 2.4 tag which matches the changes done against 2.9?

        Anyway, that's a side discussion. For this issue, we need to update TestSort in 2.4 branch regardless. Whatever we decide about branch/tag policy will only affect whether I'll need to update common-build.xml before/after the tag is created.

        Michael McCandless added a comment -

        BooleanScorer2 is package private, so I think it's fine to add "throws IOException" to the ctor & add, as long as that doesn't then percolate up to requiring adding throws IOException to a public API (does it? Seems likely not to).

        Shai Erera added a comment -

        Same for BS2.score() and score(HC) - initCountingSumScorer?

        I proposed this change, but it is problematic. initCountingSumScorer declares throwing IOE, but neither any of the ctors or add() declares it. So I cannot add a call to initCountingSumScorer without breaking back-compat, unless:

        • I wrap IOE with RE, but I don't like it very much.
        • We close our eyes saying it's not very likely that BS2 is used on its own, outside BooleanQuery (or BooleanWeight for that matter). I did a short test and added IOE, nothing breaks (at least on the 'java' side, haven't checked contrib), and I'm pretty confident other code will not break either, since the rest of the methods do throw IOE (like score(), next(), skipTo()), and why would you want to initialize BS2 if not to call these?

        What do you think about the 2nd option?

        Michael McCandless added a comment -

        If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected.

        Argh, you are right... so the approach will fail if any of the first topN (eg 10) hits have a field value equal to the sentinel value. I guess we could do two separate passes: a "startup" pass (while queue is filling up) and a "the rest" pass that knows the queue is full. But that's getting rather ugly; probably we should leave this optimization to source code specialization.

        I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC.

        Excellent! Let's just do this as part of this issue.

        I must say I still didn't fully understand what do you mean here.

        I also did not understand how BooleanScorer works until Doug explained it (see the comment at the top).

        Right now it gathers hits of each clause in a reversed linked list, which it then makes a 2nd pass to collect. So the Collector will see docIDs in reverse order for that clause. I thought we could simply fix the linking to be forward and we'd have docIDs in order. But that isn't quite right because any new docIDs hit by the 2nd clause will be inserted at the end of the linked list, out of order.
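        The head-prepend behavior described above can be illustrated with a toy sketch (the Bucket class and method names here are illustrative stand-ins, not Lucene's actual BooleanScorer code): prepending each hit to the head of a singly-linked list yields docIDs in reverse order on traversal, and even appending via a tail pointer is not enough, because a 2nd clause appends its new docs at the end, out of global docID order.

```java
// Toy illustration (not Lucene's BooleanScorer): a singly-linked
// bucket list where each hit is prepended to the head, as in the
// original code, versus appended via a tail pointer.
import java.util.ArrayList;
import java.util.List;

public class BucketOrderDemo {
    static class Bucket {
        int doc;
        Bucket next;
        Bucket(int doc, Bucket next) { this.doc = doc; this.next = next; }
    }

    // Prepend each new doc to the head: traversal sees reverse order.
    static List<Integer> collectPrepend(int[] docs) {
        Bucket head = null;
        for (int d : docs) head = new Bucket(d, head);
        List<Integer> out = new ArrayList<>();
        for (Bucket b = head; b != null; b = b.next) out.add(b.doc);
        return out;
    }

    // Append via a tail pointer: traversal sees insertion order. But when
    // a 2nd clause appends docs the 1st clause skipped, the list is still
    // not globally sorted by docID.
    static List<Integer> collectAppend(int[] clause1, int[] clause2) {
        Bucket head = null, tail = null;
        for (int[] clause : new int[][] { clause1, clause2 }) {
            for (int d : clause) {
                Bucket b = new Bucket(d, null);
                if (tail == null) { head = tail = b; }
                else { tail.next = b; tail = b; }
            }
        }
        List<Integer> out = new ArrayList<>();
        for (Bucket b = head; b != null; b = b.next) out.add(b.doc);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectPrepend(new int[] {1, 4, 7}));            // [7, 4, 1]
        System.out.println(collectAppend(new int[] {1, 7}, new int[] {4})); // [1, 7, 4] - 4 is out of order
    }
}
```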

        Shai Erera added a comment -

        I think it should work fine, for most types, because we'd set docID (the tie breaker) to Integer.MAX_VALUE. No special additional if is then required, since that entry would always compare at the bottom?

        That's not so simple. Let's say I initialize the sentinel of IntComp to Integer.MAX_VALUE. That should have guaranteed that any 'int' < MAX_VAL would compare better. But the code in TFC compares the current doc against the 'bottom'. For all Sentinels, it means MAX_VAL. If the input doc's val < MAX_VAL, it compares better. Otherwise, it is rejected, because:

        1. If it is bigger than the bottom, it should be rejected.
        2. If it equals, it's also rejected, since now that we move to returning docs in order, it is assumed that this doc's doc Id is greater than whatever is in the queue, and so it's rejected.
          Actually, the tie is broken only after it's in queue, when the latter calls compare(). This assumption removed the 'if' that checked for doc Id value, so if I reinstate it, we don't really gain anything, right (replacing 'if (queueFull)' with 'if (bottom.doc > doc + docBase)')?

        For String we should be able to use U+FFFF.

        If we resolve the core issues with sentinel values, then this will be the value I'd use for Strings, right.

        I think we could fix this by allowing one to pass in a docbase when searching?

        I actually would like to propose the following: MultiSearcher already fixes the FD.doc before inserting it into the queue. I can do the same for the FieldDoc.fields() value, in case one of the fields is FIELD_DOC. This can happen just after searcher.search() returns, and before MS adds the results to its own FieldDocSortedHitQueue. I already did it, and all the testMultiSearch cases fail, but that's because they just assert that the bug exists.
        If you think a separate issue is still required, I can do it, but that would mean that the tests will fail until we fix it, or I don't change Sort in this issue and do it as part of the other one.

        let's fix it to be a deterministic test?

        Will do, but it depends - if a new issue is required, I'll do it there.

        I was wrong about this

        I must say I still didn't fully understand what do you mean here. I intended to leave that until after everything else works in that issue's scope, and note if there are any tests that fail, or BQ actually behaves properly. So I'll simply count on what you say is true, since I'm not familiar with that code.
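        The rejection Shai describes can be reduced to a one-line decision (a simplification of TFC's compare-against-bottom path, assuming in-order segments; names are illustrative, not Lucene's): with every slot pre-filled with Integer.MAX_VALUE sentinels, an incoming doc whose field value equals the bottom is dropped, because its docID is necessarily larger than anything already in the queue, so a real MAX_VALUE document could never compete.

```java
// Simplified sketch of the compare-against-bottom decision in a
// top-N field collector, assuming docs arrive in increasing docID
// order. Class and field names are illustrative, not Lucene's.
public class BottomCompareDemo {
    static final int SENTINEL = Integer.MAX_VALUE; // pre-populated slots

    // Returns true if the incoming value should enter the queue.
    // Ascending sort: smaller values are better.
    static boolean competes(int incomingValue, int bottomValue) {
        // Equal to bottom: rejected, because the incoming doc's ID is
        // necessarily larger than the bottom's, so it loses the tie.
        return incomingValue < bottomValue;
    }

    public static void main(String[] args) {
        // A real field value equal to the sentinel can never enter:
        System.out.println(competes(Integer.MAX_VALUE, SENTINEL)); // false
        System.out.println(competes(5, SENTINEL));                 // true
        System.out.println(competes(5, 5));                        // false - tie rejected
    }
}
```

        This is exactly why the sentinel approach fails if any real hit carries the sentinel value, and why reinstating a docID check would just move the 'if' back into the hot path.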

        Michael McCandless added a comment -

        Actually, I think BooleanScorer need not process docs out of order? The only "out of order ness" seems to come from how it appends each new Bucket to the head of the linked list; if instead it appended to the tail, the collector would see docs arrive in order. I think?

        I was wrong about this – you also get "out of order ness" due to
        2nd, 3rd, etc. clauses in the boolean query adding in new docs.
        Re-ordering those will be costly.

        But: in the initial collection of each chunk, we do know that any
        docID in the queue will be less than those being visited, so this
        could be a good future optimization for immediately ruling out docs
        during TermScorer. That's a larger change... eg it'd require being
        able to say "I don't need exact total hit count for this search".

        Michael McCandless added a comment -

        I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something)

        I think it should work fine, for most types, because we'd set docID
        (the tie breaker) to Integer.MAX_VALUE. No special additional if is
        then required, since that entry would always compare at the bottom?

        For String we should be able to use U+FFFF.

        So I debugged it and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, only in MultiSearcher).

        Hmm – good catch! I think this is in fact a bug in MultiSearcher
        (AEB is the right answer): it's failing to "break ties" (by docID)
        properly. Ie, it will not "match" what
        IndexSearcher(MultiSegmentReader(...)) will do.

        I think we could fix this by allowing one to pass in a docbase when
        searching? Though that's a more involved change... could you open a
        new issue for this one?

        I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code.

        I agree – let's fix it to be a deterministic test?
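        The docbase adjustment under discussion might look roughly like this (the FieldDoc stand-in and rebase helper below are hypothetical, not MultiSearcher's actual code): after a sub-searcher returns, both the doc ID and any FIELD_DOC sort values would be shifted by that searcher's start offset before the results are merged, so ties break on global doc IDs.

```java
// Toy sketch of shifting per-searcher doc IDs by a docBase before
// merging results from multiple sub-searchers. The classes here are
// stand-ins for Lucene's ScoreDoc/FieldDoc, not the real API.
public class DocBaseDemo {
    static class FieldDoc {
        int doc;
        Comparable[] fields; // sort values; may include a doc-ID field
        FieldDoc(int doc, Comparable[] fields) { this.doc = doc; this.fields = fields; }
    }

    // Shift the doc ID, and any doc-ID-based sort value, by the
    // sub-searcher's starting offset. isDocField[i] marks which sort
    // fields are doc-ID based (what SortField.FIELD_DOC would mean).
    static void rebase(FieldDoc fd, int start, boolean[] isDocField) {
        fd.doc += start;
        for (int i = 0; i < fd.fields.length; i++) {
            if (isDocField[i]) {
                fd.fields[i] = ((Integer) fd.fields[i]) + start;
            }
        }
    }

    public static void main(String[] args) {
        // Doc 0 of the second searcher, whose docBase is 7:
        FieldDoc fd = new FieldDoc(0, new Comparable[] { 5, 0 });
        rebase(fd, 7, new boolean[] { false, true });
        System.out.println(fd.doc);       // 7
        System.out.println(fd.fields[1]); // 7 - the tie now breaks on the global doc ID
    }
}
```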

        Shai Erera added a comment -

        Few updates (it's been long since I posted on this issue):

        • I tried to move to pre-populate the queue in TFC, but that proved to be impossible (unless I missed something). The only reason to do it was to get rid of the 'if (queueFull)' check in every collect. However, it turned out that pre-populating the queue for TFC with sentinel values is unreliable. Reason is, if I want to get rid of that 'if', I don't want to add any 'if' to FieldComparator.compare, so the sentinel values should be something like MIN_VALUE or MAX_VALUE (depends on the value of 'reverse'). However, someone can set a field value to either of these values (and there are tests that do), so I need to check in FC whether the current value is a sentinel, which adds that 'if' back, only worse - it's now executed in every compare() call. Unless I missed something, I don't think that's possible, or at least not worth the effort (to get rid of one 'if').
          • BTW, in TSDC I use Float.NEG_INF as a sentinel value. This might be broken if a Scorer decided to return that value, in which case pre-populating the queue will not work either. I think it is still safe in TSDC, but want to get your opinion.
        • I changed Sort.setSort() to not add SortField.FIELD_DOC, as suggested by Mike. But then TestSort.testMultiSort failed. So I debugged it and either the test works fine but there's a bug in MultiSearcher, or the test does not work fine (or should be adjusted) but then we'll be changing runtime behavior of Sort (e.g., everyone who used setSort might get undesired behavior, only in MultiSearcher).
          • MultiSearcher's search(Weight, Filter, int, Sort) method executes a search against all its Searchables, sequentially. The test adds documents to two indexes, odd and even: two documents (A and E) with the value "5" are added to the odd index as docs 0 and 2, and one doc (B) with value "5" is added to the even index as doc 0. When I use the 2 SortField version (2nd sorts by doc), the output is ABE, since the compare by doc uses the searcher's doc Ids (0, 0, 2) and B is always less than E and so preferred, even though its 'fixed' doc Id is 7 (it appears in the 2nd Searcher). When I use the 1 SortField version, the result is AEB, since B's fixed doc Id is 7, and now the code breaks a tie on their true doc Id.

        I hope you were able to follow my description. That's why I don't know which one is the true output ABE or AEB. I tend to say AEB, since those are the true doc Ids the application will get in the end. Few more comments:

        1. TestSort.runMultiSort contains 3 versions of this test:
          • One that sorts by the INT value only, relying on the doc Id tie breaking. You can see the output was not determined to be exact and so the pattern included is [ABE]{3}, as if the test's output is not predicted. I think that's wrong in the first place, we should always have predicted tests output, since we're not involving any randomness in that code.
          • The second explicitly sets INT followed by DOC Sorts, and expects [AB]{2}E.
          • The third relies on setSort's adding DOC, so it expects the same [AB]{2}E.

        2. The problem in MultiSearcher is that it uses FieldDocSortedHitQueue, but doesn't update the DOC's field value the same as it does to the scoreDoc.doc value (adding the current Searcher's start).

        Again, whatever the right output is, changing Sort to not include SortField.FIELD_DOC might result in someone experiencing a change in behavior (if he used MultiSearcher), even if that behavior is buggy.

        What do you think?
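        For reference, the TSDC side of the sentinel idea (point 3 of the description) can be sketched as follows. This is a simplification under stated assumptions, not the committed code: java.util.PriorityQueue stands in for Lucene's HitQueue to keep the sketch self-contained, and Lucene's own queue would avoid the poll/add pair. The queue is pre-filled with Float.NEGATIVE_INFINITY sentinels so collect() has no "is the queue full yet?" branch.

```java
// Sketch of pre-populating a top-N score queue with sentinel
// entries so collect() needs no "queue not yet full" branch.
// Simplified stand-in for TopScoreDocCollector/HitQueue.
import java.util.PriorityQueue;

public class SentinelQueueDemo {
    static class ScoreDoc {
        int doc; float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    final PriorityQueue<ScoreDoc> pq;

    SentinelQueueDemo(int numHits) {
        pq = new PriorityQueue<>(numHits, (a, b) -> Float.compare(a.score, b.score));
        // Pre-populate with sentinels: any real score beats them, so the
        // queue is "always full" and collect() has a single code path.
        for (int i = 0; i < numHits; i++) {
            pq.add(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
        }
    }

    void collect(int doc, float score) {
        ScoreDoc bottom = pq.peek();
        // Ties lose because docs arrive in increasing docID order.
        if (score > bottom.score) {
            pq.poll();
            pq.add(new ScoreDoc(doc, score));
        }
    }

    public static void main(String[] args) {
        SentinelQueueDemo c = new SentinelQueueDemo(2);
        c.collect(0, 1.5f);
        c.collect(1, 0.5f);
        c.collect(2, 2.0f); // evicts the 0.5 entry
        System.out.println(c.pq.size());       // 2
        System.out.println(c.pq.peek().score); // 1.5
    }
}
```

        As Shai notes above, this is only safe as long as no Scorer legitimately produces Float.NEGATIVE_INFINITY as a score.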

        Michael McCandless added a comment -

        if so, can we agree on the new names (add, updateTop)?

        I think it makes sense to add these, returning the min value (and deprecate the old ones).
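        One possible shape for the renamed pair, sketched on a toy binary min-heap (names and the updateTop argument are illustrative only; in the discussion the caller mutates the top element in place before calling updateTop()): both methods return the new top, so callers can cache the bottom without an extra top() call.

```java
// Minimal int-valued binary min-heap illustrating the proposed
// add()/updateTop() pair. Toy code, not Lucene's PriorityQueue.
public class MiniHeap {
    private final int[] heap;
    private int size;

    MiniHeap(int maxSize) { heap = new int[maxSize + 1]; } // 1-based array

    // Adds a value and returns the new top (minimum).
    int add(int v) {
        heap[++size] = v;
        int i = size;
        while (i > 1 && heap[i] < heap[i / 2]) { // sift up
            int t = heap[i]; heap[i] = heap[i / 2]; heap[i / 2] = t;
            i /= 2;
        }
        return heap[1];
    }

    // The caller replaced the top value; sift it down and return the new top.
    int updateTop(int newValue) {
        heap[1] = newValue;
        int i = 1;
        while (true) { // sift down
            int child = 2 * i;
            if (child > size) break;
            if (child + 1 <= size && heap[child + 1] < heap[child]) child++;
            if (heap[child] >= heap[i]) break;
            int t = heap[i]; heap[i] = heap[child]; heap[child] = t;
            i = child;
        }
        return heap[1];
    }

    int top() { return heap[1]; }

    public static void main(String[] args) {
        MiniHeap h = new MiniHeap(3);
        h.add(5); h.add(2); h.add(8);
        System.out.println(h.top());        // 2
        System.out.println(h.updateTop(9)); // 5 - old top replaced and re-sunk
    }
}
```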

        Earwin Burrfoot added a comment -

        Lucene failed to be a drop-in replacement between 2.3 and 2.4, so I don't see any reason not to do it, this time deliberately, for 3.0.

        Shai Erera added a comment -

        I believe if a user upgrades to release XX.9 and removes all code that is using deprecated methods/classes, it needs to be a jar drop in for 3.0.

        This might work, but 3.0 is also about moving to Java 5 with all the implications. If my app is already on Java 5, then a jar drop is all that'll be required. But if not, I need to update my app anyway. In addition, there are some changes in runtime behavior that are going to be made in 3.0. My point is - I don't know who will actually upgrade to 3.0 by just dropping a jar.

        But anyway, I'm not going to argue with policies - you seem to know better than me about Lucene's back-compat requirements. So the question is whether we want to deprecate these methods and add the new ones, and if so, can we agree on the new names (add, updateTop)?

        Mark Miller added a comment -

        I believe if a user upgrades to release XX.9 and removes all code that is using deprecated methods/classes, it needs to be a jar drop in for 3.0.

        Here are the back compat requirements in detail: http://wiki.apache.org/jakarta-lucene/BackwardsCompatibility

        Shai Erera added a comment -

        I thought that when we move to 3.0 we're not expected to support jar drop-in-ability? Those two methods are already declared final, so no one could override them. The only problem now is with jar drop-in-ability.

        If I'm wrong, and we are supposed to support it, then I can offer new names for these two and deprecate them:

        • toppablePut(), toppableAdjustTop() - least favored of mine, but I write them here anyway.
        • putWithOverflow, adjustTopWithOverflow - the 'withOverflow' is consistent with insert, but unlike insert, there's no overflow here.
        • offer/add (for put) and updateTop.

        You can suggest yours too. However I do think that if jar drop-in-ability is not required in 3.0, we should not change the names and just document the impl changes in 3.0.

        Mark Miller added a comment -

        In general, 3.0 gives us no extra in regards to back compat freedom. All it gives is the opportunity to remove all the deprecated stuff. So generally, if you can't do it through deprecation, you can't do it. Except that exceptions for certain complicated/beneficial issues have been made I think (ie Field has slightly loosened requirements) with proper notification and minimal impact. That probably comes down to the size of the jolt to the community.

        Shai Erera added a comment -

        Ok, so with the latest changes to LUCENE-1529, I ran "ant clean clean-tags test-tags" and I see this:

            [junit] Testcase: testPQ(org.apache.lucene.util.TestPriorityQueue):	Caused an ERROR
            [junit] org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
            [junit] java.lang.NoSuchMethodError: org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
            [junit] 	at org.apache.lucene.util.TestPriorityQueue.testPQ(TestPriorityQueue.java:61)
            [junit] 	at org.apache.lucene.util.TestPriorityQueue.testPQ(TestPriorityQueue.java:48)
            [junit] Testcase: testClear(org.apache.lucene.util.TestPriorityQueue):	Caused an ERROR
            [junit] org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
            [junit] java.lang.NoSuchMethodError: org/apache/lucene/util/PriorityQueue.put(Ljava/lang/Object;)V
            [junit] 	at org.apache.lucene.util.TestPriorityQueue.testClear(TestPriorityQueue.java:90)
            [junit] Test org.apache.lucene.util.TestPriorityQueue FAILED
        

        TestPQ fails as Mike expected.

        Since this breaks back-compat, can we still do this in 3.0? If so, I'll add it to LUCENE-1601 even though it's not a direct followup of 1575. What do you think?

        Michael McCandless added a comment -

        I thought what it was supposed to do is compile the 2.4 tests against the 2.4 JAR, then run the compiled 2.4 build/classes/test/* files against a 2.9 jar; this was the reason LUCENE-1529 was opened (to explicitly test "jar drop-in-ability").

        Shai Erera added a comment -

        According to my understanding, test-tag runs the 2.4 tests against the 2.9 jar. Does it compile the code against the 2.4 sources or against the 2.9 jar? If the latter, then I think that explains it - the compiler sees the updated PQ and therefore there's no problem. If it compiles against the 2.4 source, then I think we should see the problem, which is in my understanding the reason why we see it before calling "clean".

        Michael McCandless added a comment -

        So I "ant clean" and then "ant test-core" - all tests passed

        Hmmm something weird is going on.

        I ran this test:

        • start "ant clean test"
        • STOP that job after it's done compiling everything
        • go change PriorityQueue.put to return int instead of void
        • ant jar-core
        • restart the "ant test" job

        and I hit errors like this:

            [junit] java.lang.NoSuchMethodError: org.apache.lucene.search.PhraseQueue.put(Ljava/lang/Object;)V
            [junit] 	at org.apache.lucene.search.PhraseScorer.sort(PhraseScorer.java:139)
            [junit] 	at org.apache.lucene.search.PhraseScorer.init(PhraseScorer.java:133)
            [junit] 	at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:76)
            [junit] 	at org.apache.lucene.search.Scorer.score(Scorer.java:67)
            [junit] 	at org.apache.lucene.search.IndexSearcher.doSearch(IndexSearcher.java:269)
            [junit] 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:257)
            [junit] 	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:181)
            [junit] 	at org.apache.lucene.search.Searcher.search(Searcher.java:181)
            [junit] 	at org.apache.lucene.TestSearch.doTestSearch(TestSearch.java:124)
            [junit] 	at org.apache.lucene.TestSearch.testSearch(TestSearch.java:58)
            [junit] 	at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
        

        (which is what I expected).

        Yet, when I run "ant clean test-tag" with that same change, I don't see a failure... I fear "ant test-tag" is somehow not doing the right thing. I'll go reopen that issue...

        Shai Erera added a comment -

        Wait, I think I was wrong before. I tried ant test-core and noticed many errors such as NoSuchMethod etc. so I figured I should clean first. So I "ant clean" and then "ant test-core" - all tests passed. I then "ant clean-tags" and "ant test-tags" and all tests passed too. So I guess this change is ok and does not break back-compat?

        Shai Erera added a comment -

        what about JAR drop-in-ability?

        Darn it, you're right. It changes the method's signature and test-tag fails. I'll revert that change. Perhaps we can do that in 3.0 then? Is it acceptable to document this API change in CHANGES and still proceed with it in 2.9?

        Michael McCandless added a comment -

        I would also like to change pq.adjustTop() and pq.put() to return Object, which is heap[1].

        Sounds good...

        Both are marked final, so there shouldn't be any back-compat here, and since we're just adding a return value (rather than removing), everyone can ignore it.

        Hmm what about JAR drop-in-ability? The method signature will have changed. ("ant test-tag" now tests JAR drop-in-ability, so you could try it).

        Shai Erera added a comment -

        I would also like to change pq.adjustTop() and pq.put() to return Object, which is heap[1]. This saves calls to top() immediately following these two. Any objection? Both are marked final, so there shouldn't be any back-compat here, and since we're just adding a return value (rather than removing), everyone can ignore it. I'll also fix TSDC and TFC to not call top() after that.
        Saves 1 more method call.
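
        The proposed change can be sketched with a minimal int-valued min-heap (an illustrative reimplementation, not Lucene's actual org.apache.lucene.util.PriorityQueue): put() and adjustTop() return the new top element, so callers no longer need the follow-up top() call.

```java
// Minimal int-valued min-heap sketching the proposed API change.
// This is a sketch only, not Lucene's actual o.a.l.util.PriorityQueue:
// put() and adjustTop() return the (possibly new) top, saving the
// follow-up top() call that TSDC/TFC currently make.
class MiniPQ {
    private final int[] heap;  // heap[1] holds the least element
    private int size;

    MiniPQ(int maxSize) {
        heap = new int[maxSize + 1];
    }

    // proposed: return the new top instead of void
    int put(int element) {
        heap[++size] = element;
        upHeap();
        return heap[1];
    }

    // proposed: return heap[1] after re-heapifying
    int adjustTop() {
        downHeap();
        return heap[1];
    }

    // the "change top, then adjustTop()" pattern from the issue description
    int updateTop(int newTop) {
        heap[1] = newTop;
        return adjustTop();
    }

    int top() {
        return heap[1];
    }

    private void upHeap() {
        int i = size;
        int node = heap[i];
        int j = i >>> 1;
        while (j > 0 && node < heap[j]) {
            heap[i] = heap[j];  // shift parent down
            i = j;
            j = j >>> 1;
        }
        heap[i] = node;
    }

    private void downHeap() {
        int i = 1;
        int node = heap[i];
        int j = 2;
        while (j <= size) {
            if (j < size && heap[j + 1] < heap[j]) j++;  // pick smaller child
            if (heap[j] >= node) break;
            heap[i] = heap[j];  // shift smaller child up
            i = j;
            j = i << 1;
        }
        heap[i] = node;
    }
}
```

        A caller such as TSDC could then write `bottom = pq.updateTop(newScore)` instead of updating the top, calling adjustTop(), and then calling top().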

        Michael McCandless added a comment -

        I thought these were introduced as part of LUCENE-1483 for performance reasons. Are we sure that removing these, just for the sake of what looks like minor optimizations to TSDC and TFC (not breaking a tie on doc Id) is worth it?

        Right, this was added in LUCENE-1483, and I think we should remove it, ie IndexSearcher will always visit docIDs in order again.

        The optimization was done for the StringValCollector: that collector "in general" has better performance if you give it the biggest segments first, since the queue should converge faster and require less fallback to by-value comparisons. But I believe that is a minor optimization in general since segments tend to be biggest to smallest, already, except for indexes with many deletes on the older segments.

        My feeling is we can proceed without testing it; I believe this "not breaking tie by docID" will be a bigger win.

        Shai Erera added a comment -

        I noticed that IndexSearcher has a ctor which allows defining whether documents are intended to be collected in order or out of order. I assume we have to remove that ctor and also sortSubReaders? If we change TSDC and TFC to always assume docs come in order, this cannot be configurable.

        I thought these were introduced as part of LUCENE-1483 for performance reasons. Are we sure that removing these, just for the sake of what looks like minor optimizations to TSDC and TFC (not breaking a tie on doc Id) is worth it?

        Michael McCandless added a comment -

        Those look like great optimizations!

        I think we can indeed fix BS to send docs in order to the collector, and that allows us to not fallback to docID on hitting ties, even when using BS.

        Shai Erera added a comment -

        This sounds like it should work, but since I'm not fully familiar with that code, I can't back you up. I guess a test case will clarify it. In the meantime, I read the BooleanScorer and BooleanScorer2 code, and came across several possible optimizations:

        • BS.score() and score(HC) check if coordFactors is null on every call (this actually is problematic for score() only). I think we can init coordFactors in the ctor, as well as after every call to add(Scorer)? That is not called during query execution, but score() is.
        • Same for BS2.score() and score(HC) - initCountingSumScorer?
        • Cleanup BS.add() code a bit. For example, it checks 'if (prohibited)', does something and then 'if (!prohibited)'. Maybe merge all the required/prohibited/both cases together?
        • BS2 declares 'defaultSimilarity' and instantiates it to new DefaultSimilarity(). Two things here: (1) can the field be defined final (2) use Similarity.getDefault()?
        • BS2.SingleMatchScorer's score() method looks a bit suspicious. It checks if doc() >= lastScoredDoc and if so updates lastScoredDoc and increments coordinator.nrMatchers. It then calls scorer.score() regardless. So it looks as if this method is expected to be called for the same doc several times. When 1575 is committed, I think we should wrap the input scorer with the new ScoreCachingWrappingScorer so that the actual score will not be computed over and over? Also, I think doc() needs to be saved in a local variable.
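
        The score-caching idea in the last bullet can be sketched as follows (illustrative interfaces, not Lucene's actual Scorer or ScoreCachingWrappingScorer classes): cache score() per doc, so repeated calls for the same doc don't recompute it.

```java
// Hedged sketch of score caching for a scorer whose score() may be
// called multiple times for the same doc. SimpleScorer and
// CachingScorer are hypothetical names for illustration only.
interface SimpleScorer {
    int doc();
    float score();
}

class CachingScorer implements SimpleScorer {
    private final SimpleScorer in;
    private int cachedDoc = -1;   // no doc scored yet
    private float cachedScore;

    CachingScorer(SimpleScorer in) {
        this.in = in;
    }

    public int doc() {
        return in.doc();
    }

    public float score() {
        int d = in.doc();
        if (d != cachedDoc) {          // first score() call for this doc
            cachedScore = in.score();  // compute once
            cachedDoc = d;
        }
        return cachedScore;            // repeat calls reuse the cached value
    }
}
```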
        Michael McCandless added a comment -

        Actually, I think BooleanScorer need not process docs out of order? The only "out of order ness" seems to come from how it appends each new Bucket to the head of the linked list; if instead it appended to the tail, the collector would see docs arrive in order. I think?
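
        A minimal sketch of the head-vs-tail linking (hypothetical Bucket and BucketList types, not BooleanScorer's actual internals): appending at the tail preserves insertion order, so a collector would see docs in increasing docID order, assuming buckets are created in that order.

```java
// Sketch of the linked-list change: prepending at the head reverses
// traversal order, appending at the tail keeps it. Names are
// illustrative, not Lucene's actual BooleanScorer.Bucket.
class Bucket {
    final int doc;
    Bucket next;

    Bucket(int doc) {
        this.doc = doc;
    }
}

class BucketList {
    Bucket first, last;

    // current behavior: prepend, so traversal is out of order
    void addToHead(Bucket b) {
        b.next = first;
        first = b;
        if (last == null) last = b;
    }

    // proposed: append, so traversal follows insertion (docID) order
    void addToTail(Bucket b) {
        if (first == null) {
            first = last = b;
        } else {
            last.next = b;
            last = b;
        }
    }
}
```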

        Michael McCandless added a comment -

        Change TSDC and TFC's code to not use the doc id as a tie breaker.

        We have to remember that out-of-order docID collectors (BooleanScorer) cannot apply this optimization. When we get to this, let's make sure we have a test case that does out-of-order collection, with a field search, that shows failure if we fail to break ties by docID.

        Actually, the optimization can be applied sometimes: if we can push bottomVal testing down to each TermDocs enum (much like unfactoring random-access filter down to the TermDocs), that first pass collection in BooleanScorer is in fact in order. But that's a deeper optimization...

        Michael McCandless added a comment -

        See https://issues.apache.org/jira/browse/LUCENE-1575?focusedCommentId=12697152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12697152 for some specifics.

        Re #2: docID must still be used as a tie-breaker in the "lessThan" method, since that is used by pq during up/down heap. We don't need to use docID as tie-breaker when testing a new hit for insertion.
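
        The distinction can be sketched with illustrative types (not Lucene's actual HitQueue): lessThan(), used during up/down-heap, still breaks score ties by docID, while the cheap pre-insertion competitiveness check compares score only, which is valid once docs arrive in increasing docID order.

```java
// Sketch only; ScoreDoc and HitQueueSketch are simplified stand-ins
// for Lucene's ScoreDoc and HitQueue.
class ScoreDoc {
    final int doc;
    final float score;

    ScoreDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

class HitQueueSketch {
    // full comparison for heap ordering: on a score tie, the larger
    // docID is "less" (i.e. a worse hit), so ties stay deterministic
    static boolean lessThan(ScoreDoc a, ScoreDoc b) {
        if (a.score == b.score) {
            return a.doc > b.doc;
        }
        return a.score < b.score;
    }

    // pre-insertion check when docs arrive in increasing docID order:
    // a new doc with score <= bottom's score can never win the tie
    // (its docID is larger), so comparing score alone suffices
    static boolean competes(float newScore, ScoreDoc bottom) {
        return newScore > bottom.score;
    }
}
```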


          People

          • Assignee:
            Unassigned
            Reporter:
            Shai Erera