Lucene - Core
LUCENE-1575

Refactoring Lucene collectors (HitCollector and extensions)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This issue is a result of a recent discussion we've had on the mailing list. You can read the thread here.

      We have agreed to do the following refactoring:

      • Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
      • Deprecate HitCollector in favor of the new Collector.
      • Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
        • Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
        • HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
        • It will remove any instanceof checks that currently exist in IndexSearcher code.
      • Create a new (abstract) TopDocsCollector (sketched after this list), which will:
        • Leave collect and setNextReader unimplemented.
        • Introduce protected members PriorityQueue and totalHits.
        • Introduce a single protected constructor which accepts a PriorityQueue.
        • Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, or overridden.
        • Introduce a new topDocs(start, howMany) method as a convenience for search applications that page through results. It will also attempt to reduce memory allocation by allocating a ScoreDoc[] of only the requested size.
      • Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
      • Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
      • Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)
      • Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.
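
      To make this concrete, here is a rough sketch of the proposed TopDocsCollector (illustrative only - names, signatures and the maxScore handling are not final; it assumes the Collector base class described above, with collect and setNextReader left abstract):

        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.search.TopDocs;
        import org.apache.lucene.util.PriorityQueue;

        // Sketch only - not the final API.
        public abstract class TopDocsCollector /* extends Collector */ {

          protected final PriorityQueue pq; // the best hits seen so far
          protected int totalHits;          // how many hits were collected

          protected TopDocsCollector(PriorityQueue pq) {
            this.pq = pq;
          }

          public int getTotalHits() {
            return totalHits;
          }

          // Convenience for paging: returns only the requested slice and
          // allocates a ScoreDoc[] of the requested size only.
          public TopDocs topDocs(int start, int howMany) {
            int size = Math.min(howMany, pq.size() - start);
            if (size <= 0) {
              return new TopDocs(totalHits, new ScoreDoc[0], Float.NaN);
            }
            // The PQ pops the least relevant hit first: discard everything
            // below the requested slice, then fill the slice back to front.
            int toDiscard = pq.size() - start - size;
            for (int i = 0; i < toDiscard; i++) {
              pq.pop();
            }
            ScoreDoc[] results = new ScoreDoc[size];
            for (int i = size - 1; i >= 0; i--) {
              results[i] = (ScoreDoc) pq.pop();
            }
            // How to handle maxScore is still open; NaN is a placeholder.
            return new TopDocs(totalHits, results, Float.NaN);
          }

          public TopDocs topDocs() {
            return topDocs(0, pq.size());
          }

          // collect(doc) and setNextReader(...) remain unimplemented here.
        }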

      Additionally, the following proposal was made w.r.t. decoupling score from collect():

      • Change collect to accept only a doc Id (un-based, i.e., relative to the current reader).
      • Introduce a setScorer(Scorer) method.
      • If during collect the implementation needs the score, it can call scorer.score() (see the sketch after this list).
        If we do this, then we need to review all places in the code where collect(doc, score) is called, and assess whether a Scorer can be passed. Also, this raises a few questions:
      • What if during collect() Scorer is null? (i.e., not set) - is it even possible?
      • I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
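
      In code, the decoupled contract might look roughly like this (a sketch under the proposal above; method names are provisional):

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.search.Scorer;

        // Provisional shape of the decoupled Collector API (sketch only).
        public abstract class Collector {

          // Called before hits are collected, so an implementation can ask
          // for the score lazily instead of receiving it on every collect().
          public abstract void setScorer(Scorer scorer) throws IOException;

          // Called once per matching document; doc is relative to the
          // current reader - add the docBase from setNextReader to get an
          // absolute doc id.
          public abstract void collect(int doc) throws IOException;

          // Called when the search advances to the next (sub-)reader.
          public abstract void setNextReader(IndexReader reader, int docBase)
              throws IOException;
        }

      An implementation that needs the score would save the Scorer in setScorer and call scorer.score() inside collect().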

      Open issues:

      • The name for Collector
      • TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
      • Decoupling score from collect().

      I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method).
      There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

      1. LUCENE-1575.9.patch
        234 kB
        Shai Erera
      2. LUCENE-1575.9.patch
        226 kB
        Shai Erera
      3. LUCENE-1575.patch
        215 kB
        Michael McCandless
      4. LUCENE-1575.patch
        213 kB
        Michael McCandless
      5. LUCENE-1575.8.patch
        224 kB
        Shai Erera
      6. PerfTest.java
        2 kB
        Michael McCandless
      7. LUCENE-1575.patch
        200 kB
        Michael McCandless
      8. LUCENE-1575.7.patch
        194 kB
        Shai Erera
      9. sortCollate5.py
        1 kB
        Michael McCandless
      10. sortBench5.py
        6 kB
        Michael McCandless
      11. LUCENE-1575.patch
        163 kB
        Michael McCandless
      12. LUCENE-1575.6.patch
        161 kB
        Shai Erera
      13. LUCENE-1575.5.patch
        149 kB
        Shai Erera
      14. LUCENE-1575.4.patch
        140 kB
        Shai Erera
      15. LUCENE-1575.3.patch
        133 kB
        Shai Erera
      16. LUCENE-1575.2.patch
        132 kB
        Shai Erera
      17. LUCENE-1575.1.patch
        132 kB
        Shai Erera
      18. LUCENE-1575.patch
        132 kB
        Shai Erera

        Activity

        Michael McCandless added a comment -

        Looks good! Thanks Shai. Some responses:

        Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.

        This turns deprecated HitCollector into a Collector? Seems like it
        should be package private?

        Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany)

        This is deprecated, so we shouldn't add topDocs(start, howMany)? I
        think just switch it back to extending the deprecated TopDocCollector
        (like it does in 2.4)?

        What if during collect() Scorer is null? (i.e., not set) - is it even possible?

        I think Lucene should guarantee not to do that?

        I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

        Hmmmm good point. I would love to stop screening for 0 score in the
        core collectors (like Solr). Maybe we fix the core collectors to not
        screen by zero score, but we add a new "only keep positive scores"
        collector chain/wrapper class that does the filtering and then forwards
        collection to another collector? This way there's a migration path if
        somehow users are relying on this.
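
        Such a wrapper might look roughly like this (just a sketch, assuming the setScorer/collect Collector API proposed in the description; the class name is not final):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.Scorer;

          // Sketch: forwards only positively scored docs to another collector.
          public class PositiveScoresOnlyCollector extends Collector {

            private final Collector other; // the wrapped collector
            private Scorer scorer;

            public PositiveScoresOnlyCollector(Collector other) {
              this.other = other;
            }

            public void setScorer(Scorer scorer) throws IOException {
              this.scorer = scorer;
              other.setScorer(scorer);
            }

            public void collect(int doc) throws IOException {
              // The wrapped collector may ask for the score again for the
              // same doc; a score-caching scorer would avoid recomputing it.
              if (scorer.score() > 0.0f) {
                other.collect(doc);
              }
            }

            public void setNextReader(IndexReader reader, int docBase)
                throws IOException {
              other.setNextReader(reader, docBase);
            }
          }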

        And we should note this difference clearly in the javadocs for the new
        hierarchy.

        There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

        I think it's fine if it's the same issue, though doing it as 2 patches
        is going to make life difficult. I think a single patch covering
        changes to src/java, and one to src/test is OK, though I'd personally
        prefer just one patch overall.

        Shai Erera added a comment -

        This turns deprecated HitCollector into a Collector? Seems like it
        should be package private?

        Initially I wrote that, but then deleted it; I'll make the decision as I create the patch. If it is used only in IndexSearcher, it should be a private static final class in IndexSearcher, otherwise a package-private one. However, if it turns out we want to use it in other places too, where we deprecate the HitCollector methods, then it will be public.
        Anyway, it will be marked deprecated, and I intend to make it as 'invisible' as possible.

        This is deprecated, so we shouldn't add topDocs(start, howMany)? I
        think just switch it back to extending the deprecated TopDocCollector
        (like it does in 2.4)?

        That's a good idea.

        Hmmmm good point. I would love to stop screening for 0 score in the
        core collectors (like Solr). Maybe we fix the core collectors to not
        screen by zero score, but we add a new "only keep positive scores"
        collector chain/wrapper class that does the filtering and the forwards
        collection to another collector? This way there's a migration path if
        somehow users are relying on this.

        I can do that. Create a FilterZeroScoresCollector which wraps a Collector and passes forward only documents with score > 0. BTW, how can a document get a zero score?

        I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit?

        Marvin Humphrey added a comment -

        > BTW, how can a document get a zero score?

        Any number of ways, since Query and Scorer are extensible. How about a RandomScoreQuery that uses floor(rand(1.9))? Or say that you have a bitset of docs which should match and you use that to feed a scorer. What score should you assign? Why not 0? Why not -1? Should it matter?

        Michael McCandless added a comment -

        I thought to split patches to code and test since I believe the code patch can be ready sooner for review. The test patch will just fix test cases. If that matters so much, I can create a final patch in the end which contains all the changes for easier commit?

        OK that sounds great. The back-compat tests will also assert nothing broke.

        Anyway, it will be marked deprecated, and I have the intention to make it as 'invisible' as possible.

        OK.

        BTW, how can a document get a zero score?

        I've wondered the same thing. There was this thread recently:

        http://www.nabble.com/TopDocCollector-td22244245.html

        Shai Erera added a comment -

        After I posted the question of how a document can get a 0 score, I realized that it's possible through extensions of Similarity, for example. Thanks Marvin for clearing that up. I guess though that the Lucene core classes will not assign a <= 0 score to a document?

        Anyway, whether it's true or not, I think I agree with Mike saying we should remove this screening from the core collectors. If my application extends Lucene in a way that it can assign <= 0 scores to documents, and it has the intention of screening those documents, it should use the new FilterZeroScoresCollector (maybe call it OnlyPositiveScoresCollector?)

        I don't think that assigning <= 0 score to a document necessarily means it should be removed from the result set.

        However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on <=0 score documents? I mean, what if my application relies on that and extended Lucene in a way that it sometimes assigns 0 scores to documents? Now when I'll switch to 2.9, those documents won't be filtered. I will be able to use the new FilterZeroScoresCollector, but that'll require me to change my app's code.

        Maybe just do it for the new collectors (TopScoreDocCollector and TopFieldCollector)? I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine?

        Michael McCandless added a comment -

        However, Mike (and others) - isn't there a back-compatibility issue with changing the core collectors to not screen on <=0 score documents?

        Hmm right there is, because the search methods will use the new collectors.

        I need to change my app's code anyway if I want to use them, so as long as we document this fact in their javadocs, we should be fine?

        Actually there's no change to your code required (the search methods should use the new collectors). So we do have a back-compat difference.

        We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the "PositiveScoresOnlyCollector" wrapper? I think the vast majority of users are not relying on <= 0 scoring docs to be filtered out.

        Shai Erera added a comment -

        We could make the change (turn off filtering), but put a setter on IndexSearcher to have it insert the "PositiveScoresOnlyCollector" wrapper?

        Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right?

        I guess you are referring to the methods which don't take a collector as a parameter and instantiate a new TopScoreDocCollector internally? I tend to think that if someone uses those, it is just because they are simple, and I find it very hard to imagine that such a user relies on the filtering. So perhaps we can get away with just documenting the change in behavior?

        I think the vast majority of users are not relying on <= 0 scoring docs to be filtered out.

        I tend to agree. This has been around for quite some time. I checked my custom collectors, and they do the same check. I only now realize I just followed the code practice I saw in Lucene's code, never giving much thought to whether this can actually happen. I believe that if I had extended Lucene in a way such that it returns <= 0 scores, I'd be aware of that and probably wouldn't use the built-in collectors. I see no reason to filter <= 0 scored docs anyway, and if I wanted that, I'd probably write my own filtering collector ...

        I think that if we don't believe people rely on the <= 0 filtering, let's just document it. I'd hate to add a setter method to IndexSearcher, and a unit test, and check where else it should be added (i.e., in extending searcher classes) and introduce a new API which we might need to deprecate some day ...
        People who'll need that functionality can move to use the methods that accept a Collector, and pass in the PositiveScoresOnlyCollector. That way we also keep the 'fast and easy' search methods really simple, fast and easy.

        Is that acceptable?

        Michael McCandless added a comment -

        Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could then just pass in the PositiveScoresOnlyCollector to the search methods instead, right?

        OK, I agree. Let's add an entry to the top of CHANGES.txt that states this [minor] break in back compatibility, as well as the code fragment showing how to use that filter to get back to the pre-2.9 way?
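
        The fragment might be something like this (illustrative only - these are the names we've been discussing, not final API):

          // Pre-2.9: docs scoring <= 0 were silently dropped by the core
          // collectors. To keep that behavior in 2.9, wrap your collector:
          TopScoreDocCollector tsdc = new TopScoreDocCollector(10);
          searcher.search(query, new PositiveScoresOnlyCollector(tsdc));
          TopDocs results = tsdc.topDocs();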

        Shai Erera added a comment -

        Great!

        Shai Erera added a comment -

        BooleanScorer defines an internal package-private static final Collector class. Two questions:

        1. May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)
        2. May I change it to private static final? It is used only in BooleanScorer's newCollector() method.
          I think the two are safe because it's already package-private and there's no other Lucene code which uses it.

        BTW, we might wanna review the visibility of BooleanScorer's internal classes. They are all package-private, with some public methods, however they are used by BooleanScorer only ... But that's something for a different issue.

        Michael McCandless added a comment -

        May I change it to BooleanCollector? (the name conflicts with the Collector name we want to give to all base collectors)

        May I change it to private static final? It is used only in BooleanScorer's newCollector() method.

        I think these are fine.

        Michael McCandless added a comment -

        I think as part of this we should allow TopFieldCollector to NOT get the score of each hit? EG another boolean to the ctor?

        Shai Erera added a comment -

        I am not sure what you mean - score is used all over the place in collect() as well as other methods. updateBottom, for example, takes a score, updates bottom.score and then calls adjustTop(). Do you mean that if ignoreScore is true (in ctor), then setScorer should not save the Scorer and not call scorer.score()? If so, what should I do with all the methods that accept a score? Create another code path in TopFieldCollector which ignores the score?

        Also, what should the default value be? true (for ignoring scores)?

        Shai Erera added a comment -

        Ok, I now understand better where score is used in TopFieldCollector ... It is used in a number of places; the important ones are:

        1. Maintain maxScore for returning in TopFieldDocs.
        2. Passed to FieldComparator.compareBottom which takes a doc and score parameters. There is one comparator RelevanceComparator which makes use of the score passed, however that's part of the method signature.
        3. Passed to FieldComparator.copy - again used by RelevanceComparator only.
        4. Passed to updateBottom, which updates the score of the least element in the queue and then calls adjustTop().
        • Number 2, 3 and 4 can be resolved by adding a setScorer to FieldComparator (as empty implementation) which TopFieldCollector will call in each collect() call, passing the Scorer that was given to it in its setScorer.
        • Then, we override that in RelevanceComparator, saving the Scorer and using it whenever the score is needed. Of course we'll need to save the current score, so that we don't call score() too many times for the same document.
        • This eliminates the need to define on TopFieldCollector whether scores should be saved. The reason is that the Sort parameter may include a SortField.SCORE field, which will invoke the RelevanceComparator.

        The question is what to do with maxScore? It is needed for TopDocs / TopFieldDocs. It may also be important to know the maxScore of a query, even if you sort it by something which is not a score.

        Question is - if the steps above make sense, why should we do them at all?
        Now the score is computed and passed on to every FieldComparator we received in Sort. Cleaning the method signature means additional code overhead in RelevanceComparator. If we want to compute maxScore as well, it means the score will be computed twice, once in collect() and once in RelevanceComparator.

        We can solve the double score() computation by using an internal ScoreCacheScorer which keeps the score of the current document and returns it whenever score() is called, unless it's a new document and then it delegates the call to the wrapped Scorer. TopFieldCollector can instantiate it in setScorer.

        But this looks quite a lot for cleaning a method signature, don't you think? Of course if you can suggest how we somehow remove the maxScore computation, then it might be a good change, since only if SortField.SCORE is used, will the score be computed.
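
        For reference, here is roughly what I have in mind for ScoreCacheScorer (a sketch against the current Scorer API - doc()/next()/skipTo(); nothing final):

          import java.io.IOException;
          import org.apache.lucene.search.Explanation;
          import org.apache.lucene.search.Scorer;

          // Sketch: caches the current doc's score so repeated score() calls
          // for the same document don't recompute it.
          public class ScoreCacheScorer extends Scorer {

            private final Scorer in;
            private int curDoc = -1; // doc whose score is cached
            private float curScore;

            public ScoreCacheScorer(Scorer in) {
              super(in.getSimilarity());
              this.in = in;
            }

            public float score() throws IOException {
              int doc = in.doc();
              if (doc != curDoc) { // new document - delegate and cache
                curScore = in.score();
                curDoc = doc;
              }
              return curScore;
            }

            // Everything else delegates to the wrapped scorer.
            public int doc() { return in.doc(); }
            public boolean next() throws IOException { return in.next(); }
            public boolean skipTo(int target) throws IOException {
              return in.skipTo(target);
            }
            public Explanation explain(int doc) throws IOException {
              return in.explain(doc);
            }
          }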

        Michael McCandless added a comment -

        The question is what to do with maxScore? It is needed for TopDocs / TopFieldDocs.

        This is why I was thinking you'd have to tell TopFieldCollector whether or not it should track scores. Furthermore, even if SortField.SCORE is not in your SortFields, an app may still want the scores to be enrolled in the TopFieldDocs, for presentation.

        Turning off scoring in TopFieldCollector's ctor just means 1) TopFieldCollector won't track max score, and 2) TopFieldCollector will leave score at 0 in the returned ScoreDoc array.

        Number 2, 3 and 4 can be resolved by adding a setScorer to FieldComparator (as empty implementation) which TopFieldCollector will call in each collect() call, passing the Scorer that was given to it in its setScorer.

        +1

        It makes sense to push the same improvement (not always passing a score; instead, you ask the scorer for score if you need it) down into the FieldCollector API.

        We can solve the double score() computation by using an internal ScoreCacheScorer which keeps the score of the current document and returns it whenever score() is called, unless it's a new document and then it delegates the call to the wrapped Scorer. TopFieldCollector can instantiate it in setScorer.

        +1

        Shai Erera added a comment -

        Turning off scoring in TopFieldCollector's ctor just means 1) TopFieldCollector won't track max score, and 2) TopFieldCollector will leave score at 0 in the returned ScoreDoc array.

        Just to be clear - TopDocs as well as TopFieldDocs require a maxScore parameter in their ctor. So are you suggesting to pass something like Float.NaN as maxScore if scoring is turned off? Or introducing a new ctor which does not require maxScore, and defaults to Float.NaN? (or both?)

        Furthermore, even if SortField.SCORE is not in your SortFields, an app may still want the scores to be enrolled in the TopFieldDocs, for presentation.

        Right - we should separate between getting score out of the FieldComparator API and tracking scores in TFC. If I don't have SortField.SCORE in my list of sort fields, then scorer.score() will not be called at all from the FieldComparators layer.

        Tracking scores in TFC is what I'm having troubles with. Turning it off does not necessarily improve anything .. Might be, and might not. In setScorer() I'd still need to register Scorer for passing on to FieldComparator. In collect() I'd still need to check whether score tracking is on, and if so, call scorer.score() and track maxScore. Note that if ScoreCacheScorer is used, then calling scorer.score() in collect does not have too much overhead.

        Also, what will be the default from a Lucene perspective? true - i.e., always keep track of scores?

        Michael McCandless added a comment -

        Or introducing a new ctor which does not require maxScore, and defaults to Float.NaN?

        I think that's a good approach, though for TopDocs the ctor should be package private I think (only called from TopFieldDocs' ctor)? And the javadocs should clearly spell out that this could happen (so people don't get scared on seeing Float.NaN coming back).

        Turning it off does not necessarily improve anything .. Might be, and might not. In setScorer() I'd still need to register Scorer for passing on to FieldComparator. In collect() I'd still need to check whether score tracking is on, and if so, call scorer.score() and track maxScore. Note that if ScoreCacheScorer is used, then calling scorer.score() in collect does not have too much overhead.

        I think this is an improvement? (Scorer.score() will not have been called... that's the goal here).

        I guess we could also consider making a separate TopFieldCollector (NonScoringTopFieldCollector or some such), instead of sprinkling if statements all over the place.

        Also, what will be the default from a Lucene perspective? true - i.e., always keep track of scores?

        Good question... we have the freedom to choose. Perhaps default to off? But say clearly in the migration javadocs that you have to set that to true to get same behavior as TSDC?

        Shai Erera added a comment -

        ok I'll add another package-private ctor to TopDocs which does not get maxScore and defaults to NaN, as well as update the javadocs. No back-compat here since the only code that will use it is TFC, which is new.
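
        Something along these lines (sketch; only the relevant parts of TopDocs are shown):

          import org.apache.lucene.search.ScoreDoc;

          public class TopDocs {
            public int totalHits;
            public ScoreDoc[] scoreDocs;
            private float maxScore;

            public TopDocs(int totalHits, ScoreDoc[] scoreDocs, float maxScore) {
              this.totalHits = totalHits;
              this.scoreDocs = scoreDocs;
              this.maxScore = maxScore;
            }

            // New package-private ctor for collectors that don't track
            // scores; the javadocs will note that getMaxScore() returns NaN.
            TopDocs(int totalHits, ScoreDoc[] scoreDocs) {
              this(totalHits, scoreDocs, Float.NaN);
            }

            public float getMaxScore() {
              return maxScore;
            }
          }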

        I think this is an improvement? (Scorer.score() will not have been called... that's the goal here).

        Well, if we use ScoreCacheScorer, then this call is really fast, returning immediately and w/o computing the score.

        I guess we could also consider making a separate TopFieldCollector (NonScoringTopFieldCollector or some such), instead of sprinkling if statements all over the place.

        I always like such approaches. How's that sound:

        1. Create a base NSTFC, which has a protected score member, initialized to 0, and is exactly like the current TFC, only w/o maxScore.
        2. Have TFC extend NSTFC, override collect() and:
          1. Set super.score = scorer.score(). That is required for updateBottom which updates the score on the ScoreDoc in pq.
          2. Compute maxScore.
          3. Call super.collect().
          4. Override topDocs(start, howMany) to provide one with maxScore.

        Good question... we have the freedom to choose. Perhaps default to off? But say clearly in the migration javadocs that you have to set that to true to get same behavior as TSDC?

        So you suggest the methods on IndexSearcher today that take a Sort as parameter will default to NSTFC? As long as we document it, it's ok? Are all of these new?

        Michael McCandless added a comment -

        How's that sound:

        That sounds good! So to be consistent maybe we create ScoringTopFieldCollector and NonScoringTopFieldCollector?

        This means we don't need ScoreCacheScorer? (because ScoringTopFieldCollector will always grab the score). Though how do we change FieldComparator API so as to not pass score around? All comparators except RelevanceComparator don't use it.

        Well, if we use ScoreCacheScorer, then this call is really fast, returning immediately and w/o computing the score.

        I'm actually torn on how fast this will be: I think that will be an if statement that's hard for the CPU to predict, which is costly.

        So you suggest the methods on IndexSearcher today that take a Sort as parameter will default to NSTFC? As long as we document it, it's ok? Are all of these new?

        Hmmm... actually, no, I think those must continue to use NSTFC for the existing methods (to remain back compatible), but add a new search method that takes a boolean trackScore?

        Shai Erera added a comment -

        So to be consistent maybe we create ScoringTopFieldCollector and NonScoringTopFieldCollector?

        And have STFC extend NSTFC? I see no reason to create an abstract TopFieldCollector.

        This means we don't need ScoreCacheScorer? (because ScoringTopFieldCollector will always grab the score). Though how do we change FieldComparator API so as to not pass score around? All comparators except RelevanceComparator don't use it.

        I was actually thinking of that class for RelevanceComparator. So perhaps I can implement the logic inside RelevanceComparator? Although this sounds like a nice utility class, now that we have a setScorer on Collector - others may find it useful too.
        Remember that score-tracking is done for maxScore and ScoreDoc purposes (inside STFC). The score in the FieldComparator API is used only in RelevanceComparator, whether it's STFC or NSTFC.

        I think those must continue to use NSTFC for the existing methods (to remain back compatible)

        Did you mean continue to use STFC? The current behavior is that scoring is tracked, I think.

        add a new search method that takes a boolean trackScore?

        I actually prefer not to expose any more methods. IndexSearcher already has plenty of them. Instead, one can use the very generic, simple and useful method search(Query, Collector) and pass in a NSTFC instance. Otherwise we'll end up adding many search() methods to IndexSearcher, if we continue with that approach going forward.

        Michael McCandless added a comment -

        And have STFC extend NSTFC? I see no reason to create an abstract TopFieldCollector.

        Yes.

        Although this sounds like a nice utility class, now that we have a setScorer on Collector - others may find it useful too.

        OK I agree, it would be useful to have for general usage (eg chaining collectors).

        But what is the plan now for the FieldComparator API? We no longer pass score all around, but expose access to scorer, which only RelevanceComparator (in core) will use?

        Did you mean continue to use STFC? The current behavior is that scoring is tracked, I think.

        Sorry, yes, STFC. Beginning to lose mind...

        I actually prefer not to expose any more methods. IndexSearcher already has plenty of them. Instead, one can use the very generic, simple and useful method search(Query, Collector) and pass in a NSTFC instance. Otherwise we'll end up adding many search() methods to IndexSearcher, if we continue with that approach going forward.

        I think that'd be OK... the only thing that bothers me is I think the natural default when sorting by field is to not gather the score. Ie I don't want someone evaluating Lucene in the future to say our field sort is too slow when they didn't realize they had to go use this advanced API that turns off scoring.

        What if we add a new method, and deprecate the old one? This way come 3.0 we will not have added any methods, and then when sorting by field you see that you have to choose with or without scores.

        Shai Erera added a comment -

        But what is the plan now for the FieldComparator API? We no longer pass score all around, but expose access to scorer, which only RelevanceComparator (in core) will use?

        Yes. FieldComparator will have a default empty setScorer() method, which will be overridden by RelevanceComparator. In TopFieldCollector (forget the final name now) setScorer() method we set the Scorer on all FieldComparators. During collect(), only RelevanceComparator, if it exists, will compute the score.
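
        Roughly like this (sketch; only the score-related parts are shown):

          import java.io.IOException;
          import org.apache.lucene.search.Scorer;

          public abstract class FieldComparator {

            // Default is a no-op - most comparators never look at the score.
            public void setScorer(Scorer scorer) {
            }

            // compare/compareBottom/copy/etc. omitted from this sketch.
          }

          class RelevanceComparator extends FieldComparator {

            private Scorer scorer;

            // Only the relevance comparator keeps the scorer and asks it for
            // the score on demand (ideally via a score-caching wrapper).
            public void setScorer(Scorer scorer) {
              this.scorer = scorer;
            }

            float score() throws IOException {
              return scorer.score();
            }
          }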

        What if we add a new method, and deprecate the old one?

        The current methods are:

        • Searchable - search(Weight, Filter, int, Sort)
        • Searcher - search(Query, Filter, int, Sort)
        • IndexSearcher - search(Weight, Filter, int, Sort, boolean /* fillFields */)

        Deprecate all three, and add the same but take another boolean as a parameter? I have two comments regarding that:

        1. The current methods need to call the new ones with trackScores = true since that's the current behavior.
        2. When we are left with only the new versions, I'm afraid those methods will not look 'simple fast' to a user - I now have to decide whether I want to track scores or not, something I haven't given much thought to before. I kind of like the current signature, but I understand your concern regarding defaults.

        BTW, Searchable is an interface, so we cannot add it there. Searcher is an abstract class and we cannot add the method to it with default implementation (as I believe the other search methods will call the new one with default=true). So it only leaves IndexSearcher as an option. But then what if someone uses MultiSearcher? ParallelMultiSearcher? etc.

        Is it possible to deprecate a method, documenting that its runtime behavior will change in 3.0 and then in 3.0 change to not track scores?

        If we're touching TopFieldCollector in such a way, I'd like to propose the following refactoring. It stems from the current complex implementation of collect(), which checks on every call whether we have just one comparator or several, and from the fact that we're talking about having two versions, with and without score tracking:

        • Keep TopFieldCollector as an abstract class with a static create() factory method and an abstract updateBottom() method. It will still implement topDocs(start, howMany). (A sketch of the factory shape follows at the end of this comment.)
        • Have the factory method create one of 4 instances, all of which extend TFC but are private internal classes that do not concern the user:
          1. OneComparatorNonScoringTopFieldCollector - assumes just one FieldComparator exists as well as scoring should not be tracked.
          2. OneComparatorScoringTopFieldCollector - assumes just one FieldComparator exists as well as scoring should be tracked.
          3. MultiComparatorNonScoringTopFieldCollector - assumes more than one FieldComparator exists, as well as scoring should not be tracked.
          4. MultiComparatorScoringTopFieldCollector - assumes more than one FieldComparator exists, as well as scoring should be tracked.

        The advantages are:

        • We simplify the API - the user is only aware of TopFieldCollector, and instead of calling a ctor it calls a static create method, which takes all the arguments as the ctor takes.
        • We are free to create whatever instance is the right and most optimized one given the input parameters. The user does not care what the instance is called. Hence the long names - they are internal anyway.
        • The code is much cleaner and easy to understand. It also does not need to check if we have just one comparator or more in every call to collect.
        • We don't need to add a protected score to a NonScoring collector (read above about code readability) just because a Scoring one extends it and will make use of it.

        Since TopFieldCollector is new, we have the freedom to do it right w/o deprecating anything. I think it's a much cleaner design. It is orthogonal to the discussion we're having regarding the search methods and parameters. They will use the create() factory method instead of creating a collector, passing whatever arguments they have. So let's not confuse the two.

        The patch for this issue is ready. As soon as we agree on how to proceed with TFC, I'll add the changes and submit the patch.
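
        Here is the factory shape I have in mind (skeleton only - the four subclasses are collapsed to empty stubs; each real one would implement collect()/updateBottom() without per-hit branching):

          import org.apache.lucene.search.Sort;
          import org.apache.lucene.search.SortField;

          // Skeleton of the proposed factory; names and signatures not final.
          public abstract class TopFieldCollector {

            public static TopFieldCollector create(Sort sort, int numHits,
                boolean fillFields, boolean trackScores) {
              SortField[] fields = sort.getSort();
              if (fields.length == 1) {
                return trackScores
                    ? new OneComparatorScoringTopFieldCollector()
                    : new OneComparatorNonScoringTopFieldCollector();
              }
              return trackScores
                  ? new MultiComparatorScoringTopFieldCollector()
                  : new MultiComparatorNonScoringTopFieldCollector();
            }

            // Empty stubs; the real classes specialize the hot path.
            private static final class OneComparatorNonScoringTopFieldCollector
                extends TopFieldCollector {}
            private static final class OneComparatorScoringTopFieldCollector
                extends TopFieldCollector {}
            private static final class MultiComparatorNonScoringTopFieldCollector
                extends TopFieldCollector {}
            private static final class MultiComparatorScoringTopFieldCollector
                extends TopFieldCollector {}
          }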

        Michael McCandless added a comment -

        If we're touching TopFieldCollector in such a way, I'd like to propose the following refactoring

        +1. That looks great.

        When we are left with only the new versions, I'm afraid those methods will not look 'simple fast' to a user

        I agree, which is why I'd like in 3.0 for the default to be "don't score when sorting by fields".

        Is it possible to deprecate a method, documenting that its runtime behavior will change in 3.0 and then in 3.0 change to not track scores?

        I think this may in fact be our best option: don't deprecate the method, but document that in 3.0 this method will no longer do scoring. There is a precedent here: in 3.0, IndexReader.open is going to return readOnly readers by default (vs read/write today). We have also done similar fixes within a minor release, eg fixes to StandardAnalyzer. I think there are other things we should do (eg, StopFilter should enable position increment by default, which it doesn't today – LUCENE-1258). If we do this approach, on committing this issue you should open a new one w/ fix version 3.0 to switch up the default.

        I think, with 3.0, if we clearly document in CHANGES, as well as on the particular APIs, the changes to Lucene's defaults, that's sufficient?

        Shai Erera added a comment -

        If we do this approach, on committing this issue you should open a new one w/ fix version 3.0 to switch up the default.

        Ok then let's do that. I'll add a TODO to these methods that it should be changed in 3.0, and also open another issue. (The TODO is in case I forget to open the issue).

        I'll also add documentation to the CHANGES file as well as the API.

        Ok, I think the patch should be ready soon then. Just need to complete the refactoring to TopFieldCollector.

        Shai Erera added a comment -

        When I was about to make the changes to FieldComparator (add setScorer and call scorer.score() when necessary), I noticed that scorer.score() declares that it throws IOException, while the FieldComparator methods don't. There are two ways to handle it:

        1. catch the IOException and ignore it, assuming a 0 score. This is if we think no Scorer will actually throw an IOException.
        2. Change FieldComparator APIs to declare throwing IOE, which will give us flexibility in the future. Since the Lucene code throws IOE in many places, I don't think it's a problem.

        I'm in favor of (2) (I had to add IOE to Collector.collect() for that reason).
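        For illustration, a minimal sketch of option (2) (signatures assumed from this thread, not the committed source; assumes org.apache.lucene.search.Scorer is in scope):

        import java.io.IOException;

        public abstract class Collector {
          // Implementations that need scores keep the Scorer around.
          public abstract void setScorer(Scorer scorer) throws IOException;

          // Declares IOException so an implementation may call scorer.score(),
          // which itself declares IOException.
          public abstract void collect(int doc) throws IOException;
        }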

        Michael McCandless added a comment -

        I like 2 as well. I find reserving that freedom to be very helpful, as it prevents a real hassle later on...

        Shai Erera added a comment -

        Eventually I decided to include just one patch file (instead of code and test) since it was simpler after all. Please be sure to review the following:

        1. Collector class and documentation.
        2. New TopDocsCollector class.
        3. TopFieldCollector refactoring.
        4. Methods deprecation.
        5. New TestTopDocsCollector as well as test cases in TestSort.
        Uwe Schindler added a comment -

        I just wonder why HitCollectorWrapper implements:

        public void collect(int doc, float score) {
            collector.collect(doc + base, score);
        }
        

        This method is not required by the Collector abstract class and is never called.

        Shai Erera added a comment -

        oops leftovers from when it extended MultiReaderHitCollector (now called Collector)
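        For illustration, once the leftover is dropped the wrapper presumably reduces to something like this (a sketch assuming the field names quoted above, not the committed source):

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;

        final class HitCollectorWrapper extends Collector {
          private final HitCollector collector;
          private int base = 0;
          private Scorer scorer;

          HitCollectorWrapper(HitCollector collector) {
            this.collector = collector;
          }

          public void setNextReader(IndexReader reader, int docBase) {
            base = docBase; // re-base per-segment doc ids to the global id space
          }

          public void setScorer(Scorer scorer) {
            this.scorer = scorer;
          }

          public void collect(int doc) throws IOException {
            // the old HitCollector API expects (global doc id, score)
            collector.collect(doc + base, scorer.score());
          }
        }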

        Michael McCandless added a comment -

        Shai, it looks like you "svn copy"'d MultiReaderHitCollector.java --> Collector.java? It causes "patch" to be confused when applying the patch. The simple workaround is to pre-copy that file yourself, manually, before appying the patch.

        Uwe Schindler added a comment -

        oops leftovers from when it extended MultiReaderHitCollector (now called Collector)

        This is why we really should move to Java 1.5 soon and its @Override annotation...
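        To illustrate Uwe's point with a hypothetical fragment (not runnable by design - the annotation turns the stale method into a compile error instead of letting it linger silently):

        class SomeWrapper extends Collector {
          @Override // javac error: collect(int, float) no longer overrides anything
          public void collect(int doc, float score) {
            // leftover from the HitCollector days
          }
          // remaining abstract methods omitted in this fragment
        }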

        Shai Erera added a comment -

        I did not do any "svn copy", just used Eclipse refactoring to change the name of the class to Collector. I did not understand from your comment, though, whether I should do it differently and post another patch, or whether that was a hint at how someone can still apply the patch?

        Uwe Schindler added a comment -

        JavaDoc errors:

        • The Collector javadoc example still contains the score in its collect() method.
        • The javadocs of ParallelMultiSearcher's new Collector-based method
          public void search(Weight weight, Filter filter, final Collector collector)

          still mention HitCollector.

        This is what I found out when reading the new generated Javadocs.

        Shai Erera added a comment -

        Thanks Mike. I ran the javadocs task and found other mentions of MultiReaderHitCollector, and fixed some more javadocs. BTW, the javadoc Ant task outputs many errors on missing files/names, but that's something for another issue.

        Andrzej Bialecki added a comment -

        I'm late to this discussion, so I may have missed something. Is there any provision in the new API for the early termination of hit collection? Currently the (time | count)-limited collectors that are used in Nutch and Solr have to throw RuntimeException to break the loop. It would be much more elegant if the new Collector.collect() had a way to signal the caller that it should stop looping without incurring the cost of throwing an Exception. E.g. by returning a boolean, or setting a flag in the caller.
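        For illustration, the boolean-returning alternative would put the check in the hot loop, roughly like this (a hypothetical sketch against the 2.4-era Scorer iteration API; not what was adopted):

        while (scorer.next()) {
          if (!collector.collect(scorer.doc())) {
            break; // the collector asked to stop; no exception thrown
          }
        }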

        Michael McCandless added a comment -

        Thanks Mike.

        That was Uwe

        Michael McCandless added a comment -

        Looks great! Thanks Shai.

        My biggest question/issue is how TopDocsCollector now (still?)
        requires that the PQ you give it is sorting primarily by score (eg
        getMaxScore() assumes the max is in the PQ; topDocs() uses
        results[0]'s score when start == 0). We've sort of come full circle
        (a "hidden assumption" that you are sorting by score was what started
        the whole thread in the beginning)

        We get away with that with TopFieldCollector because that overrides
        all the score-related processing.

        I'm not sure how to cleanly fix this... maybe TopDocsCollector should
        make maxScore() abstract? Or... it's as if we need the
        scoring/non-scoring bifurcation up higher (moved out of
        TopFieldCollector to above TopDocsCollector)?

        EG say I provide my own PQ that's sorting by date (by loading date
        from some external source, say); I may or may not care for score.

        Maybe we make a ScoringWrapperCollector that grabs score in its own
        collect, then calls collect() on its child?

        A few other small things:

        • In the private final static classes inside TopFieldCollector, you
          can make the members final too (eg reverseMul, comparator), in
          case it helps compiler.
        • Can you add a paragraph @ top of CHANGES stating the pending
          default swap in 3.0 (ie the same note you added to
          IndexSearcher.search). Add a new "Changes in backwards
          compatibility policy" section at the very top (look at how 2.4.0
          release did it). And can you give explicit code fragment showing
          how to get back to the old way (ie show that you must pass in
          "true" for trackDocScores).
        Michael McCandless added a comment -

        It would be much more elegant if the new Collector.collect() had a way to signal the caller that it should stop looping without incurring the cost of throwing an Exception. E.g. by returning a boolean, or setting a flag in the caller.

        I agree: we should work into this new collection API a way to stop early.

        But I'm nervous about the cost of checking a returned boolean on every collect call vs the cost of throwing/catching an exception. Adding the boolean check slows down every single collect() call (by just a bit, but bits really count here), even for collectors that never stop early (the majority of apps today). Throwing an exception adds a clear cost when you actually throw & catch it, but presumably that cost is proportionally tiny because you only throw it on searches that have been running for a long time, anyway.

        We could upgrade the exception to a checked exception; then we'd need to add "throws XXX" to the search methods in 3.0 (perhaps wrapping as RuntimeException until 3.0). But then I also wonder if the checked exception logic would add instructions in the collect() path.

        Uwe Schindler added a comment -

        I think a checked Exception to stop collecting would be the best. The "cost" of the exception is very minimal (it is only thrown once in the collector and caught somewhere at top level). So where would the costly part be? Microseconds per search?

        Shai Erera added a comment -

        Sorry – thanks Uwe

        Shai Erera added a comment -

        I think, a checked Exception to stop collecting would be the best.

        I actually think a RuntimeException is better for the following reasons:

        • The effects of a checked exception added to collect() will touch not only IndexSearcher, but also some Scorer implementations.
        • Such an exception currently exists only in TimeLimitedCollector, which is instantiated not by Lucene code but by my application code. Therefore, reading its javadocs, I know I should wrap searcher.search with a try-catch. The rest of the Lucene users, on the other hand, will not need to do that since they don't use it.
        • Going forward, someone may come up with another limiting Collector implementation, for example "I want to stop collecting and searching as soon as I hit 10 documents" - what would we do then - add another checked exception?

        My point is - since such Collectors are (at least for now) instantiated by the search application and not by the Lucene code itself, a RuntimeException is as good as a checked exception, only it doesn't require any changes to the method signatures.
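        The pattern described above, sketched against the existing TimeLimitedCollector (its TimeExceededException already extends RuntimeException, so only code that opts into the limiting collector needs the try-catch):

        try {
          searcher.search(query, timeLimitedCollector);
        } catch (TimeLimitedCollector.TimeExceededException e) {
          // the search stopped early; use whatever the wrapped collector gathered so far
        }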

        Shai Erera added a comment -

        Maybe we make a ScoringWrapperCollector that grabs score in its own collect, then calls collect() on its child?

        How about this:

        1. I remove maxScore() completely from TopDocsCollector.
        2. I define newTopDocs to return a TopDocs w/o a maxScore (like we said, defaulting to Float.NaN). Also change the method signature to receive the start as well as the ScoreDoc[] collected so far (see the sketch after this list).
        3. TopScoreDocCollector will override it, fetching maxScore and returning a TopDocs.
          • TopFieldCollector will do the same, but returning TopFieldDocs.
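        A sketch of the proposed shape (hypothetical signatures; the scoring subclasses override newTopDocs, everything else stays in the base class):

        // In TopDocsCollector - no max score is tracked here:
        protected TopDocs newTopDocs(ScoreDoc[] results, int start) {
          return new TopDocs(totalHits, results, Float.NaN);
        }

        // In TopScoreDocCollector (TopFieldCollector does the same but returns TopFieldDocs):
        protected TopDocs newTopDocs(ScoreDoc[] results, int start) {
          // when start == 0 the first result holds the best score
          float maxScore = (start == 0 && results.length > 0) ? results[0].score : Float.NaN;
          return new TopDocs(totalHits, results, maxScore);
        }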

        Can you add a paragraph @ top of CHANGES stating the pending default swap in 3.0

        Done.

        In the private final static classes inside TopFieldCollector, you can make the members final too

        Done.

        Shai Erera added a comment -

        Includes the latest comments from Mike.

        Michael McCandless added a comment -

        How about this:

        OK that looks good!

        Though we now don't have a single shared ancestor class that can give
        you a maxScore() when sorting by score or when sorting by fields. Not
        sure how big a loss that is though...

        More comments:

        • Should we implement a default
          TopDocsCollector.collect/setScorer/setNextReader? Ie,
          TopDocsCollector is supposed to be the "you provide the PQ and we
          do the rest" collector, most closely matching TopDocCollector.
          Or.... maybe we don't, because we don't know if we should compute
          the score for you, you may want to put something other than
          ScoreDoc into the queue, etc. Who knows what your class
          considers "top".
        • What happened to OnlyPositiveScoresFilter? (I don't see it).
        • It's a bit of a trap on calling either topDocs method in
          TopDocsCollector (or its subclasses) that it has popped everything
          from your PQ as a side-effect. This is technically a pre-existing
          issue, but the new bracketed version makes the trap more "trappy".
          For example, I can't call it N times once for each page – I have
          to re-run the search to get another page's worth of results. Can
          you update javadocs eg something like "NOTE: you cannot call this
          method more than once". (A sketch of the trap follows this list.)
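        For example (hypothetical usage, illustrating the side-effect):

        TopDocs page1 = collector.topDocs(0, 10);  // pops the first ten hits off the PQ
        TopDocs page2 = collector.topDocs(10, 10); // the queue is already drained - this cannot return the second page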
        Michael McCandless added a comment -

        I did not understand though from your comment if I should do it differently and post another patch, or is that a hint to how someone can still apply the patch?

        The patch command gets confused because it sees set of diffs against what looks to be a pre-existing Collector.java, but of course I have no Collector.java locally. "svn diff" did this because it knows you had renamed MRHC --> C.

        I think for now it's fine if those of us that need to apply the patch simply first copy MultiReaderHitCollector.java --> Collector.java.

        This is yet another example of how "patch" needs to be better integrated with svn: there needs to be a mirror "svn patch" to "svn diff" that's able to properly carry over all svn changes, perhaps do a 3way merge, etc.

        Shai Erera added a comment -

        Though we now don't have a single shared ancestor class that can give you a maxScore() when sorting by score or when sorting by fields

        I don't think it's a big issue. We have just two extensions to TopDocsCollector, each tracking maxScore differently. Perhaps later on if more extensions are created, or a demand comes from users, we add a ScoringTopDocsCollector class?

        Should we implement a default TopDocsCollector.collect/setScorer/setNextReader?

        I actually think that TopDocsCollector gives you exactly what I had in mind. When I started it, there was only collect(doc, score) method, and I wanted to have a base class that will implement getTotalHits and topDocs for you, while you only need to provide an implementation to collect(). Now collect has really become setNextReader, setScorer and collect. Meaning, you know what you want to do, TDC just takes care of creating the TopDocs for you, and you can even override topDocs(start, howMany) if you want to return a different TopDocs instance.

        I think now it's clean and simple.
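        For illustration, the extension pattern described above, sketched with assumed names (the protected pq/totalHits members and the PriorityQueue ctor are per this thread, not verified source):

        final class ExternalDateCollector extends TopDocsCollector {
          private int docBase;

          ExternalDateCollector(PriorityQueue pq) {
            super(pq); // the base class supplies getTotalHits() and topDocs()
          }

          public void setScorer(Scorer scorer) {
            // ordering comes from an external date source; scores are never consulted
          }

          public void setNextReader(IndexReader reader, int docBase) {
            this.docBase = docBase;
          }

          public void collect(int doc) {
            totalHits++;
            // what counts as "top" is entirely up to the supplied PriorityQueue
            pq.insert(new ScoreDoc(docBase + doc, Float.NaN));
          }
        }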

        What happened to OnlyPositiveScoresFilter? (I don't see it).

        Well ... you don't see it because I forgot to implement it.
        Funny thing - all the tests pass without it, meaning that all this time we were filtering <= 0 scored documents but never really testing it ... I guess that's because we never really believed it would happen?

        Anyway, I'll add such a class and a suitable test class (make sure it fails w/o using that wrapper first).

        Can you update javadocs eg something like "NOTE: you cannot call this method more than once" .

        That has always been the case. Previously, if you called topDocs(), everything was popped out of the queue for you. I understand what you say though: since we have the topDocs(start, howMany) API, nothing prevents a user from calling topDocs(0, 10) and then topDocs(10, 10), but the second call will fail. However, I don't really think that's how people will use it ... and if they do, then perhaps they should just call topDocs() and do whatever they need on these ranges using the TopDocs object?
        I'll add that to the documentation as well.

        Shai Erera added a comment -

        Adds:

        • PositiveScoresOnlyCollector and TestPositiveScoresOnlyCollector (a sketch of the wrapper follows this list).
        • Relevant comments in CHANGES
        • Comments to TopDocsCollector.topDocs(start, howMany) and topDocs(start)
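        A minimal sketch of the wrapper's idea (assuming the Collector API in this patch; the committed class may differ, e.g. in avoiding computing the score twice):

        final class PositiveScoresOnlyCollector extends Collector {
          private final Collector inner;
          private Scorer scorer;

          PositiveScoresOnlyCollector(Collector inner) { this.inner = inner; }

          public void setScorer(Scorer scorer) throws IOException {
            this.scorer = scorer;
            inner.setScorer(scorer);
          }

          public void setNextReader(IndexReader reader, int docBase) throws IOException {
            inner.setNextReader(reader, docBase);
          }

          public void collect(int doc) throws IOException {
            if (scorer.score() > 0.0f) {
              inner.collect(doc); // drop docs scored at or below zero
            }
          }
        }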
        Michael McCandless added a comment -

        Funny thing - all the tests pass without it, meaning all that time, we were filtering <= 0 scored documents, but never really tested it

        Hmmm!

        I actually think that TopDocsCollector gives you exactly what I had in mind.

        OK let's keep the current approach.

        Perhaps later on if more extensions are created, or a demand comes from users, we add a ScoringTopDocsCollector class?

        OK.

        I think patch looks good – I don't have any more comments now.

        Shai Erera added a comment -

        I think patch looks good - I don't have any more comments now.

        Great !

        Will you be committing it (perhaps after we give some more time for people to review it)?

        Michael McCandless added a comment -

        Will you be committing it (perhaps after we give some more time for people to review it)?

        Yes... but let's let it age some so others can review. Plus, I find coming back and looking at something again after just a couple of days results in a fresh perspective.

        Michael McCandless added a comment -

        Shai, some contrib tests fail to compile (still using MultiReaderHitCollector), eg:

        [javac] /lucene/lucene.collector/contrib/miscellaneous/src/test/org/apache/lucene/index/TestFieldNormModifier.java:26: cannot find symbol
        [javac] symbol  : class MultiReaderHitCollector
        [javac] location: package org.apache.lucene.search
        [javac] import org.apache.lucene.search.MultiReaderHitCollector;
        
        Shai Erera added a comment -

        Perhaps you can help me here - I tried to run the test Ant task and all tests passed. Then it got to building contrib's db and failed on sleepycat with an "unsupported major.minor version 49.0" error. Does this require Java 1.5? Should I run the task with a 1.5 JDK? The task stopped at this point with a BUILD FAILED message and did not continue to other modules.

        I don't have the entire contrib in my build path, only a handful of modules which were not affected by these changes. Anyway, I'll take a look at it tomorrow and submit another patch.

        Michael McCandless added a comment -

        Does this require Java 1.5? Should I run the task with a 1.5 JDK?

        Yes, likely contrib/db (and others) require 1.5.

        You should be able to just do "ant test-contrib" in the toplevel dir.

        Shai Erera added a comment -

        Fixed TestFieldNormModifier and TestLengthNormModifier.
        All tests pass now (including contrib)

        Michael McCandless added a comment -

        Could you also run "ant test-tag" (which tests JAR-drop-in back-compatibility)? EG I'm getting this compilation error:

        [javac] /lucene/src/lucene.collection/tags/lucene_2_4_back_compat_tests_20090320/src/test/org/apache/lucene/search/TestTimeLimitedCollector.java:136: incompatible types
        [javac] found   : org.apache.lucene.search.TimeLimitedCollector
        [javac] required: org.apache.lucene.search.HitCollector
        [javac]     return res;
        [javac]            ^
        
        Show
        Michael McCandless added a comment - Could you also run "ant test-tag" (which tests JAR-drop-in back-compatibility)? EG I'm getting this compilation error: [javac] /lucene/src/lucene.collection/tags/lucene_2_4_back_compat_tests_20090320/src/test/org/apache/lucene/search/TestTimeLimitedCollector.java:136: incompatible types [javac] found : org.apache.lucene.search.TimeLimitedCollector [javac] required: org.apache.lucene.search.HitCollector [javac] return res; [javac] ^
        Hide
        Shai Erera added a comment -

        I thought that ant test runs all tests. Thanks for the education.

        The reason is that TimeLimitedCollector now extends Collector, which does not extend HitCollector. Therefore the method attempts to return an invalid type. I'm not sure how to fix it, because I cannot change the 2.4 test code, since Collector is not there.

        So the only reasonable solution I see here is to:

        • Change TimeLimitedCollector to extend HitCollector, and document that in 3.0 it will change to extend Collector and that in the meantime you can use HitCollectorWrapper if you want.
        • Comment out all the Collector-related methods, including the new ctor, with a TODO to reinstate them in 3.0.
        • Fix TestTimeLimitedCollector to wrap it with a HCW, as well as to use only HitCollector as the wrapped collector.

        Other solutions which I don't like are:

        • deprecate TLC and create a new one NewTimeLimitedCollector - I hate the name
        • Have Collector extend HitCollector - I hate to even consider that.

        What do you think?

        Michael McCandless added a comment -

        I thought that ant test runs all tests. Thanks for the education.

        Probably, it should. I'll raise this on java-dev.

        Change TimeLimitedCollector to extend HitCollector, document that in 3.0 it will change to extend Collector and that in the meantime use HitCollectorWrapper if you want.

        I think I like this solution best (though this is very much a lesser of all evils situation).

        <lament>
        Ahh the contortions we must go through because of Lucene's success. Marvin over on Lucy can happily make major changes without batting an eye. The sad reality is that the ongoing growth rate of a thing is inversely proportional to its popularity.
        </lament>

        Shai Erera added a comment -

        Changes:

        1. TimeLimitedCollector, TestTimeLimitedCollector and CHANGES.
        2. I also fixed a bug in TestTermScorer, which was discovered by the test-tag task, had existed since LUCENE-1483, and propagated into HitCollectorWrapper as well: docBase was set to -1 by default, relying on setNextReader to be called. However, if it's not called (as in TestTermScorer, or if someone called Scorer.score(Collector)), all document Ids are shifted backwards by 1. The test had a bug which asserted on the unshifted doc Id, and after I fixed the Ids to shift, it failed. Anyway, the test now works correctly, as does HCW (see the sketch after this list).
        3. I checked all other Collector implementations and changed the default base to 0, except in some test cases where -1 had a meaning.

        All tests (contrib, core and tags) pass.
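        The docBase fix in item 2, sketched (field name per the HitCollectorWrapper snippet quoted earlier in this issue; a fragment, not the committed diff):

        private int base = 0; // was -1, which shifted every doc id down by one whenever setNextReader() was never invoked

        public void setNextReader(IndexReader reader, int docBase) {
          base = docBase;
        }

        public void collect(int doc) throws IOException {
          collector.collect(doc + base, scorer.score()); // correct even if setNextReader() is never called
        }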

        Michael McCandless added a comment -

        Super, all tests pass for me too...

        Shai Erera added a comment -

        I've been thinking about TimeLimitedCollector and the revert to extend HitCollector I had to do in the last patch - the main reason was that I couldn't find a better name and did not want to deprecate it. But then, I thought that perhaps the current name is not so good, and we can change it? Syntactically, it is not a 'limited' collector, but more of a 'limiting' collector (I think, not being a native English speaker I may be wrong).
        Alternative names I've been thinking about are TimeKeeperCollector, TimeLimitingCollector, TimingOutCollector.
        The advantage is that we deprecate the current one and have a clear back-compat support, instead of changing it in 3.0 to extend Collector. If you agree with any of these names I can create a new class, deprecate the current one, change the tests back to use the new version (and remove all those comments about the changes in 3.0). What do you think?

        Michael McCandless added a comment -

        If you agree with any of these names I can create a new class, deprecate the current one, change the tests back to use the new version (and remove all those comments about the changes in 3.0). What do you think?

        I like this approach. I like "TimeLimitingCollector", or maybe "TimeoutCollector"?

        Michael McCandless added a comment -

        OK, I attached a new patch with some minor changes:

        • Beefed up javadocs in Collector.java; fixed other javadocs
          warnings. Tweaked CHANGES.txt.
        • Renamed PositiveOnlyScoresCollector -->
          PositiveScoresOnlyCollector

        And also came across these questions/issues:

        • TopFieldCollector's updateBottom & add methods take score, and are
          passed score from the non-scoring collectors, but shouldn't?
        • TermScorer need not override score(HitCollector hc) (super does
          the same thing).
        • The changes to TermScorer make me a bit nervous. EG, the new
          InternalScorer: will it hurt performance? Also this part:
          +        // Set the Scorer doc and score before calling collect in case it will be
          +        // used in collect()
          +        s.d = doc;
          +        s.score = score;
          +        c.collect(doc);                      // collect score
          

          is spooky: I don't like how we worry that one may call scorer.doc() (I
          don't like the ambiguity in the API – we both pass doc and fear you
          may call scorer.doc()). Not sure how to resolve it.

        • Hmm – we added a new abstract method to
          src/java/org/apache/lucene/search/Searcher.java (that accepts
          Collector). Should that method be concrete (and throw UOE), for
          back compat?
        • We've also added a method to the "Searchable" interface, which is
          a break in back-compat. But my feeling is we should allow this
          break (but Shai can you add another Note at the top of
          CHANGES.txt, calling this out?).
        Shai Erera added a comment -

        I like "TimeLimitingCollector", or maybe "TimeoutCollector"?

        I like TimeLimitingCollector better, as I think the name makes the class more self explanatory.

        TopFieldCollector's updateBottom & add methods take score, and are passed score from the non-scoring collectors, but shouldn't?

        At the end of the day, even the non-scoring collectors store a score in ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring ones, they always pass Float.NaN without ever calling scorer.score(). That's the cleanest way I've found I can make the changes to that class, w/o duplicating implementation all over the place. Notice that the scoring versions extend the non-scoring, and just add score computation, which resulted in a very clean implementation.

        TermScorer need not override score(HitCollector hc) (super does the same thing).

        Agreed.

        The changes to TermScorer make me a bit nervous.

        Since we pass Scorer to Collector, I thought we cannot really rely on anyone never calling scorer.doc() or getSimilarity() - they are in the API. Since doc() is abstract, I had to implement it, and just thought that returning the current doc is better than, say, -1. There are some alternatives I see to resolve it:

        1. Create an abstract ScoringOnlyScorer which extends Scorer and implements all methods to throw UOE (also as final), besides score() which it will define abstract. We then define a ScoringOnlyScorerWrapper which takes a Scorer and delegates the score() calls. We use SOSW in places where we can't extend SOS. Where we can, we just extend it directly and implement score(), like in the InternalScorer case.
        2. Create a new class which implements just score() (I've yet to come up with a good name since Scorer is already taken) and create a wrapper which takes a Scorer and delegates the score() calls to it. Then Collector will use that new class, and we're sure that only score() can be called.

        The last two comments are completely an oversight on my side. I'm not so sure about your proposal though. If we add to Searcher a concrete impl which throws UOE, how would that work in 3.0? How would anyone who extends Searcher know that it has to extend this method? Maybe do it now, and document that in 3.0 it will become abstract again?
        About Searchable, I wonder how many implement Searchable rather than extend IndexSearcher. Perhaps instead of making any changes in back-compat and adding documentation to CHANGES, I'll just comment out this method with a TODO to reinstate it in 3.0?

        Michael McCandless added a comment -

        I like TimeLimitingCollector better, as I think the name makes the class more self explanatory.

        OK let's go with that!

        At the end of the day, even the non-scoring collectors store a score in ScoreDoc, which is Float.NaN. So they should pass a score. Unlike the scoring ones, they always pass Float.NaN without ever calling scorer.score(). That's the cleanest way I've found I can make the changes to that class, w/o duplicating implementation all over the place. Notice that the scoring versions extend the non-scoring, and just add score computation, which resulted in a very clean implementation.

        OK... let's stick with this approach for now. Since the impl is
        locked down (ctor for TopFieldCollector is private) we can freely
        switch up this API in the future without breaking back compat, if we
        want to optimize not passing/copying around the unused score.

        Can't the scoring collector impls in TopFieldCollector be final?

        Since we pass Sorer to Collector, I thought we cannot really rely on anyone not calling scorer.doc() or getSimilarity ever

        Maybe instead make InternalScorer non-static, and then doc() can
        return the doc from the TermScorer instance, instead of having to copy
        "s.d = doc" each time? score can do a similar thing.

        Actually, hang on: if I'm using a Collector that doesn't need the
        score, TermScorer is still computing it? We don't want that, right?
        Can we simply pass "this" to setScorer(...)?
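        A sketch of that simplification (iteration methods per the 2.4-era Scorer API; not the committed code):

        public void score(Collector c) throws IOException {
          c.setScorer(this); // no InternalScorer, no per-doc copying of doc and score
          while (next()) {
            c.collect(doc()); // the score is computed only if the collector asks for it
          }
        }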

        If we add to Searcher a concrete impl which throws UOE, how would that work in 3.0? How would anyone who extends Searcher know that it has to extend this method? Maybe do it now, and document that in 3.0 it will become abstract again?

        OK let's do that?

        About Searchable, I wonder how many do implement Searchable, rather than extend IndexSearcher. Perhaps instead of making any changes in back-compat and add documentation to CHANGES I'll just comment out this method with a TODO to re-enstate in 3.0?

        OK.

        Make sure at the end of all of this, you open a new issue, marked as
        fix version 3.0, that has all the "and then on 3.0 we do XYZ"s from
        this.

        Shai Erera added a comment -

        Can't the scoring collector impls in TopFieldCollector be final?

        They can, but they are private so they cannot be extended anyway. I can do that, but does it really matter?

        We don't want that right? Can we simply pass "this" to setScorer(...)?

        That's what I wanted to do, but then noticed that the TermScorer.score() method is a bit different. However, now that I look at it again, I wonder if they really are different. The difference is that in score(), it does at the end

        return raw * Similarity.decodeNorm(norms[doc]);
        

        and in score(Collector, int) it does

        float[] normDecoder = Similarity.getNormDecoder();
        ...
        score *= normDecoder[norms[doc] & 0xFF];
        

        Looking in Similarity.decodeNorm, it does exactly what's done in score(Collector, int). So I guess this code has been duplicated for no good reason? Please validate what I wrote and if you also agree, I can change the entire method (score(Collector, int)) to not compute any score and call c.setScorer(this). That will solve it.
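
        The equivalence is easy to check directly; a small sketch using the
        real Similarity methods (the norm byte and raw score are made up):

          import org.apache.lucene.search.Similarity;

          public class NormDecodeCheck {
            public static void main(String[] args) {
              byte norm = (byte) 0x7B; // arbitrary encoded norm
              float raw = 2.0f;        // arbitrary raw score

              // score() path:
              float a = raw * Similarity.decodeNorm(norm);

              // score(Collector, int) path: same 256-entry table, same index byte.
              float[] normDecoder = Similarity.getNormDecoder();
              float b = raw * normDecoder[norm & 0xFF];

              System.out.println(a + " == " + b); // prints the same value twice
            }
          }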

        So are you ok with passing Scorer to Collector, instead of just a class with a single score() method?

        I will open an issue w/ a fix version 3.0 and take care of all those TODOs. Should the issue also get rid of the deprecated methods? Or will we have a general issue in 3.0 that removes all deprecated methods?

        Shai Erera added a comment -

        BTW Mike - I think the accidental changes to Searchable and Searcher could have been easily detected by test-tags if we had classes in the back-compat tag which implemented interfaces / extended abstract classes with empty implementations. These are not really junit tests, but if someone would have changed an interface or abstract class, then attempting to compile the test package against the trunk would fail.

        It is not so relevant now, since the next release is 2.9 followed by 3.0, and back-compat will completely go away in 3.0, but perhaps post 3.0? Also, it will prevent us from making changes to back-compat like we wanted to in this issue, but perhaps that's good?
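
        A minimal sketch of the idea, assuming the Collector method set being
        discussed in this issue (the class name is illustrative):

          import java.io.IOException;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.search.Collector;
          import org.apache.lucene.search.Scorer;

          // Never run as a test; it exists only to be compiled against trunk.
          // If a non-deprecated abstract method is added (or an interface
          // changes), this stops compiling and the back-compat break is caught.
          class CompileOnlyCollector extends Collector {
            public void setScorer(Scorer scorer) throws IOException {}
            public void collect(int doc) throws IOException {}
            public void setNextReader(IndexReader reader, int docBase) throws IOException {}
          }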

        Michael McCandless added a comment -

        I ran a "first do no harm" perf test, comparing trunk with this patch:

        query    sort   hits    qps     qpsnew  pctg
        147      score  6953    3631.1  3641.8  0.3%
        147      title  6953    2916.7  2255.6  -22.7%
        147      doc    6953    3251.2  2676.8  -17.7%
        text     score  157101  208.1   202.1   -2.9%
        text     title  157101  96.7    84.8    -12.3%
        text     doc    157101  174.0   115.2   -33.8%
        1        score  565452  58.0    56.4    -2.8%
        1        title  565452  44.5    34.1    -23.4%
        1        doc    565452  49.2    32.8    -33.3%
        1 OR 2   score  784928  14.1    13.7    -2.8%
        1 OR 2   title  784928  12.5    11.5    -8.0%
        1 OR 2   doc    784928  13.0    11.9    -8.5%
        1 AND 2  score  333153  15.5    15.5    0.0%
        1 AND 2  title  333153  14.8    13.7    -7.4%
        1 AND 2  doc    333153  15.2    14.2    -6.6%

        Looks like:

        • Sort by relevance got maybe a tad slower (~3%)
        • Sort by field is now quite a bit slower (23-33% on term query '1')

        This was on a full wikipedia index, with 14 segments, Sun java
        1.6.0_07 on OS X Mac Pro quad core, on Intel X25M 160 GB
        SSD.

        I think we need to iterate some to try to get some performance back.

        Michael McCandless added a comment -

        > Can't the scoring collector impls in TopFieldCollector be final?

        They can, but they are private so they cannot be extended anyway. I can do that, but does it really matter?

        I was thinking in case it ekes out some performance.

        So I guess this code has been duplicated for no good reason?

        Duplicated for performance I think.

        I can change the entire method (score(Collector, int)) to not compute any score and call c.setScorer(this). That will solve it.

        I think we should try this?

        So are you ok with passing Scorer to Collector, instead of just a class with a single score() method?

        Good question... I'm not sure. It would be cleaner to expose only
        score() (and I think we could add methods over time), but then we'd
        be creating a new instance per segment per search, which would only
        slow things down.

        I will open an issue w/ a fix version 3.0 and take care of all those TODOs. Should the issue also get rid of the deprecated methods? Or will we have a general issue in 3.0 that removes all deprecated methods?

        You don't need to enumerate the deprecated methods to get rid of - we
        won't forget those ones. It's these other "special" tasks that may
        slip through the cracks.

        Michael McCandless added a comment -

        BTW Mike - I think the accidental changes to Searchable and Searcher could have been easily detected by test-tags if we had classes in the back-compat tag which implemented interfaces / extended abstract classes with empty implementations. These are not really junit tests, but if someone would have changed an interface or abstract class, then attempting to compile the test package against the trunk would fail.

        I think that's a great idea! Every interface/abstract class should
        have a "just compile me" subclass in the tests.

        It is not so relevant now, since the next release is 2.9 following by a 3.0 and back-compat will completely go away in 3.0, but perhaps post 3.0?

        It is relevant because neither Searchable nor Searcher are deprecated
        (yet)? Ie during development of 2.9 and of 3.0 we have to ensure we
        don't break back compat of non-deprecated APIs.

        So maybe fold this in on the next patch iteration?

        Also, it will prevent us from making changes to back-compat like we wanted to in this issue, but perhaps it's good?

        It's good, because it'd raise the issue right away, vs. us catching it
        (or not) by staring at the code

        Jason Rutherglen added a comment -

        Something related to time limiting collectors that we may want to
        solve (maybe not in this patch) is passing the time limit down to
        the sub-scorers. At the hit collector level, the sub-scorers of a
        multi clause query could be busy exceeding the time limit before
        returning the first doc hit?
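
        Purely to illustrate the idea (this is not part of the patch; the
        wrapper below assumes the current Scorer API with next()/skipTo(), and
        the timeout exception is a placeholder):

          import java.io.IOException;
          import org.apache.lucene.search.Explanation;
          import org.apache.lucene.search.Scorer;

          class TimeLimitedScorer extends Scorer {
            private final Scorer in;
            private final long deadline; // absolute wall-clock deadline

            TimeLimitedScorer(Scorer in, long timeoutMillis) {
              super(in.getSimilarity());
              this.in = in;
              this.deadline = System.currentTimeMillis() + timeoutMillis;
            }

            private void checkTime() {
              // Checked while iterating, so a multi clause query can time out
              // even before its first hit ever reaches the collector.
              if (System.currentTimeMillis() > deadline)
                throw new RuntimeException("timeout exceeded");
            }

            public boolean next() throws IOException { checkTime(); return in.next(); }
            public boolean skipTo(int target) throws IOException { checkTime(); return in.skipTo(target); }
            public int doc() { return in.doc(); }
            public float score() throws IOException { return in.score(); }
            public Explanation explain(int doc) throws IOException { return in.explain(doc); }
          }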

        Shai Erera added a comment -

        How do I run such a test? Is there an algorithm for that in the benchmark
        package?

        I compared the new TSDC to the trunk's version; here is what the new
        code does differently ('-' means a negative change, '+' means a
        positive change, '|' means neither/undetermined):

        • adds one collector.setScorer() call to each query.
        • The scorer.score() call in collect() was just moved from whoever called
          collect() to inside collect(), so I don't think there's a difference. (|)
        • Does not check if score > 0.0f in each collect
        • implements the new topDocs() method. Previously, it just implemented
          topDocs() which returned everything. Now, topDocs() calls topDocs(0,
          pq.size()), which verifies parameters and such - since that's executed once
          at the end of the search, I doubt that it has any major effect on the
          results.

        BTW, as I scanned through the code I noticed that previously TSDC returned
        maxScore = Float.NEGATIVE_INFINITY in case there were 0 results to the
        query, and now it returns Float.NaN. I'm not sure however if this breaks
        anything, since maxScore is probably used (if at all) for normalization of
        scores, and in case there are 0 results you don't really have anything to
        normalize? However I'm not sure ...

        Regarding TopFieldDocs I am quite surprised. I assume the test uses the
        OneComparatorScoringCollector, which means scores are computed:

        • It has the same issue as in TSDC regarding topDocs(). So I think it should
          be changed here as well, however I doubt that's the cause for the
          performance hit.
        • It computes the score and then does super.collect(), which adds a method
          call
        • It doesn't check if the score is > 0
        • It calls comparator.setScorer, which is ignored in all comparators besides
          RelevanceComparator. Not sure if it has any performance effects (|)

        The rest of the code in collect() is exactly the same.

        Can it be that super.collect() has such an effect? When I think about the
        results of TSDC (-3%) vs. TFC (-28% on avg.), I think it might be, since
        setScorer() is called once before the series of collect() calls, whereas
        super.collect() is called for every document. Your index is large (>2M
        documents, right?) and I don't know how many results there are for each
        query; if they are in the range of 100Ks, then that could be the explanation.

        Mike - in case it's faster for you to run it, can you try to run the test
        again with a change in the code which inlines super.collect() into
        OneComparatorScoringCollector and compare the results again? I will run it
        also after you tell me which algorithm you used, but only tomorrow morning,
        so if you get to do it before then, that'd be great.

        I doubt that the change in topDocs() affects the query time that much, since
        it's called at the end of the search, and doing 4-5 'if' statements is
        really not that expensive (I mean once per the entire search) compared to
        ScoreDoc[] array allocation, fetching stored fields from the index, etc. So
        I'd hate to implement all 3 topDocs() in each of the TopDocsCollector
        extensions unless it proves to be a problem.

        Shai Erera added a comment -

        BTW, I can change FieldValueHitQueue like I changed TopFieldCollector by
        introducing a factory create() method which will return a
        OneComparatorFieldValueHitQueue and MultiComparatorsFieldValueHitQueue.
        Today, FVHQ.lessThan checks the numComparators in each call, which is
        redundant.
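
        A sketch of the shape this would take (simplified: the real
        FieldValueHitQueue works with FieldComparator and ScoreDoc entries;
        only the one-time branching in create() is the point here):

          public abstract class SimpleFieldValueHitQueue {
            final int numComparators;

            private SimpleFieldValueHitQueue(int numComparators) {
              this.numComparators = numComparators;
            }

            // The branch happens once, at construction time...
            public static SimpleFieldValueHitQueue create(int numComparators) {
              return numComparators == 1
                  ? new OneComparatorQueue()
                  : new MultiComparatorsQueue(numComparators);
            }

            protected abstract boolean lessThan(Object hitA, Object hitB);

            // ...so neither lessThan() re-checks numComparators per comparison.
            private static final class OneComparatorQueue extends SimpleFieldValueHitQueue {
              OneComparatorQueue() { super(1); }
              protected boolean lessThan(Object a, Object b) { /* single-comparator compare */ return false; }
            }

            private static final class MultiComparatorsQueue extends SimpleFieldValueHitQueue {
              MultiComparatorsQueue(int n) { super(n); }
              protected boolean lessThan(Object a, Object b) { /* loop over all comparators */ return false; }
            }
          }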

        Also the class isn't final and I'm not sure if we want to change it.

        What do you think?

        Shai Erera added a comment -

        Mike - about your comments on the new Searcher and Searchable search(Weight, Filter, Collector). I think the best (if not only) option currently is to remove them from the interface (comment them out, I mean) with a TODO to add them back in 3.0.

        I tried to just comment it out in Searchable, and use an empty impl in Searcher which throws UOE. However, that caused a problem in MultiSearcher, ParallelMultiSearcher and RemoteSearchable:

        • RemoteSearchable impls Searchable - commenting out the new impl method with a TODO for 3.0 will be fine, but
        • MS and PMS accept Searchable in their ctor and use them in search(W, F, C) which they extend from Searcher (they MUST extend it because Searcher's impl throws UOE). However they call searchable.search, which accepts just a HC, and we can't wrap a Collector with a HC.

        Previously, MS and PMS implemented the HC version by always wrapping with a MRHC. I think we should just pass in the given HC to the Searchable.search method, and rely on its wrapping by a HCW later on. In 3.0 we'll delete it entirely and use the Collector implementation.

        Do you see any other way?

        Shai Erera added a comment -

        Oh wait .. I should have tried to implement it before I sent the last email.

        After I tried to implement it, I noticed that commenting out the
        Searcher.search(W, F, C) method creates a chain of compilation errors, since
        all the HC methods now call the Collector one (actually all the search()
        methods call that one eventually). So I'm not sure it's a good idea to
        comment it out. I thought to comment out all the Collector search methods in
        Searcher, but then it resulted in compilation errors in other places.

        How problematic is this break in back-compat, given it will be documented in
        CHANGES?

        • Have search(W, F, C) on Searchable? I don't think it will have such a
          great impact as I don't believe too many actually implement Searchable.
        • Have search(W, F, C) on Searcher as abstract? I know you offered, Mike, to
          create an empty impl which throws UOE, but I'm not sure what's worse: having
          a compilation error or UOE at runtime (which can happen at the customer's).
          After all, all the search methods call this one eventually, and if you did
          extend Searcher (rather than IndexSearcher), you'll get UOE on every search
          (see the sketch after this list).
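
        For concreteness, the non-abstract variant under discussion would look
        roughly like this in Searcher (a sketch only; the message text is made up):

          // Concrete for 2.9 so existing subclasses keep compiling, with the
          // intent (called out in CHANGES) that it becomes abstract again in 3.0.
          public void search(Weight weight, Filter filter, Collector collector)
              throws IOException {
            throw new UnsupportedOperationException(
                "search(Weight, Filter, Collector) must be overridden; it becomes abstract in 3.0");
          }
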
        Michael McCandless added a comment -

        I'm attaching the Python scripts I use to run the tests. You also need this small mod:

        Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
        ===================================================================
        --- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java	(revision 761709)
        +++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java	(working copy)
        @@ -63,6 +63,9 @@
             super(runData);
           }
         
        +  // nocommit
        +  static boolean first = true;
        +
           public int doLogic() throws Exception {
             int res = 0;
             boolean closeReader = false;
        @@ -101,6 +104,11 @@
                 } else {
                   hits = searcher.search(q, numHits);
                 }
        +        // nocommit
        +        if (first) {
        +          System.out.println("NUMHITS=" + hits.totalHits);
        +          first = false;
        +        }
                 //System.out.println("q=" + q + ":" + hits.totalHits + " total hits"); 
         
                 if (withTraverse()) {
        

        All the python scripts do is write an alg, run it, gather the results, and collate in the end. You run sortBench5.py once on trunk and once in a checkout with this patch, each time in the contrib/benchmark directory. It saves a pickle file (results.pk) which sortCollate5.py then loads (you'll have to edit the hardwired paths in sortCollate5.py).

        Michael McCandless added a comment -

        adds one collector.setScorer() call to each query.

        Actually, it's one setScorer() call per segment (my index had 14
        segments). But I can't imagine this slows things down.

        The scorer.score() call in collect() was just moved from whoever called collect() to inside collect(), so I don't think there's a difference. (|)

        I would think so too, though maybe the compiler can no longer inline
        the call. We may need to peek at the asm.

        Does not check if score > 0.0f in each collect

        Yes, should be a speedup.

        Implements the new topDocs() method. Previously, it just implemented
        topDocs() which returned everything. Now, topDocs() calls topDocs(0,
        pq.size()), which verifies parameters and such - since that's executed once
        at the end of the search, I doubt that it has any major effect on the
        results.

        I agree.

        BTW, as I scanned through the code I noticed that previously TSDC returned
        maxScore = Float.NEGATIVE_INFINITY in case there were 0 results to the
        query, and now it returns Float.NaN. I'm not sure however if this breaks
        anything, since maxScore is probably used (if at all) for normalization of
        scores, and in case there are 0 results you don't really have anything to
        normalize? However I'm not sure ...

        Hmm, good question. Float.NaN seems more correct, so let's stick with
        that?
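
        One reason the change could matter to callers that normalize by
        maxScore, shown with plain float arithmetic (illustrative only):

          public class MaxScoreDemo {
            public static void main(String[] args) {
              float score = 5.0f;
              System.out.println(score / Float.NEGATIVE_INFINITY); // -0.0: old sentinel divides to a number
              System.out.println(score / Float.NaN);               // NaN: new sentinel propagates
              System.out.println(Float.NaN > 0.0f);                // false: NaN fails every comparison
            }
          }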

        Regarding TopFieldDocs I am quite surprised. I assume the test uses the
        OneComparatorScoringCollector, which means scores are computed:

        Yes, and we need to test multiple field sorting too.

        It has the same issue as in TSDC regarding topDocs(). So I think it should
        be changed here as well, however I doubt that's the cause for the
        performance hit.

        What should be changed here?

        It computes the score and then does super.collect(), which adds a method call

        Yes.

        It doesn't check if the score is > 0

        Should be faster.

        It calls comparator.setScorer, which is ignored in all comparators besides RelevanceComparator. Not sure if it has any performance effects (|)

        Shouldn't have an effect.

        The rest of the code in collect() is exactly the same.
        Can it be that super.collect() has such an effect? When I think about the
        results of TSDC (-3%) vs. TFC (-28% on avg.), I think it might be, since
        setScorer() is called once before the series of collect() calls, whereas
        super.collect() is called for every document. Your index is large (>2M
        documents, right?) and I don't know how many results there are for each
        query; if they are in the range of 100Ks, then that could be the explanation.

        The table shows the number of hits per query; query "1" is large.

        Mike - in case it's faster for you to run it, can you try to run the test
        again with a change in the code which inlines super.collect() into
        OneComparatorScoringCollector and compare the results again? I will run it
        also after you tell me which algorithm you used, but only tomorrow morning,
        so if you get to do it before then, that'd be great.

        I'll test this.

        I doubt that the change in topDocs() affects the query time that much, since
        it's called at the end of the search, and doing 4-5 'if' statements is
        really not that expensive (I mean once per the entire search) compared to
        ScoreDoc[] array allocation, fetching stored fields from the index, etc. So
        I'd hate to implement all 3 topDocs() in each of the TopDocsCollector
        extensions unless it proves to be a problem.

        I also cannot imagine this is a problem. It's the slowdown of big
        searches that spooks me (it's fine if fast searches pick up some added
        cost).

        Michael McCandless added a comment -

        BTW, I can change FieldValueHitQueue like I changed TopFieldCollector by
        introducing a factory create() method which will return a
        OneComparaterFieldValueHitQueue and MultiComparatorsFieldValueHitQueue.

        Today, FVHQ.lessThan checks the numComparators in each call, which is
        redundant.

        Seems good, unless the extra subclassing (and the added super.XXX() calls) is somehow the cause of our performance loss.

        Also the class isn't final and I'm not sure if we want to change it.

        Yes, let's make it final. We need to eke out whatever performance we can...

        Michael McCandless added a comment -

        How problematic is this break in back-compat, given it will be documented in
        CHANGES?

        • Have search(W, F, C) on Searchable? I don't think it will have such a
          great impact as I don't believe too many actually implement Searchable.
        • Have search(W, F, C) on Searcher as abstract? I know you offered, Mike, to
          create an empty impl which throws UOE, but I'm not sure what's worse: having
          a compilation error or UOE at runtime (which can happen at the customer's).
          After all, all the search methods call this one eventually, and if you did
          extend Searcher (rather than IndexSearcher), you'll get UOE on every search.

        OK let's add both and call it out in CHANGES.txt?

        Show
        Michael McCandless added a comment - How problematic is this break in back-compat, given it will be documented in CHANGES? Have search(W, F, C) on Searchable? I don't think it will have such a great impact as I don't believe too many actually implement Searchable. Have search(W, F, C) on Searcher as abstract? I know you offered, Mike, to create an empty impl which throws UOE, but I'm not sure what's worse: having a compilation error or UOE at runtime (which can happen at the customer's). After all, all the search methods call this one eventually, and if you did extend Searcher (rather than IndexSearcher), you'll get UOE on every search. OK let's add both and call it out in CHANGES.txt?
        Hide
        Shai Erera added a comment -

        There are no super.XXX calls. The two FVHQ implementations just implement
        lessThan according to whether it's the single comparator or the multi
        comparator case. This removes the check of numComparators == 1.

        Shai Erera added a comment -

        OK let's add both and call it out in CHANGES.txt?

        Great. So I'll leave them as they are in the latest patch and add a note to CHANGES.

        Yes let's make it final. We need to eek...

        This isn't necessary after all, since the class is now abstract, with a private ctor and two private final internal classes, which will be the concrete objects returned by create().

        Before submitting the next patch version, I'd like to verify if super.collect() in TFC is the cause of the perf. degradation. We should also perf. test sorting w/o score tracking and note if there is any improvement over trunk. I'm downloading the latest enwiki xml (20090306), so I hope the download, extraction, indexing and search tests will finish sometime tomorrow.

        Michael McCandless added a comment -

        There are no super.XXX calls. The two FVHQ implementations just implement
        lessThan according to whether it's a single comparator or muli case. This
        removes the check of numComparators == 1.

        Excellent!

        Before submitting the next patch version, I'd like to verify if super.collect() in TFC is the cause of the perf. degradation

        I'll run this & post back.

        Michael McCandless added a comment -

        Odd: inlining super.collect() into OCSC, and making OCSC final, did not alter the numbers much (I re-ran the trunk baseline to confirm it's close to the prior trunk baseline):

        query    sort   hits    qps     qpsnew  pctg
        147      score  6953    3635.8  3650.1  0.4%
        147      title  6953    2915.7  2297.6  -21.2%
        147      doc    6953    3265.6  2665.8  -18.4%
        text     score  157101  208.5   202.9   -2.7%
        text     title  157101  97.0    85.4    -12.0%
        text     doc    157101  174.3   125.0   -28.3%
        1        score  565452  58.2    56.6    -2.7%
        1        title  565452  44.6    34.6    -22.4%
        1        doc    565452  49.2    35.2    -28.5%
        1 OR 2   score  784928  14.1    13.7    -2.8%
        1 OR 2   title  784928  12.6    11.5    -8.7%
        1 OR 2   doc    784928  13.0    11.9    -8.5%
        1 AND 2  score  333153  15.6    15.5    -0.6%
        1 AND 2  title  333153  14.8    13.7    -7.4%
        1 AND 2  doc    333153  15.2    14.2    -6.6%

        Michael McCandless added a comment -

        We should also perf. test sorting w/o score tracking and note if there is any improvement over trunk.

        Let's wait a bit until we sort things out (eg, w/ current patch, TermScorer will still compute its score even if I don't need it).

        Michael McCandless added a comment -

        Shai can you post your latest patch, where TermScorer itself is passed down to the collector?

        Shai Erera added a comment -
        • Changed the TermScorer.score() method to not call Similarity.decodeNorm. If we can change Scorer.similarity to be protected, we can give up the getSimilarity() call in score(). Also changed TermScorer.score(Collector) to set 'this' as the collector's scorer (see the sketch after this list).
        • Deprecated TimeLimitedCollector, created new TimeLimitingCollector, renamed TestTimeLimitedCollector to TestTimeLimitingCollector and used the new TimeLimitingCollector.
        • Changed FVHQ to have a static create which returns One/MultiComparatorFieldValueHitQueue version.
        • Changed TopFieldCollector setNextReader versions to not call pq.size() but rather use numHits.
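
        A minimal sketch of the TermScorer.score(Collector) shape the first
        bullet describes (not the actual patch; the real method also does its
        own doc/freq buffering):

          public void score(Collector c) throws IOException {
            c.setScorer(this); // the collector calls score() back only if it needs it
            while (next()) {   // no eager score computation here any more
              c.collect(doc());
            }
          }
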
        Michael McCandless added a comment -

        OK thanks. Numbers w/ new patch:

        query    sort   hits    qps     qpsnew  pctg
        147      score  6953    3635.8  3704.1  1.9%
        147      title  6953    2915.7  2262.9  -22.4%
        147      doc    6953    3265.6  2655.1  -18.7%
        text     score  157101  208.5   199.9   -4.1%
        text     title  157101  97.0    87.1    -10.2%
        text     doc    157101  174.3   134.6   -22.8%
        1        score  565452  58.2    56.5    -2.9%
        1        title  565452  44.6    35.3    -20.9%
        1        doc    565452  49.2    38.0    -22.8%
        1 OR 2   score  784928  14.1    13.8    -2.1%
        1 OR 2   title  784928  12.6    11.6    -7.9%
        1 OR 2   doc    784928  13.0    11.9    -8.5%
        1 AND 2  score  333153  15.6    15.4    -1.3%
        1 AND 2  title  333153  14.8    13.7    -7.4%
        1 AND 2  doc    333153  15.2    14.2    -6.6%

        Michael McCandless added a comment -

        Attached patch; only differences are:

        • Under contrib/benchmark I made changes so you can specify non-scoring field sorting
        • Fixed the rename of TestTimeLimitedCollector --> Limiting to be patch-friendly

        OK I ran performance with score tracking disabled during field sorted search:

        query    sort   hits    qps     qpsnew  pctg
        147      title  6953    2915.7  4043.3  38.7%
        147      doc    6953    3265.6  4840.1  48.2%
        text     title  157101  97.0    128.0   32.0%
        text     doc    157101  174.3   273.2   56.7%
        1        title  565452  44.6    60.2    35.0%
        1        doc    565452  49.2    75.3    53.0%
        1 OR 2   title  784928  12.6    14.8    17.5%
        1 OR 2   doc    784928  13.0    15.2    16.9%
        1 AND 2  title  333153  14.8    17.9    20.9%
        1 AND 2  doc    333153  15.2    18.9    24.3%

        Very nice speedups! We just have to figure out why the score-tracking variant got slower...

        Shai Erera added a comment -

        Mike - I've been trying to see this performance hit, without indexing the full wikipedia content, but failed. What I did is:

        • Create an index with 10M documents with one term (all with the same term). The index had 51 segments (I deliberately aimed at a high number).
        • Run a sort-by-doc (in reverse order so that collect() will actually do some work) search using TermQuery (so that TermScorer will be used).
        • Measure the avg. time of 20 query executions (I deliberately omit the first execution because I've noticed that the first run sometimes takes much longer, even after I kill the JVM. I guess it has to do with OS caches).

        The times I measured for the trunk and the patched version are very much comparable:
        trunk-patched: (1) 1248.45 ms (2) 1255.45 ms
        trunk-orig: (1) 1314.1 ms (2) 1265.65 ms
        In both runs the patched trunk ran faster; neither showed any large differences.

        Here's the code:

        import java.io.File;

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.Field.Index;
        import org.apache.lucene.document.Field.Store;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriter.MaxFieldLength;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Sort;
        import org.apache.lucene.search.SortField;
        import org.apache.lucene.search.TermQuery;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.store.NativeFSLockFactory;

        public class PerfTest {

          public static void main(String[] args) throws Exception {
            File indexDir = new File(args[0]);
            Directory dir = new FSDirectory(indexDir, new NativeFSLockFactory(indexDir));

            // Build the index only on the first run: 10M single-term docs, many segments.
            if (!indexDir.exists() || indexDir.list().length == 0) {
              int numDocs = 10000000;
              IndexWriter writer = new IndexWriter(dir, null, MaxFieldLength.UNLIMITED);
              writer.setMergeFactor(50);
              writer.setMaxBufferedDocs(numDocs / 100);
              System.out.println("populating index");
              long time = System.currentTimeMillis();
              for (int i = 0; i < numDocs; i++) {
                Document doc = new Document();
                doc.add(new Field("c", "test", Store.NO, Index.NOT_ANALYZED));
                writer.addDocument(doc);
              }
              writer.close(false);
              System.out.println("time=" + (System.currentTimeMillis() - time));
            }

            System.out.println("searching");
            IndexSearcher searcher = new IndexSearcher(dir);
            System.out.println("numSegments=" + searcher.getIndexReader().getSequentialSubReaders().length);
            // Reverse sort by doc id, so collect() does real work for every hit.
            Sort sort = new Sort(new SortField(null, SortField.DOC, true));
            // Warm up once; the first run is dominated by OS cache effects.
            searcher.search(new TermQuery(new Term("c", "test")), null, 10, sort);
            long time = System.currentTimeMillis();
            double numQueries = 20;
            for (int i = 0; i < numQueries; i++) {
              searcher.search(new TermQuery(new Term("c", "test")), null, 10, sort);
            }
            time = System.currentTimeMillis() - time;
            System.out.println("avg. time=" + (time / numQueries) + " ms");
            searcher.close();
          }
        }

        Can you try running this code on your machine?

        Michael McCandless added a comment -

        OK I ran it like this:

        java -Xms1024M -Xmx1024M -Xbatch -server -cp build/classes/java:../lucene.collection PerfTest /lucene/fake
        

        Across these OS/JRE versions:

        OS                    JRE               Trunk    Patch    %tg change
        Linux                 1.6.0_10          1075 ms  1120 ms  -4.2%
        Linux                 1.5.0_08          1156 ms  1199 ms  -3.7%
        OS X                  1.6.0_07 (64bit)  774 ms   823 ms   -6.3%
        OS X                  1.5.0_16          987 ms   936 ms   5.2%
        Win Svr 2003 (64bit)  1.6.0_11 (64bit)  1209 ms  1308 ms  -8.2%
        Win Svr 2003 (64bit)  1.5.0_14 (64bit)  1559 ms  1564 ms  -0.3%

        Shai which OS/JRE did you run on?

        Except for the JRE 1.5 cases on OS X and Win Svr 2003, things seem
        generally slower with the patch, though I think the test is a little
        too synthetic (sorting by doc reversed is far worse than a "normal"
        fielded sort, where I think the queue would typically converge fairly
        quickly). Also, I think it's better to take the min rather than the
        avg across all the runs.
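
        A drop-in change to the timing loop in Shai's PerfTest above,
        following the take-the-min suggestion (sketch):

          long best = Long.MAX_VALUE;
          for (int i = 0; i < 20; i++) {
            long t0 = System.currentTimeMillis();
            searcher.search(new TermQuery(new Term("c", "test")), null, 10, sort);
            // The min over repeated runs is less sensitive to GC/OS-cache noise.
            best = Math.min(best, System.currentTimeMillis() - t0);
          }
          System.out.println("best time=" + best + " ms");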

        Shai Erera added a comment -

Shai, which OS/JRE did you run on?

        Windows XP, 1.5 (32 bit)

I chose the sort by doc since I was very curious about the major drop in performance your queries show. When you sort by doc, the queue converges after exactly 10 documents (assuming you ask for 10 results). After that, every document that's collected fails at the comparator level (the first line in the "if (queueFull)" code) and the collector literally does nothing. So all that's left are the calls to scorer.score().
I chose to sort by doc reversed so that every call actually does something.
I measured scorer.score() calls in OCSC and they were insignificant (~14 ms total on a single query run).

        Also, I think it's better to take min rather than avg across all the runs.

I don't know. Some runs showed 112 ms and some 256 ms ... Taking the minimum just measures the "when the OS cache is fully optimized for your run" case, which is why I prefer to take avg. Anyway, I don't think it matters since we're measuring the same way for both runs.

I reviewed the code in TopFieldCollector and it's almost identical to before the patch, besides the differences I noted above, which don't seem likely to have an effect. That's why I wanted a synthetic test, where we know exactly what it'll do.

BTW, this test shows a far smaller drop in performance compared to your previous sort-by-doc runs, and it collects ~20x more documents ...

        Michael McCandless added a comment -

> Shai, which OS/JRE did you run on?

        Windows XP, 1.5 (32 bit)

Did you run with -server, -Xbatch, -Xms/-Xmx set to the same value, etc?
The goal is to minimize any source of noise in the measurement.

        I also got mixed results on 1.5...

I chose the sort by doc since I was very curious about the major drop in performance your queries show. When you sort by doc, the queue converges after exactly 10 documents (assuming you ask for 10 results). After that, every document that's collected fails at the comparator level (the first line in the "if (queueFull)" code) and the collector literally does nothing. So all that's left are the calls to scorer.score().

        I agree, I think sort by doc is overly easy, but sort by doc reversed
        is overly hard. I think sort by title & docdate (in wikipedia) are
        reasonable. Let's just leave doc entirely out of future wikipedia
        tests.

        I measured scorer.score() calls in OCSC and they were insignificant (~14 ms total on a single query run).

        That's very odd, because I see big speedups when we don't score.

        > Also, I think it's better to take min rather than avg across all the runs.

I don't know. Some runs showed 112 ms and some 256 ms ... Taking the minimum just measures the "when the OS cache is fully optimized for your run" case, which is why I prefer to take avg. Anyway, I don't think it matters since we're measuring the same way for both runs.

If you really had such wide variance on your runs (I didn't), it's
even more important to use min, not avg. It means there's a lot of
noise in your measurement.

For this test we do want full OS caching, no page misses, no daemon
that wakes up and steals a few cycles, etc. This is why min, across
many iterations, is good: it makes the numbers less noisy and more
comparable. A random unlucky outlier won't increase the time.

        True, min is not a good way to measure what you'll typically see when
        running such searches in your app (the real world has lots of noise),
        but that's not what we're after here.
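
For illustration, a minimal best-of-N loop might look like this (a sketch; searcher, query and sort are assumed to be set up as in the test above):

long best = Long.MAX_VALUE;
for (int i = 0; i < 20; i++) {
  long t0 = System.nanoTime();
  searcher.search(query, null, 10, sort);
  long elapsed = System.nanoTime() - t0;
  if (elapsed < best) {
    best = elapsed; // keep only the fastest (least noisy) run
  }
}
System.out.println("best=" + (best / 1000000.0) + " ms");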

I reviewed the code in TopFieldCollector and it's almost identical to before the patch, besides the differences I noted above, which don't seem likely to have an effect. That's why I wanted a synthetic test, where we know exactly what it'll do.

        I think it may be the switch to an abstract Scorer – the JIT can no
        longer inline that code (maybe).

        Or maybe it's the switch to subclassing-not-if for the 4 impls of
        TopFieldCollector?

BTW, this test shows a far smaller drop in performance compared to your previous sort-by-doc runs, and it collects ~20x more documents ...

Yeah, but we changed a lot of things. I think sort by title and docdate
are more realistic tests. It is nice that your test is ~20X
larger... I wish we had a real standard dataset that large; real
queries, instead of the few specific ones we are testing, would be
better too.

        Michael McCandless added a comment -

        OK I modified your test (attached):

        • Make a random int field and sort on that
        • Switch to System.nanoTime to measure time
        • Print min time in addition to avg
• Gather a silly sum just to make sure the JRE can't optimize things away (see the sketch below)
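
The "silly sum" trick might look roughly like this (a sketch with illustrative variable names, not the attached test's actual code):

int numQueries = 20;
long sum = 0;
for (int i = 0; i < numQueries; i++) {
  TopDocs td = searcher.search(query, null, 10, sort);
  sum += td.totalHits; // consume the result so the JIT can't drop the search
}
System.out.println("ignore: " + sum); // printing the sum defeats dead-code elimination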

        Here are the results:

OS                     JRE                  Trunk    Patch    % change
OS X                   1.5.0_16             389 ms   334 ms   14.1%
OS X                   1.6.0_07 (64-bit)    312 ms   430 ms   -27.4%
Linux                  1.5.0_08             403 ms   337 ms   16.4%
Linux                  1.6.0_10             337 ms   303 ms   10.1%
Win Svr 2003 (64-bit)  1.5.0_14 (64-bit)    535 ms   727 ms   -36.6%
Win Svr 2003 (64-bit)  1.6.0_11 (64-bit)    477 ms   682 ms   -43.0%

        Now I'm even more baffled. The Win Svr 2003 times became especially
        awful... in fact it's as if 64-bit JREs don't like this change.

        Or we may simply be chasing Java ghosts at this point... though these
        are awfully big ghosts.

        Mark Miller added a comment -

Gave the bench a try here on 64-bit Linux with OpenJDK 6 and Sun Java 5.

I did one throwaway run with the new patch version, and then recorded the times for both - got a 30% drop with the new patch.

Reran both many times after and got approx the same times for both on every attempt, using both JREs mentioned above.

Dunno what that implies, but it's another data point.

        • Mark
        Shai Erera added a comment -

Reran both many times after and got approx the same times for both on every attempt, using both JREs mentioned above.

Just for verification - by "same time" did you mean that your successive runs show ~X ms for both patched and non-patched trunk, or that the successive runs show a ~30% drop with the patch?

        Mark Miller added a comment -

As in, the 30% drop appeared to be an aberration. Both patched and un-patched performed the same on subsequent runs. It appears there is no loss on my setup - or it's very small if there is.

        Shai Erera added a comment -

        Yes, that's what I saw too. I will run the test with both 1.5 and 1.6, 32/64 bit versions and post the results.

BTW, if you look at Mike's table above, it's a black and white thing: the 1.5 JREs really like this patch and the 1.6 JREs really hate it. Maybe we should not move to 1.6 then? (kidding)

        Shai Erera added a comment -

I wasn't able to run the test on a 64-bit JRE. Here are the results on 32-bit JREs:

OS        JRE                Trunk       Patch       % change
XP        IBM 1.5            573 ms      571 ms      0.34%
XP        1.6.0_07 (32-bit)  752 ms      804 ms      -6.4%
SRV 2003  IBM 1.5            530/469 ms  536/493 ms  1% / -4.86%
SRV 2003  1.6.0_07 (32-bit)  858 ms      699 ms      22.7%

I ran each twice, and only in the SRV-2003-1.5 case were there differences between the two runs. Also, it's important to notice that, unlike Mike's results, the SRV2003-JRE1.6 run showed a 22.7% improvement with the patched version. I re-ran the 2003 runs a couple of times and the results were consistent.

        Michael McCandless added a comment -

Mark and Shai, are you guys using the latest version of the bench (the one that sorts by the random int field)? Are you using the "best" time for your results? How are you launching the JRE?

BTW, if you look at Mike's table above, it's a black and white thing: the 1.5 JREs really like this patch and the 1.6 JREs really hate it. Maybe we should not move to 1.6 then?

        Actually, for my run on Linux, the patch was faster for both 1.5 & 1.6 JREs.

        Shai Erera added a comment -

Added JustCompileSearch, JustCompileSearchFunction and JustCompileSearchSpans that extend/implement all abstract classes/interfaces in o.a.l.s, o.a.l.s.s and o.a.l.s.f. Those are not unit tests per se; however, if anyone changes the interfaces/abstract classes in a way that breaks back-compat, we'll know right away. I think this is generally a good thing to have for Lucene overall; however, I only took care of the search.* packages in this patch.

        Shai Erera added a comment -

I'm using the latest version, which sorts by that random field (the output includes the prints of best, avg. and sum, so I'm sure of that). Also, the times I reported are the 'best' times. I launch the JRE like you posted, with these args: "-Xms1024M -Xmx1024M -Xbatch -server".

        I reran now, and the results are consistent.

        Mark Miller added a comment -

I just used the defaults for the cmd line - I can give it another go ensuring -server and more RAM. I used the latest perf code provided by Mike and the latest patch.

        I didn't look at the numbers too closely - my plan was to do a quick profile with each, but eyeballing runs with each over and over, they were approx the same (both best and avg), so I skipped the profiling.

        Michael McCandless added a comment -

        I ran 2 more JREs under linux:

OS     JRE            Trunk    Patch    % change
Linux  1.7.0 ea       333 ms   320 ms   3.9%
Linux  IBM JRE 1.5.0  401 ms   352 ms   12.2%
        Shai Erera added a comment -

So how do we proceed? It looks like we get inconsistent results, sometimes on the same OS and JRE, just on a different machine. Perhaps the test is too synthetic, although it does capture the essence of the changes. Mike, can you post your Wikipedia index somewhere so I can download it, run your previous queries and compare the results?

        Michael McCandless added a comment -

        So how do we proceed?

        The results are definitely highly varying...

It seems like I'm the only one seeing a sizable performance loss with the patch,
and then only with 64-bit JREs (on OS X and Windows Server 2003 x64).

Mark, when you saw no performance loss on 64-bit Linux, was the JRE
64-bit?

        If so, then maybe we should simply proceed with the patch as is.
        These differences are clearly java ghosts and there's not much we can
        do about that....

        The index is a little too large (2.6G) to schlepp around – instead,
        here's the alg I used to create it:

        analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
        
        doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
        
        merge.policy=org.apache.lucene.index.LogDocMergePolicy
        
        docs.file=/Volumes/External/lucene/wiki.txt
        doc.stored = false
        doc.term.vector = false
        doc.add.log.step=1000
        max.field.length=2147483647
        
        directory=FSDirectory
        autocommit=false
        compound=false
        ram.flush.mb = 128
        doc.maker.forever = false
        
        work.dir=/lucene/work
        
        { "Rounds"
          ResetSystemErase
          { "BuildIndex"
            - CreateIndex
             { "AddDocs" AddDoc > : *
            - CloseIndex
          }
          NewRound
        } : 1
        
        RepSumByPrefRound BuildIndex
        
        Mark Miller added a comment -

Yes, both 64-bit versions - OpenJDK 6 and Sun Java 1.5. I appeared to be getting the same results with both JVMs, patched or not. I figured I'd try a bit of profiling, since I have a 64-bit setup, but it doesn't appear I'd learn much. I'm going to try a bit more testing tonight for the heck of it - I've got Sun 1.6 and a 32-bit 1.5 I could check with as well.

        Michael McCandless added a comment -

        Attached new patch:

        • Changed members & methods in TopFieldCollector from "protected" to
          package-private.
        • Tweaked javadocs, CHANGES.txt
        • Removed some dead code, nocommits
        • Re-added TestTimeLimitedCollector

        Besides the java ghosts, for which we will close our eyes and hope
        they disappear, I think this is ready to go in!

I'll wait a few days and then commit.

        Michael McCandless added a comment -

New patch which just fixes contrib/spatial's cutover to the new API, cutting it over further to the newest API.

        Michael McCandless added a comment -

        I wonder if we should break out tracking of max score, which is far more costly, from tracking scores of hits inserted into the queue?

Typically the number of inserts is very low (ie, the queue "converges" quickly), so only tracking scores of inserted hits would be much cheaper than also tracking max score.

        Shai Erera added a comment -

That's actually what's done in TopScoreDocCollector. maxScore is not tracked at all; only when topDocs() is called is maxScore set to the largest element in the queue. That has some overhead in topDocs() (it might require popping all elements, even if the requested range is "give me the last 10"), which I personally don't care about since it happens once per search.
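
A rough sketch of that lazy computation (illustrative only; not the actual TopScoreDocCollector code):

// Drain the queue only when topDocs() is called; pq is a HitQueue ordered
// by score, with the least competitive hit on top.
ScoreDoc[] results = new ScoreDoc[pq.size()];
for (int i = results.length - 1; i >= 0; i--) {
  results[i] = (ScoreDoc) pq.pop(); // pop() yields the worst hits first
}
float maxScore = results.length == 0 ? Float.NaN : results[0].score;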
        maxScore is only tracked in TopFieldCollector, which indeed adds several instructions to each collect, including a call to Math.max.
        Were you thinking along those lines? If so, I can remove maxScore tracking from TFC and compute it in topDocs only.

Another idea, which would avoid popping all the elements out of the queue just to compute maxScore, is to actually make it a method of TopDocsCollector (e.g., TDC.maxScore()). It will be the responsibility of the implementation (with TDC providing a base for all) to ensure that if the largest element is extracted, maxScore is set (so that we don't lose it). The reason I'm thinking about it is that maxScore is probably used only for normalization, and if a certain application never uses it, why should we compute it?

        What do you think?

        Michael McCandless added a comment -

        maxScore is only tracked in TopFieldCollector, which indeed adds several instructions to each collect, including a call to Math.max.

        Right, but most costly is the call to Scorer.score(). If you didn't
        care to know the maxScore, but still wanted the raw score for each hit,
        it'd be much cheaper to only call Scorer.score() if the hit is
        competitive (ie, is inserted into the pq).
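
In sketch form, deferring the score() call looks like this (a hypothetical collect() body; compareBottom and insert are illustrative stand-ins, and docID tie-breaking is omitted):

public void collect(int doc) throws IOException {
  totalHits++;
  if (queueFull && comparator.compareBottom(doc) <= 0) {
    return; // hit is not competitive, so Scorer.score() is never called
  }
  float score = scorer.score(); // pay for scoring only on a (potential) insert
  insert(doc, score);           // hypothetical helper that updates the pq
}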

        Were you thinking along those lines? If so, I can remove maxScore tracking from TFC and compute it in topDocs only.

        We can't do that because that's a change in behavior (ie, maxScore
        includes max across all hits, not just the final topN).

The reason I'm thinking about it is that maxScore is probably used only for normalization, and if a certain application never uses it, why should we compute it?

        Right that's my reasoning too. So I think we should be able to separately turn off "track max score".

        Shai Erera added a comment -

        Right ... so basically we're talking about changes in just TFC:

        1. Add another parameter to TFC.create - trackMaxScore (default to true in IndexSearcher, but change to false in 3.0).
        2. Keeping the non-scoring collectors as they are.
        3. Create a ScoringNoMaxScoreCollector, which extends NonScoringCollector. It will compute score only if the doc has a chance to enter the queue. This means implementing collect() entirely, and not relying on super.collect.
        4. Rename ScoringCollector to ScoringMaxScoreCollector and leave it as is: track both scores and maxScore and call super.collect().

        I figure there's no reason to create a NonScoringTrackMaxScore, since if we're calling score() to track maxScore, there's no reason not to set it on the docs that enter the pq.

        This does not affect TopScoreDocCollector.
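
As a usage sketch, opting out might look like this (the factory signature shown is assumed for illustration; the exact parameter list was still being settled):

TopFieldCollector collector = TopFieldCollector.create(
    sort, 10,
    true,    // fillFields
    true,    // trackDocScores: per-hit scores are still recorded
    false);  // trackMaxScore: skip the per-collect max-score bookkeeping
searcher.search(query, collector);
TopDocs results = collector.topDocs();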

        Michael McCandless added a comment -

        Sounds right! Wanna update the patch w/ that?

        Shai Erera added a comment -

Of course!

        Michael McCandless added a comment -

        I came across another simple search optimization I think we should do:
        if IndexSearcher can guarantee it visits docs in docID order (ie, turn
        off the "sort by numDocs()" now done by default, that was added in
        LUCENE-1483), then in TopFieldCollector and TopScoreDocCollector we
        can remove careful tie-breaking in the "equals" cases.

        This means in TopScoreDocCollector we could change this:

        } else if (score >= reusableSD.score) {
        

        (inside collect) to:

        } else if (score > reusableSD.score) {
        

        because if the new hit's score is == it can't possibly compete since
        its docID is guaranteed to be greater than what's in the queue.

        Likewise, in TopFieldCollector.OneComparatorNonScoringCollector, we
        can change:

        if (cmp < 0 || (cmp == 0 && doc + docBase > bottom.docID)) {
          return;
        }
        

        to:

        if (cmp <= 0) {
          return;
        }
        

        and in MultiComparatorNonScoringCollector, we can change:

        } else if (i == comparators.length - 1) {
          // This is the equals case.
          if (doc + docBase > bottom.docID) {
            // Definitely not competitive
            return;
          }
          break;
        }
        

        to:

        } else if (i == comparators.length - 1) {
          return;
        }
        

        (or possibly refactor this one to return if c == 0 on exiting the
        loop).

        We do still need the "break tie" logic in the lessThan methods of
        HitQueue/FieldValueHitQueue, since PQ uses this to up/down heap itself
        after we insert/delete.

        Shai Erera added a comment -

Hey Mike. I actually planned to open another issue about optimizations to TSDC. Besides what you propose, I think that if we pre-fill the pq with dummy objects (with score = Float.NEGATIVE_INF), we can skip the "if (reusableSD == null)" checks and always assume the queue is full. We can even optimize further by not calling pq.insertWithOverflow, but instead changing top() and calling adjustTop().

I do think, though, that these belong in a different issue, since this one is about the refactoring. What do you say I add the maxScore changes we agreed on to this issue, then open a new one, "optimizations to TSDC and TFC", and include those changes there? I also want to do perf measurements, ensure the unit tests cover everything, etc. This issue already contains many changes.
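
For reference, the sentinel idea in sketch form (names assumed; this is not the committed code):

// Pre-fill the queue so collect() never needs a null/size check.
HitQueue pq = new HitQueue(numHits);
for (int i = 0; i < numHits; i++) {
  pq.put(new ScoreDoc(Integer.MAX_VALUE, Float.NEGATIVE_INFINITY));
}
// In collect(), the queue is always "full": just test against the top.
ScoreDoc bottom = (ScoreDoc) pq.top();
if (score > bottom.score) {
  bottom.doc = doc + docBase; // reuse the sentinel object in place
  bottom.score = score;
  pq.adjustTop();             // restore heap order after mutating top()
}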

        Michael McCandless added a comment -

        OK that sounds like a good plan.

        I'd be interested to see if pre-filling with sentinels gives better performance, too.

        Here's another one to add: some methods in Sort explicitly add SortField.FIELD_DOC as a "tie breaker" for the last SortField. But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in).

        Shai Erera added a comment -

Does HitQueue favor documents with smaller ids? In case the score is equal, I mean ... if not (I was under the impression larger doc Ids are favored), then if the scores are equal, shouldn't the new doc replace the current doc? Isn't that why the >= exists there now (otherwise it could be just >)? Or was it added as part of 1483, since it wasn't guaranteed documents are returned in increasing order?

        Michael McCandless added a comment -

        Does HitQueue favor documents with smaller ids?

Yes. In its lessThan method, when the scores are equal, it does this:

        return hitA.doc > hitB.doc; 
        

        Eg, hitB is "favored" (lessThan returns false) if it has a smaller doc than hitA.

        Or was it added as part of 1483, since it wasn't guaranteed documents are returned in increasing order?

        This was added as part of LUCENE-1483, but can be optimized away once we go back to in-order docID processing from IndexSearcher.

        Michael McCandless added a comment -

        Eg, hitB is "favored" (lessThan returns false) if it has a smaller doc than hitA.

        Duh, I got that backwards. This stuff is hard to think about!

Corrected: if hitB has a smaller doc than hitA, lessThan returns true, meaning hitB is preferred.

        Shai Erera added a comment -
        • Adds the ScoringNoMaxScore collectors
        • Adds some tests to TestSort in order to test that functionality
• Fixes a bug which has existed since this issue - when maxScore is set to Float.NaN, Math.max always returns NaN. Therefore I changed the ScoringMaxScore collectors to initialize it to NEG_INF (to accommodate scorers which assign negative scores to documents).
        • Added "nomaxscore" property to SearchWithSortTask.
• Changed ReadTask.doLogic() to always use search(Query, Collector). The reason is that currently, if scoring is set to true, it uses the default search method; however, in 3.0 that method will be changed to not compute scores, and we might forget to change the logic in ReadTask.

BTW, I wonder if we can replace the call to Math.max with just 'if (score > maxScore)'? Looking at Math.max, it checks if the first parameter is NaN (which I assume Scorer.score() will not return), and then, if both equal 0.0f, returns their sum; otherwise it returns the bigger one. Calling this method is quite expensive, and I think we will be safe replacing it with 'if'; however, now that maxScore is decoupled as well, only the trackMaxScore collectors suffer from it ....
        Anyway, I added a TODO in case we don't want to change it now.
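
To make the NaN behavior concrete (a tiny self-contained illustration):

float maxScore = Float.NaN;
maxScore = Math.max(maxScore, 0.73f); // still NaN: Math.max propagates NaN
maxScore = Float.NEGATIVE_INFINITY;   // the fix: seed with negative infinity
maxScore = Math.max(maxScore, 0.73f); // now 0.73f, as intended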

        Shai Erera added a comment -

Added another test case to TestSort.

        Michael McCandless added a comment -

        BTW, I wonder if we can replace the call to Math.max with just 'if (score > maxScore)'?

        I would just go ahead and do that.

Looks great! All tests pass. Here's all I found... can you make a
new patch, Shai?

        • Patch is missing the fix to contrib/spatial
        • Did you add a test case verifying maxScore is correct (so that the
          Float.NaN issue would trip the test)?
        • Javadoc of searchWithSort needs to describe nomaxscore param
        • HitCollector isn't deprecated
        Shai Erera added a comment -

        Did you add a test case verifying maxScore is correct (so that the Float.NaN issue would trip the test)?

I added the following tests:

        • testSortWithoutScoreTracking - asserts that ScoreDoc.score is set to Float.NaN as well as maxScore.
        • testSortWithScoreNoMaxScoreTracking - asserts that ScoreDoc.score is not Float.NaN, but maxScore is.
        • testSortWithScoreAndMaxScoreTracking - asserts that both ScoreDoc.score and maxScore are not set to NaN.
        • testSortWithScoreAndMaxScoreTrackingNoResults - asserts that in case of a maxScore tracking collector with 0 results, maxScore is set to Float.NaN, rather than NEG_INF.

        HitCollector isn't deprecated

        Somehow when I applied your patch, this change wasn't taken in. Anyway, I noticed its javadocs also referenced MRHC, so I fixed it also.

        Michael McCandless added a comment -

I added the following tests:

        Great, I'll have a look, thanks!

        Somehow when I applied your patch, this change wasn't taken in. Anyway, I noticed its javadocs also referenced MRHC, so I fixed it also.

        Most likely the hunk was rejected, due to the $Id$ keyword in that javadoc. "svn patch" should fix this.

        Michael McCandless added a comment -

Patch looks good, and all tests pass. I'll commit in a day or two, barring any more sudden improvements.

        Michael McCandless added a comment -

        Thanks Shai!

        Uwe Schindler added a comment -

        Hi,
        Shalin found a backwards-incompatible change in the Searcher abstract class, I noticed this from his SVN comment for SOLR-940 (where he updated to Lucene trunk):

        abstract public void search(Weight weight, Filter filter, Collector results) throws IOException;
        

This should not be abstract, for backwards compatibility; instead it should throw an UnsupportedOperationException or have a default implementation that somehow wraps the Collector using an old HitCollector (not very nice; I do not know how to fix this in any other way). Before 3.0, where this change would be OK, the Javadocs should note that the deprecated HitCollector API will be removed and the Collector variant will be made abstract.
If this method stays abstract, you cannot compile old code or just replace the Lucene jars (this is seldom an issue, as almost nobody creates private implementations of Searcher, but Solr does...).
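
For concreteness, the non-abstract alternative described here would look something like this (a sketch, not what was committed):

// A concrete default keeps old subclasses compiling but fails at runtime.
public void search(Weight weight, Filter filter, Collector collector)
    throws IOException {
  throw new UnsupportedOperationException(
      "Subclasses of Searcher must override search(Weight, Filter, Collector)");
}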

        Shai Erera added a comment -

If you read somewhere above, you'll see that we've discussed this change. It seems that whatever we do, anyone who upgrades to 2.9 will need to touch his code. If you extend Searcher, you'll need to override that new method, regardless of what we choose to do. That's because if it's abstract, you need to implement it, and if it's concrete (throwing UOE), you'll need to override it since all the Searcher methods call this one at the end.

        I'm not sure wrapping a Collector with HitCollector will work, because all of the other classes now expect Collector, and their HitCollector variants call the Collector one with a HitCollectorWrapper (which is also deprecated).

        We agreed that making it abstract is the lesser of all evils, since you'll spot it at compile time, rather than at runtime, when you'll hit a UOE.

        Michael McCandless added a comment -

        Shalin found a backwards-incompatible change in the Searcher abstract class

We could go either way on this... the evils were strong with either choice, and we struggled and eventually went with adding the abstract method today, for the reasons Shai enumerated.

        Uwe Schindler added a comment -

OK, I'll resolve the issue again. I was just wondering and wanted to be sure. I also missed the first entry in CHANGES.txt, which explains it.
It is the same as with the Fieldable interface in the past: it is seldom implemented/overridden, so the "normal" user will not be affected. And those who implement Fieldable or extend Searcher must implement it.


People

  Assignee: Michael McCandless
  Reporter: Shai Erera