Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.0, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spin-off of LUCENE-5299.

      LUCENE-5229 proposes different changes, some of them being controversial, but there is one of them that I really really like that consists in refactoring the Collector API in order to have a different Collector per segment.

      The idea is, instead of having a single Collector object that needs to be able to take care of all segments, to have a top-level Collector:

      public interface Collector {
      
        AtomicCollector setNextReader(AtomicReaderContext context) throws IOException;
        
      }
      

      and a per-AtomicReaderContext collector:

      public interface AtomicCollector {
      
        void setScorer(Scorer scorer) throws IOException;
      
        void collect(int doc) throws IOException;
      
        boolean acceptsDocsOutOfOrder();
      
      }
      

      I think it makes the API clearer since it is now obious setScorer and acceptDocsOutOfOrder need to be called after setNextReader which is otherwise unclear.

      It also makes things more flexible. For example, a collector could much more easily decide to use different strategies on different segments. In particular, it makes the early-termination collector much cleaner since it can return different atomic collectors implementations depending on whether the current segment is sorted or not.

      Even if we have lots of collectors all over the place, we could make it easier to migrate by having a Collector that would implement both Collector and AtomicCollector, return this in setNextReader and make current concrete Collector implementations extend this class instead of directly extending Collector.

      1. LUCENE-5527.patch
        205 kB
        Adrien Grand
      2. LUCENE-5527.patch
        201 kB
        Adrien Grand

        Activity

        Hide
        David Smiley added a comment -

        +1 I like it!

        Show
        David Smiley added a comment - +1 I like it!
        Hide
        Shai Erera added a comment -

        Maybe we can get rid of setScorer, passing Scorer to setNextReader(AtomicReaderContext,Scorer)?

        Show
        Shai Erera added a comment - Maybe we can get rid of setScorer, passing Scorer to setNextReader(AtomicReaderContext,Scorer) ?
        Hide
        Adrien Grand added a comment -

        Here is a patch, all tests pass.

        As much as possible, I tried to only change the API, not implementations so top-docs collectors, etc. still work exactly the same way. The helper SimpleCollector implements both Collector and AtomicCollector and returns itself in setNextReader. It is useful for collectors that don't need to change their behavior based on the current reader context such as TotalHitCountCollector, but it is also quite handy for the migration since migrating a Collector to the new API is just a matter of extending SimpleCollector instead of Collector and renaming setNextReader to doSetNextReader.

        I had initially thought about doing this change for 5.0 only but since the migration ends up being easy and since Collector is an experimental API, we could also have it for 4.8.

        Feedback is highly welcome!

        Maybe we can get rid of setScorer, passing Scorer to setNextReader(AtomicReaderContext,Scorer)?

        This can't be done today because there is a circular dependency between the Collector and the Scorer. If you look at IndexSearcher.searchsearch(List<AtomicReaderContext>, Weight, Collector), it needs to know whether the collector supports out-of-order collection in order to know which BulkScorer to instantiate, and then the BulkScorer needs to let the Collector know about the Scorer that is being used.

        Show
        Adrien Grand added a comment - Here is a patch, all tests pass. As much as possible, I tried to only change the API, not implementations so top-docs collectors, etc. still work exactly the same way. The helper SimpleCollector implements both Collector and AtomicCollector and returns itself in setNextReader . It is useful for collectors that don't need to change their behavior based on the current reader context such as TotalHitCountCollector , but it is also quite handy for the migration since migrating a Collector to the new API is just a matter of extending SimpleCollector instead of Collector and renaming setNextReader to doSetNextReader . I had initially thought about doing this change for 5.0 only but since the migration ends up being easy and since Collector is an experimental API, we could also have it for 4.8. Feedback is highly welcome! Maybe we can get rid of setScorer, passing Scorer to setNextReader(AtomicReaderContext,Scorer)? This can't be done today because there is a circular dependency between the Collector and the Scorer . If you look at IndexSearcher.searchsearch(List<AtomicReaderContext>, Weight, Collector) , it needs to know whether the collector supports out-of-order collection in order to know which BulkScorer to instantiate, and then the BulkScorer needs to let the Collector know about the Scorer that is being used.
        Hide
        Robert Muir added a comment -

        In my opinion we should rethink the naming here.

        Collector now no longer actually collects anything, and there is no 'atomic' property about AtomicCollector...

        Show
        Robert Muir added a comment - In my opinion we should rethink the naming here. Collector now no longer actually collects anything, and there is no 'atomic' property about AtomicCollector...
        Hide
        Adrien Grand added a comment -

        I chose these names to make it clear what these classes relate to:

        • IndexReader -> Collector
        • AtomicReader -> AtomicCollector

        But I'm totally open to better names!

        Show
        Adrien Grand added a comment - I chose these names to make it clear what these classes relate to: IndexReader -> Collector AtomicReader -> AtomicCollector But I'm totally open to better names!
        Hide
        Robert Muir added a comment -

        That makes sense. Honestly I think my issue is not specific to Collector, but more general.

        I guess I wish AtomicReader were named LeafReader.

        So maybe we shouldnt try to tackle that on this issue.

        Show
        Robert Muir added a comment - That makes sense. Honestly I think my issue is not specific to Collector, but more general. I guess I wish AtomicReader were named LeafReader. So maybe we shouldnt try to tackle that on this issue.
        Hide
        Ryan Ernst added a comment -

        I guess I wish AtomicReader were named LeafReader.
        So maybe we shouldnt try to tackle that on this issue.

        I agree 'atomic' is a bad name. Why not name this LeafCollector now, since it makes more sense on its own?

        Show
        Ryan Ernst added a comment - I guess I wish AtomicReader were named LeafReader. So maybe we shouldnt try to tackle that on this issue. I agree 'atomic' is a bad name. Why not name this LeafCollector now, since it makes more sense on its own?
        Hide
        Adrien Grand added a comment -

        I like LeafCollector better too. So let's name it this way now and open an issue to rename AtomicReader to LeafReader in 5.0?

        Show
        Adrien Grand added a comment - I like LeafCollector better too. So let's name it this way now and open an issue to rename AtomicReader to LeafReader in 5.0?
        Hide
        Ryan Ernst added a comment -

        +1

        Show
        Ryan Ernst added a comment - +1
        Hide
        Adrien Grand added a comment -

        I opened LUCENE-5569 to discuss the renaming.

        Show
        Adrien Grand added a comment - I opened LUCENE-5569 to discuss the renaming.
        Hide
        Michael McCandless added a comment -

        +1 for LeafCollector and the patch.

        I tested if there are search performance impacts from this:

        Report after iter 10:
                            Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
                         Respell       49.44      (3.3%)       48.10      (3.7%)   -2.7% (  -9% -    4%)
                          Fuzzy2       46.74      (3.2%)       45.73      (3.1%)   -2.2% (  -8% -    4%)
                          Fuzzy1       59.25      (3.7%)       58.08      (3.5%)   -2.0% (  -8% -    5%)
                          IntNRQ        3.42      (3.8%)        3.40      (3.8%)   -0.7% (  -7% -    7%)
                         Prefix3       86.67      (2.6%)       86.17      (2.6%)   -0.6% (  -5% -    4%)
                 LowSloppyPhrase       44.44      (2.3%)       44.42      (2.5%)   -0.1% (  -4% -    4%)
                        Wildcard       19.08      (3.5%)       19.07      (3.0%)   -0.1% (  -6% -    6%)
                      AndHighMed       34.38      (1.0%)       34.38      (1.0%)   -0.0% (  -2% -    2%)
                     LowSpanNear       10.41      (3.1%)       10.41      (2.3%)    0.0% (  -5% -    5%)
                HighSloppyPhrase        3.49      (7.9%)        3.49      (6.6%)    0.1% ( -13% -   15%)
                     AndHighHigh       28.35      (1.1%)       28.39      (1.0%)    0.1% (  -1% -    2%)
                     MedSpanNear       31.06      (2.8%)       31.12      (2.7%)    0.2% (  -5% -    5%)
                      AndHighLow      391.44      (2.9%)      392.73      (2.6%)    0.3% (  -5% -    6%)
                 MedSloppyPhrase        3.54      (5.2%)        3.56      (4.6%)    0.4% (  -8% -   10%)
                       OrHighMed       26.51      (4.0%)       26.66      (5.7%)    0.6% (  -8% -   10%)
                    OrHighNotLow       24.84      (4.1%)       24.98      (5.8%)    0.6% (  -9% -   10%)
                       LowPhrase       13.19      (1.6%)       13.27      (2.3%)    0.6% (  -3% -    4%)
                       OrHighLow       18.78      (4.1%)       18.91      (5.8%)    0.7% (  -8% -   11%)
                   OrNotHighHigh        8.87      (4.5%)        8.93      (6.0%)    0.7% (  -9% -   11%)
                    OrHighNotMed       30.63      (4.1%)       30.85      (5.5%)    0.7% (  -8% -   10%)
                      OrHighHigh        8.21      (4.1%)        8.27      (5.8%)    0.7% (  -8% -   11%)
                       MedPhrase      203.10      (6.6%)      204.77      (6.3%)    0.8% ( -11% -   14%)
                   OrHighNotHigh       11.09      (4.5%)       11.18      (5.9%)    0.8% (  -9% -   11%)
                         LowTerm      322.74      (5.6%)      325.67      (5.6%)    0.9% (  -9% -   12%)
                        HighTerm       63.88     (12.8%)       64.55     (12.2%)    1.1% ( -21% -   29%)
                         MedTerm      100.19      (9.8%)      101.31      (9.5%)    1.1% ( -16% -   22%)
                    HighSpanNear        8.09      (4.0%)        8.18      (4.9%)    1.1% (  -7% -   10%)
                      HighPhrase        4.27      (7.1%)        4.32      (6.5%)    1.2% ( -11% -   15%)
                    OrNotHighMed       19.00      (7.0%)       19.30      (7.6%)    1.6% ( -12% -   17%)
                    OrNotHighLow       19.63      (7.4%)       19.96      (8.0%)    1.7% ( -12% -   18%)
        

        Looks like just noise!

        Show
        Michael McCandless added a comment - +1 for LeafCollector and the patch. I tested if there are search performance impacts from this: Report after iter 10: Task QPS base StdDev QPS comp StdDev Pct diff Respell 49.44 (3.3%) 48.10 (3.7%) -2.7% ( -9% - 4%) Fuzzy2 46.74 (3.2%) 45.73 (3.1%) -2.2% ( -8% - 4%) Fuzzy1 59.25 (3.7%) 58.08 (3.5%) -2.0% ( -8% - 5%) IntNRQ 3.42 (3.8%) 3.40 (3.8%) -0.7% ( -7% - 7%) Prefix3 86.67 (2.6%) 86.17 (2.6%) -0.6% ( -5% - 4%) LowSloppyPhrase 44.44 (2.3%) 44.42 (2.5%) -0.1% ( -4% - 4%) Wildcard 19.08 (3.5%) 19.07 (3.0%) -0.1% ( -6% - 6%) AndHighMed 34.38 (1.0%) 34.38 (1.0%) -0.0% ( -2% - 2%) LowSpanNear 10.41 (3.1%) 10.41 (2.3%) 0.0% ( -5% - 5%) HighSloppyPhrase 3.49 (7.9%) 3.49 (6.6%) 0.1% ( -13% - 15%) AndHighHigh 28.35 (1.1%) 28.39 (1.0%) 0.1% ( -1% - 2%) MedSpanNear 31.06 (2.8%) 31.12 (2.7%) 0.2% ( -5% - 5%) AndHighLow 391.44 (2.9%) 392.73 (2.6%) 0.3% ( -5% - 6%) MedSloppyPhrase 3.54 (5.2%) 3.56 (4.6%) 0.4% ( -8% - 10%) OrHighMed 26.51 (4.0%) 26.66 (5.7%) 0.6% ( -8% - 10%) OrHighNotLow 24.84 (4.1%) 24.98 (5.8%) 0.6% ( -9% - 10%) LowPhrase 13.19 (1.6%) 13.27 (2.3%) 0.6% ( -3% - 4%) OrHighLow 18.78 (4.1%) 18.91 (5.8%) 0.7% ( -8% - 11%) OrNotHighHigh 8.87 (4.5%) 8.93 (6.0%) 0.7% ( -9% - 11%) OrHighNotMed 30.63 (4.1%) 30.85 (5.5%) 0.7% ( -8% - 10%) OrHighHigh 8.21 (4.1%) 8.27 (5.8%) 0.7% ( -8% - 11%) MedPhrase 203.10 (6.6%) 204.77 (6.3%) 0.8% ( -11% - 14%) OrHighNotHigh 11.09 (4.5%) 11.18 (5.9%) 0.8% ( -9% - 11%) LowTerm 322.74 (5.6%) 325.67 (5.6%) 0.9% ( -9% - 12%) HighTerm 63.88 (12.8%) 64.55 (12.2%) 1.1% ( -21% - 29%) MedTerm 100.19 (9.8%) 101.31 (9.5%) 1.1% ( -16% - 22%) HighSpanNear 8.09 (4.0%) 8.18 (4.9%) 1.1% ( -7% - 10%) HighPhrase 4.27 (7.1%) 4.32 (6.5%) 1.2% ( -11% - 15%) OrNotHighMed 19.00 (7.0%) 19.30 (7.6%) 1.6% ( -12% - 17%) OrNotHighLow 19.63 (7.4%) 19.96 (8.0%) 1.7% ( -12% - 18%) Looks like just noise!
        Hide
        Hoss Man added a comment -

        This isn't an area of Lucene i tend to think much about, but if we're talking about changing the API, i have a few questions i wanted to put out there for discussion:

        • is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext)
        • Should we bite the bullet and make LeafCollector and Collector extend Closable ?
          • I believe this was discussed at one point before, and it was considered an obnoxious interface change with minimal gain - but if we're going to change the interface API now anyway, it may be worth reconsidering
          • Solr works around this by always using DelegatingCollector which has a finish() method - i can't imagine other lucene apps don't have similar wishes for close() like this.
        • Should we go ahead and think about if/how this API should be tweaked (now or in the future) to allow a Collector to indicate that it's LeafCollectors can be used to collect documents from different segments in parallel threads? If we think a marker interface would be the best way to go, so we can easily add that later – but i wanted to ask the question
          • One possibility that just occurred to me: Document right now that LeafCollectors may be processed in parallel, and that if you want to force single threaded collection you should implement Collector.getLeafCollector(AtomicReaderContext) such that it blocks until the previously returned LeafCollector has been closed (which should be easy enough to do in things like SimpleCollector using a basic sync lock, right?)

        (I have no strong feelings about most of this)

        Show
        Hoss Man added a comment - This isn't an area of Lucene i tend to think much about, but if we're talking about changing the API, i have a few questions i wanted to put out there for discussion: is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext) Should we bite the bullet and make LeafCollector and Collector extend Closable ? I believe this was discussed at one point before, and it was considered an obnoxious interface change with minimal gain - but if we're going to change the interface API now anyway, it may be worth reconsidering Solr works around this by always using DelegatingCollector which has a finish() method - i can't imagine other lucene apps don't have similar wishes for close() like this. Should we go ahead and think about if/how this API should be tweaked (now or in the future) to allow a Collector to indicate that it's LeafCollectors can be used to collect documents from different segments in parallel threads? If we think a marker interface would be the best way to go, so we can easily add that later – but i wanted to ask the question One possibility that just occurred to me: Document right now that LeafCollectors may be processed in parallel, and that if you want to force single threaded collection you should implement Collector.getLeafCollector(AtomicReaderContext) such that it blocks until the previously returned LeafCollector has been closed (which should be easy enough to do in things like SimpleCollector using a basic sync lock, right?) (I have no strong feelings about most of this)
        Hide
        Adrien Grand added a comment -

        is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext)

        I think this name is better indeed!

        Should we bite the bullet and make LeafCollector and Collector extend Closable ?

        I would like having such callbacks too. LUCENE-4370 discusses such a change and the challenges.

        Should we go ahead and think about if/how this API should be tweaked (now or in the future) to allow a Collector to indicate that it's LeafCollectors can be used to collect documents from different segments in parallel threads

        This was the main point of LUCENE-5299 that I forked this issue/patch from. But I think this is more controversial: I think that in general, it is a better idea to shard the index (even locally) and merge per-shard results using something like TopDocs.merge than trying to collect segments concurrently because it allows to have individual tasks of equal cost.

        Show
        Adrien Grand added a comment - is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext) I think this name is better indeed! Should we bite the bullet and make LeafCollector and Collector extend Closable ? I would like having such callbacks too. LUCENE-4370 discusses such a change and the challenges. Should we go ahead and think about if/how this API should be tweaked (now or in the future) to allow a Collector to indicate that it's LeafCollectors can be used to collect documents from different segments in parallel threads This was the main point of LUCENE-5299 that I forked this issue/patch from. But I think this is more controversial: I think that in general, it is a better idea to shard the index (even locally) and merge per-shard results using something like TopDocs.merge than trying to collect segments concurrently because it allows to have individual tasks of equal cost.
        Hide
        Shikhar Bhushan added a comment -

        Thanks for picking this up Adrien! I always wanted to push forward at least the API refactoring but did not get the chance to do so.

        +1 on adding a method like LeafCollector.done() / finish() or such, and making that part of the usage contract.

        It's not just Solr with DelegatingCollector that has something like this, I think I remember seeing this pattern even in ES.

        LUCENE-5299 had this as a SubCollector.done() method and it led to a lot of code-cleanup at various places where we were trying to detect a transition to the next segment based on a call to setNextReader(). In some cases, the result finalization was being done lazily when result retrieval methods are being called, because there is no other good way of knowing that the last segment has been processed.

        Show
        Shikhar Bhushan added a comment - Thanks for picking this up Adrien! I always wanted to push forward at least the API refactoring but did not get the chance to do so. +1 on adding a method like LeafCollector.done() / finish() or such, and making that part of the usage contract. It's not just Solr with DelegatingCollector that has something like this, I think I remember seeing this pattern even in ES. LUCENE-5299 had this as a SubCollector.done() method and it led to a lot of code-cleanup at various places where we were trying to detect a transition to the next segment based on a call to setNextReader(). In some cases, the result finalization was being done lazily when result retrieval methods are being called, because there is no other good way of knowing that the last segment has been processed.
        Hide
        Adrien Grand added a comment -

        Thanks for picking this up Adrien! I always wanted to push forward at least the API refactoring but did not get the chance to do so.

        You're welcome.

        It's not just Solr with DelegatingCollector that has something like this, I think I remember seeing this pattern even in ES.

        Indeed there is! I think it would be nice to have it in Lucene, I would just like to keep this issue contained in order to make it easier to make progress. We can iterate on LUCENE-4370 once this one is in!

        Show
        Adrien Grand added a comment - Thanks for picking this up Adrien! I always wanted to push forward at least the API refactoring but did not get the chance to do so. You're welcome. It's not just Solr with DelegatingCollector that has something like this, I think I remember seeing this pattern even in ES. Indeed there is! I think it would be nice to have it in Lucene, I would just like to keep this issue contained in order to make it easier to make progress. We can iterate on LUCENE-4370 once this one is in!
        Hide
        Hoss Man added a comment -

        Indeed there is! I think it would be nice to have it in Lucene, I would just like to keep this issue contained in order to make it easier to make progress.

        Understood – but my point is that already in this issue:

        • LeafCollector is completely new, so it can have whatever methods we think it needs – if it should have close() let's give it close.
        • we're discussing major changes to both the api and the functional semantics of Collector – so if the semantics are on the table, and we think having close() is an appropriate semantic for Collector to have (in it's new context as a think that returns LeafCollectors) then lets make sure that's on the table as well.
        Show
        Hoss Man added a comment - Indeed there is! I think it would be nice to have it in Lucene, I would just like to keep this issue contained in order to make it easier to make progress. Understood – but my point is that already in this issue: LeafCollector is completely new, so it can have whatever methods we think it needs – if it should have close() let's give it close. we're discussing major changes to both the api and the functional semantics of Collector – so if the semantics are on the table, and we think having close() is an appropriate semantic for Collector to have (in it's new context as a think that returns LeafCollectors ) then lets make sure that's on the table as well.
        Hide
        Adrien Grand added a comment -

        The patch only moves 3 methods of the Collector class into a new LeafCollector class that gets instantiated per-segment. It doesn't add any functionnality but just changes the way documents are collected.

        I think having callbacks upon collection completion would be useful and help clean up code as Shikhar Bhushan mentionned. But this LeafCollector refactoring has already been stalled in LUCENE-5299 because LUCENE-5299 also discussed parallelizing collection so I would like to keep this change as small as possible and get it in.

        Show
        Adrien Grand added a comment - The patch only moves 3 methods of the Collector class into a new LeafCollector class that gets instantiated per-segment. It doesn't add any functionnality but just changes the way documents are collected. I think having callbacks upon collection completion would be useful and help clean up code as Shikhar Bhushan mentionned. But this LeafCollector refactoring has already been stalled in LUCENE-5299 because LUCENE-5299 also discussed parallelizing collection so I would like to keep this change as small as possible and get it in.
        Hide
        Yonik Seeley added a comment -

        I had initially thought about doing this change for 5.0 only but since the migration ends up being easy and since Collector is an experimental API, we could also have it for 4.8.

        That surprised me that it was marked as "experimental"! It's been around in it's current form for 3 years, and anyone doing anything "interesting" with Lucene would most likely be using it. Perhaps this change should only be targeted toward trunk (5.0)?

        Show
        Yonik Seeley added a comment - I had initially thought about doing this change for 5.0 only but since the migration ends up being easy and since Collector is an experimental API, we could also have it for 4.8. That surprised me that it was marked as "experimental"! It's been around in it's current form for 3 years, and anyone doing anything "interesting" with Lucene would most likely be using it. Perhaps this change should only be targeted toward trunk (5.0)?
        Hide
        Uwe Schindler added a comment -

        That surprised me that it was marked as "experimental"! It's been around in it's current form for 3 years, and anyone doing anything "interesting" with Lucene would most likely be using it. Perhaps this change should only be targeted toward trunk (5.0)?

        +1 - Yes, please. This would make upgrading a pain for lot of people. Like the renaming issue, I would only ever do this for 5.0. Why should we do it before 5.0? It is mainly a refactoring.

        Show
        Uwe Schindler added a comment - That surprised me that it was marked as "experimental"! It's been around in it's current form for 3 years, and anyone doing anything "interesting" with Lucene would most likely be using it. Perhaps this change should only be targeted toward trunk (5.0)? +1 - Yes, please. This would make upgrading a pain for lot of people. Like the renaming issue, I would only ever do this for 5.0. Why should we do it before 5.0? It is mainly a refactoring.
        Hide
        Adrien Grand added a comment -

        Uwe, are you talking about the LeafReader or the LeafCollector renaming?

        Show
        Adrien Grand added a comment - Uwe, are you talking about the LeafReader or the LeafCollector renaming?
        Hide
        Hoss Man added a comment -

        The patch only moves 3 methods of the Collector class into a new LeafCollector class that gets instantiated per-segment. It doesn't add any functionnality but just changes the way documents are collected.

        sure ... but "in for a penny in for a pound" ... this change is going to require every existing implementation of "Collector" to change, so if we're already going to require that of all existing classes, why not also add a close() method (that can easily be a No-Op for most existing implementations) at the same time?

        I think having callbacks upon collection completion would be useful and help clean up code as Shikhar Bhushan mentionned. But this LeafCollector refactoring has already been stalled in LUCENE-5299 because LUCENE-5299 also discussed parallelizing collection so I would like to keep this change as small as possible and get it in.

        Fair enough – as i said, i don't have strong opinions about any of this, i just wanted to raise the question. But personally i think that adding Closable here and now as part of this API change issue makes sense (since it's about the "life cycle" of the Collector/LeafCollector) independent of any question of parallelization., Even if parallelization is controversial, adding close() doesn't seem to be, so why not do all of the known desirable, non-controversial, Collector API changes at once?

        If i'm wrong, and there is some dissent about whether close() makes sense on these classes, then by all means yes: ignore the suggestion and focus purely on what you've got in the current patch.

        Show
        Hoss Man added a comment - The patch only moves 3 methods of the Collector class into a new LeafCollector class that gets instantiated per-segment. It doesn't add any functionnality but just changes the way documents are collected. sure ... but "in for a penny in for a pound" ... this change is going to require every existing implementation of "Collector" to change, so if we're already going to require that of all existing classes, why not also add a close() method (that can easily be a No-Op for most existing implementations) at the same time? I think having callbacks upon collection completion would be useful and help clean up code as Shikhar Bhushan mentionned. But this LeafCollector refactoring has already been stalled in LUCENE-5299 because LUCENE-5299 also discussed parallelizing collection so I would like to keep this change as small as possible and get it in. Fair enough – as i said, i don't have strong opinions about any of this, i just wanted to raise the question. But personally i think that adding Closable here and now as part of this API change issue makes sense (since it's about the "life cycle" of the Collector/LeafCollector) independent of any question of parallelization., Even if parallelization is controversial, adding close() doesn't seem to be, so why not do all of the known desirable, non-controversial, Collector API changes at once? If i'm wrong, and there is some dissent about whether close() makes sense on these classes, then by all means yes: ignore the suggestion and focus purely on what you've got in the current patch.
        Hide
        Adrien Grand added a comment -

        I think there are things to discuss regarding proper wrapping and how to deal with exceptions that are thrown during collection: should close/finish be called in case an exception is thrown during collection (do we want to do the same in case of an IOException or a TimeExceededException)?

        Show
        Adrien Grand added a comment - I think there are things to discuss regarding proper wrapping and how to deal with exceptions that are thrown during collection: should close/finish be called in case an exception is thrown during collection (do we want to do the same in case of an IOException or a TimeExceededException)?
        Hide
        Robert Muir added a comment -

        Even if parallelization is controversial, adding close() doesn't seem to be, so why not do all of the known desirable, non-controversial, Collector API changes at once?

        Its controversial. Its going to be some work to do it correctly, and its arguably completely unnecessary (after IS.search returns, you are done with the collector).

        I described these problems here: LUCENE-4370

        If we can add close() in a way thats consistent, fine. But adding this "lifecycle" stuff in a way that is inconsistent is very bad. We need semantics that can be relied upon.

        Show
        Robert Muir added a comment - Even if parallelization is controversial, adding close() doesn't seem to be, so why not do all of the known desirable, non-controversial, Collector API changes at once? Its controversial. Its going to be some work to do it correctly, and its arguably completely unnecessary (after IS.search returns, you are done with the collector). I described these problems here: LUCENE-4370 If we can add close() in a way thats consistent, fine. But adding this "lifecycle" stuff in a way that is inconsistent is very bad. We need semantics that can be relied upon.
        Hide
        Hoss Man added a comment -

        Its controversial. ...

        Fair enough, i retract the suggestion.

        Off my list of questions, that just leaves...

        is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext)

        ...which Adrien seemed to be on board with. Any other concerns?

        Show
        Hoss Man added a comment - Its controversial. ... Fair enough, i retract the suggestion. Off my list of questions, that just leaves... is setNextReader really an appropriate name anymore? or should it be something like getLeafCollector(AtomicReaderContext) ...which Adrien seemed to be on board with. Any other concerns?
        Hide
        Robert Muir added a comment -

        I dont think we have to "give up" on the close() idea. it just has its own issue and its own set of problems. probably a more annoying/larger issue than this one to get right. I think its ok to just tackle it there...

        Show
        Robert Muir added a comment - I dont think we have to "give up" on the close() idea. it just has its own issue and its own set of problems. probably a more annoying/larger issue than this one to get right. I think its ok to just tackle it there...
        Hide
        Hoss Man added a comment -

        I dont think we have to "give up" on the close() idea.

        Well, I only brought it up because i didn't realize the full context of existing discussion that already took place in LUCENE-5299 and it seemed like an "easy" win if we were already going to change the api/lifecycle of Collector – but i don't want to derail Adrien's current goal.

        If, as you say, ...

        ...it (adding Closable) just has its own issue and its own set of problems....

        ...that seems like good grounds for it to be tackled in it's own Jira, and let this one (splitting out LeafCollector) move forward first.

        (Adrien: sorry for stiring up so much shit )

        Show
        Hoss Man added a comment - I dont think we have to "give up" on the close() idea. Well, I only brought it up because i didn't realize the full context of existing discussion that already took place in LUCENE-5299 and it seemed like an "easy" win if we were already going to change the api/lifecycle of Collector – but i don't want to derail Adrien's current goal. If, as you say, ... ...it (adding Closable) just has its own issue and its own set of problems.... ...that seems like good grounds for it to be tackled in it's own Jira, and let this one (splitting out LeafCollector) move forward first. (Adrien: sorry for stiring up so much shit )
        Hide
        Adrien Grand added a comment -

        Chris Hostetter (Unused) no worries!

        Michael McCandless Thanks for running this benchmark!

        Here is an updated patch:

        • AtomicCollector has been renamed to LeafCollector
        • Collector.setNextReader has been renamed to Collector.getLeafCollector.

        precommit and all tests pass. I plan to commit this patch to 5.0 only for the reasons that Yonik and Uwe mentioned.

        Show
        Adrien Grand added a comment - Chris Hostetter (Unused) no worries! Michael McCandless Thanks for running this benchmark! Here is an updated patch: AtomicCollector has been renamed to LeafCollector Collector.setNextReader has been renamed to Collector.getLeafCollector . precommit and all tests pass. I plan to commit this patch to 5.0 only for the reasons that Yonik and Uwe mentioned.
        Hide
        ASF subversion and git services added a comment -

        Commit 1584747 from jpountz@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1584747 ]

        LUCENE-5527: Refactor Collector API to use a dedicated Collector per leaf.

        Show
        ASF subversion and git services added a comment - Commit 1584747 from jpountz@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1584747 ] LUCENE-5527 : Refactor Collector API to use a dedicated Collector per leaf.
        Hide
        Adrien Grand added a comment -

        Thanks everyone!

        Show
        Adrien Grand added a comment - Thanks everyone!
        Hide
        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Show
        Anshum Gupta added a comment - Bulk close after 5.0 release.

          People

          • Assignee:
            Adrien Grand
            Reporter:
            Adrien Grand
          • Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development