Lucene - Core
  1. Lucene - Core
  2. LUCENE-2542

TopDocsCollector should be abstract super class that is the real "TopDocsCollector" contract, a subclass should implement the priority-queue logic. e.g. PQTopDocsCollector

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      TopDocsCollector is both an abstract interface for producing TopDocs as well as a PriorityQueue based implementation.
      Not all Collectors that could produce TopDocs must use a PriorityQueue, and it would be advantageous to allow the TopDocsCollector to be an "interface" type abstract class, with a PQTopDocsCollector sub-class.
      While doing this, it'd be good to clean up the generics uses in these classes. As it's odd to create a TopFieldCollector and have to case the TopDocs object, when this can be fixed with generics.

      1. LUCENE-2542.patch
        3 kB
        Woody Anderson
      2. LUCENE-2542.patch
        5 kB
        Woody Anderson
      3. LUCENE-2542.patch
        38 kB
        Woody Anderson
      4. LUCENE_3.0.2-2542.patch
        35 kB
        Woody Anderson

        Activity

        Hide
        Woody Anderson added a comment -

        I'm providing this patch in case you want to have this fix for 3.0.2 based lucene branch. It breaks the back compatibility test, b/c i changed the 'meaning' of TopDocsCollector. You could probably introduce a new class that is a superclass of TDC, but that makes code/files less readable and less representative.
        So, since few people are probably patching 3.0.2 in this case anyway, it's here only for those few want that. There will be another patch for 4.0 release target, where the back compatibility is not enforced.

        Show
        Woody Anderson added a comment - I'm providing this patch in case you want to have this fix for 3.0.2 based lucene branch. It breaks the back compatibility test, b/c i changed the 'meaning' of TopDocsCollector. You could probably introduce a new class that is a superclass of TDC, but that makes code/files less readable and less representative. So, since few people are probably patching 3.0.2 in this case anyway, it's here only for those few want that. There will be another patch for 4.0 release target, where the back compatibility is not enforced.
        Hide
        Woody Anderson added a comment -

        this patch was generated against lucene/trunk

        ~/g/lucene-trunk -> svn info
        Path: .
        URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
        Repository Root: http://svn.apache.org/repos/asf
        Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
        Revision: 964467
        

        I generated the diff with svn, but did not "add with history" when i moved TopDocs* to PQTopDocs*, this allows the diff to apply cleanly with patch, but is not optimal for svn history management.

        with svn i guess, it's technically better to preserve the history and svn merge, which can track file adds w/ history etc.
        i track my local changes with git, which does that automatically, so if there is a "preferred" way to generate patches wrt to svn that can actually apply with patch i'll generate that way.

        Or if there is a way to apply a history preserving patch with svn, i'd love to know what it is. And i can figure out how to jostle that diff into my git repo on my own.

        Show
        Woody Anderson added a comment - this patch was generated against lucene/trunk ~/g/lucene-trunk -> svn info Path: . URL: http: //svn.apache.org/repos/asf/lucene/dev/trunk Repository Root: http: //svn.apache.org/repos/asf Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68 Revision: 964467 I generated the diff with svn, but did not "add with history" when i moved TopDocs* to PQTopDocs*, this allows the diff to apply cleanly with patch, but is not optimal for svn history management. with svn i guess, it's technically better to preserve the history and svn merge, which can track file adds w/ history etc. i track my local changes with git, which does that automatically, so if there is a "preferred" way to generate patches wrt to svn that can actually apply with patch i'll generate that way. Or if there is a way to apply a history preserving patch with svn, i'd love to know what it is. And i can figure out how to jostle that diff into my git repo on my own.
        Hide
        Shai Erera added a comment -

        I am not sure that we should do this change. The purpose behind TDC was its PQ usage. On top of it we created TSDC and TFC. What is the benefit of having TDC w/o PQ, just so that we can have a PQTFC?

        Collector is the base class for all, and if someone wants to create a non-PQ TDC he can do so ... not sure this should be in Lucene core though.

        Show
        Shai Erera added a comment - I am not sure that we should do this change. The purpose behind TDC was its PQ usage. On top of it we created TSDC and TFC. What is the benefit of having TDC w/o PQ, just so that we can have a PQTFC? Collector is the base class for all, and if someone wants to create a non-PQ TDC he can do so ... not sure this should be in Lucene core though.
        Hide
        Woody Anderson added a comment -

        Well, the idea of "top N documents" is, imo, distinct from how it is implemented, (e.g. with lucene's internal PQ impl).

        I think, at some level we disagree on what TopDocsCollector is "for", you say it's for a PQ implementation; whereas, I think it's for the public interface, namely:
        getTopDocs(..) and getTotalHits(). The that fact that the implementation uses org.apache.lucene.util.PriorityQueue is not necessarily a positive thing for my purposes.

        Alternate impls certainly could subclass Collector directly, but much of the time when getting documents, getting a TopDocs is preferred, and there is value in expressing that capability abstractly. And there is value in expressing accomplishing that w/o being tied it to a single possible implementation. Overall, the PQ based impl is effective, but that doesn't mean that it's always the most effective impl capable of yielding TopDocCollector functionality, and by that i mean the public interface.

        I have uses of this code that require referring abstractly to Collectors that produce TopDocs, doing that without any common interface or abstract class (whole separate issue with the fact that Collector is not in iface & ditto TDC) requires wrapping a TopDocsCollector in a proxy, which is just kind of silly to me.

        Also, i'm entirely missing the downside?

        Show
        Woody Anderson added a comment - Well, the idea of "top N documents" is, imo, distinct from how it is implemented, (e.g. with lucene's internal PQ impl). I think, at some level we disagree on what TopDocsCollector is "for", you say it's for a PQ implementation; whereas, I think it's for the public interface, namely: getTopDocs(..) and getTotalHits(). The that fact that the implementation uses org.apache.lucene.util.PriorityQueue is not necessarily a positive thing for my purposes. Alternate impls certainly could subclass Collector directly, but much of the time when getting documents, getting a TopDocs is preferred, and there is value in expressing that capability abstractly. And there is value in expressing accomplishing that w/o being tied it to a single possible implementation. Overall, the PQ based impl is effective, but that doesn't mean that it's always the most effective impl capable of yielding TopDocCollector functionality, and by that i mean the public interface. I have uses of this code that require referring abstractly to Collectors that produce TopDocs, doing that without any common interface or abstract class (whole separate issue with the fact that Collector is not in iface & ditto TDC) requires wrapping a TopDocsCollector in a proxy, which is just kind of silly to me. Also, i'm entirely missing the downside?
        Hide
        Shai Erera added a comment -

        First, we didn't make an interface out of Collector or any of its derivatives on purpose - that's the approach that's adopted in many places of the code - when you want to add a method, it's impossible to do it with an interface w/o breaking back-compat, while abstract classes give us the freedom to do so w/ a default impl (when possible).

        You can already sub-class TDC and not rely on its PQ, by passing null to its ctor and override all of the topDocs methods. I agree that looks odd (why pass null in the first place). But then, there is a point where you need to ask yourself "what must be in core Lucene", "what is a nice to have" and "what is completely app-specific usage". Collector is a "must have in core", no arguments about it. Previously, TDC was actually TSDC and had only one derivative – TFC. But a while ago, when the Collector API was changed, TDC was created for TSDC and TFC to override it (rather than TFC overriding TSDC) and a "nice to have" side effect was that app-specific PQ-based collectors can override TDC as well.

        The question I'm asking myself is whether a generic "TDC", which will be entirely abstract – no chance for common impl, perhaps except for getTotalHits(), which is redundant 'cause TopDocs has totalHits already – is a "nice to have" in Lucene. It's certainly not a "must have" ... and it adds complexity more than it simplifies the API. After all, Lucene's external API accepts and handles Collector. Internally it creates TSDC or TFC. It couldn't care less what those two extend.

        It seems to me that only the app will benefit from having this abstraction, and seems to me you can do so w/o having TDC in core Lucene. You can have your own TDC, w/ a wrapper for "Lucene TDCs" and your own extensions ...

        Show
        Shai Erera added a comment - First, we didn't make an interface out of Collector or any of its derivatives on purpose - that's the approach that's adopted in many places of the code - when you want to add a method, it's impossible to do it with an interface w/o breaking back-compat, while abstract classes give us the freedom to do so w/ a default impl (when possible). You can already sub-class TDC and not rely on its PQ, by passing null to its ctor and override all of the topDocs methods. I agree that looks odd (why pass null in the first place). But then, there is a point where you need to ask yourself "what must be in core Lucene", "what is a nice to have" and "what is completely app-specific usage". Collector is a "must have in core", no arguments about it. Previously, TDC was actually TSDC and had only one derivative – TFC. But a while ago, when the Collector API was changed, TDC was created for TSDC and TFC to override it (rather than TFC overriding TSDC) and a "nice to have" side effect was that app-specific PQ-based collectors can override TDC as well. The question I'm asking myself is whether a generic "TDC", which will be entirely abstract – no chance for common impl, perhaps except for getTotalHits(), which is redundant 'cause TopDocs has totalHits already – is a "nice to have" in Lucene. It's certainly not a "must have" ... and it adds complexity more than it simplifies the API. After all, Lucene's external API accepts and handles Collector. Internally it creates TSDC or TFC. It couldn't care less what those two extend. It seems to me that only the app will benefit from having this abstraction, and seems to me you can do so w/o having TDC in core Lucene. You can have your own TDC, w/ a wrapper for "Lucene TDCs" and your own extensions ...
        Hide
        Woody Anderson added a comment -

        I'm pretty sure that you cannot usefully pass null to the TDC constructor due to various final methods (all the public methods are final), e.g.

          /** Returns the top docs that were collected by this collector. */
          public final TopDocs topDocs() {
            // In case pq was populated with sentinel values, there might be less
            // results than pq.size(). Therefore return all results until either
            // pq.size() or totalHits.
            return topDocs(0, totalHits < pq.size() ? totalHits : pq.size());
          }
        

        Which actually begs the question, should the constructor validate that the PQ is non-null, as it clearly cannot produce a TopDocs if it is
        I imagine there were performance reasons that it was made final? or is extension in this case frowned upon?

        At any rate, I agree that this isn't a "must have", i do think that since a lot of useful collectors conform to the TDC contract, it's useful to allow that definition to be extensible to implementations not provided by core lucene. Code "sharing" is not the only reason to allow for oo abstractions, and though the implementations don't all belong in core lucene, the ability to refer to them collectively does have value. I would suggest that the value of the abstraction is there for more developers than just me, otherwise i wouldn't have submitted the patch.

        I agree with your last comment the most, that IF TDC were extensible for my purposes, then that would be far superior to my submitted patch of adding classes.
        We could un-final all the public methods of the TDC, which would basically allow someone to fullfil the contract without using the PQ system TDC is built upon.

        What do you think of that?

        Show
        Woody Anderson added a comment - I'm pretty sure that you cannot usefully pass null to the TDC constructor due to various final methods (all the public methods are final), e.g. /** Returns the top docs that were collected by this collector. */ public final TopDocs topDocs() { // In case pq was populated with sentinel values, there might be less // results than pq.size(). Therefore return all results until either // pq.size() or totalHits. return topDocs(0, totalHits < pq.size() ? totalHits : pq.size()); } Which actually begs the question, should the constructor validate that the PQ is non-null, as it clearly cannot produce a TopDocs if it is I imagine there were performance reasons that it was made final? or is extension in this case frowned upon? At any rate, I agree that this isn't a "must have", i do think that since a lot of useful collectors conform to the TDC contract, it's useful to allow that definition to be extensible to implementations not provided by core lucene. Code "sharing" is not the only reason to allow for oo abstractions, and though the implementations don't all belong in core lucene, the ability to refer to them collectively does have value. I would suggest that the value of the abstraction is there for more developers than just me, otherwise i wouldn't have submitted the patch. I agree with your last comment the most, that IF TDC were extensible for my purposes, then that would be far superior to my submitted patch of adding classes. We could un-final all the public methods of the TDC, which would basically allow someone to fullfil the contract without using the PQ system TDC is built upon. What do you think of that?
        Hide
        Shai Erera added a comment -

        Personally I don't mind removing the final modifier from the methods. I see that jdocs need to be fixed anyway, cause they mention one can override topDocs which is wrong. What do others think?

        Show
        Shai Erera added a comment - Personally I don't mind removing the final modifier from the methods. I see that jdocs need to be fixed anyway, cause they mention one can override topDocs which is wrong. What do others think?
        Hide
        Woody Anderson added a comment -

        simple un-final patch with simple unit test that won't compile if they are final.
        updated javadocs as well. Though, i defer final wording in that regard.

        Show
        Woody Anderson added a comment - simple un-final patch with simple unit test that won't compile if they are final. updated javadocs as well. Though, i defer final wording in that regard.
        Hide
        Shai Erera added a comment -

        Can you please add the TDC impl to JustCompileSearch (instead of the testcase you wrote)? Look at other internal classes there for an example. This class is not a test per se, however it's there to ensure certain extensions are possible.

        Also, can you change the jdoc from "Most extending classes ..." to "Extending classes can override any of the methods to provide their own implementation, as well as avoid the use of the priority queue entirely by passing null to

        {@link #TopDocsCollector(PriorityQueue)}

        . In that case however, you might want to consider overriding all methods, in order to avoid an NullPointerException.

        Then, you can remove the jdoc of the ctor.

        Show
        Shai Erera added a comment - Can you please add the TDC impl to JustCompileSearch (instead of the testcase you wrote)? Look at other internal classes there for an example. This class is not a test per se, however it's there to ensure certain extensions are possible. Also, can you change the jdoc from "Most extending classes ..." to "Extending classes can override any of the methods to provide their own implementation, as well as avoid the use of the priority queue entirely by passing null to {@link #TopDocsCollector(PriorityQueue)} . In that case however, you might want to consider overriding all methods, in order to avoid an NullPointerException. Then, you can remove the jdoc of the ctor.
        Hide
        Woody Anderson added a comment -

        went with your suggestions verbatim.
        as a "just compile" test, we lose the test/proof that super(null) will not NPE, but that's presumably a fairly minor issue, and was previously not enforced by a test case anyway.
        adding that test would be trivial, we could simply "new JustCompileTopDocsCollector(null)", though that would go against "just compile", otherwise we're back to a test pretty much the same as the previous patch.

        Show
        Woody Anderson added a comment - went with your suggestions verbatim. as a "just compile" test, we lose the test/proof that super(null) will not NPE, but that's presumably a fairly minor issue, and was previously not enforced by a test case anyway. adding that test would be trivial, we could simply "new JustCompileTopDocsCollector(null)", though that would go against "just compile", otherwise we're back to a test pretty much the same as the previous patch.
        Hide
        Shai Erera added a comment -

        Patch looks good. I think it's ready to commit, but it will take me another day or two until I can get to it. So if someone else wants to commit it, please do.

        Show
        Shai Erera added a comment - Patch looks good. I think it's ready to commit, but it will take me another day or two until I can get to it. So if someone else wants to commit it, please do.
        Hide
        Yonik Seeley added a comment -

        committed.

        Show
        Yonik Seeley added a comment - committed.
        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Show
        Grant Ingersoll added a comment - Bulk close for 3.1

          People

          • Assignee:
            Unassigned
            Reporter:
            Woody Anderson
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development