Lucene - Core
  1. Lucene - Core
  2. LUCENE-3282

BlockJoinQuery: Allow to add a custom child collector, and customize the parent bitset extraction

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4, 4.0-ALPHA
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      It would be nice to allow to add a custom child collector to the BlockJoinQuery to be called on every matching doc (so we can do things with it, like counts and such). Also, allow to extend BlockJoinQuery to have a custom code that converts the filter bitset to an OpenBitSet.

      1. LUCENE-3282.patch
        9 kB
        Shay Banon
      2. LUCENE-3282.patch
        7 kB
        Shay Banon

        Activity

        Shay Banon created issue -
        Shay Banon made changes -
        Field Original Value New Value
        Attachment LUCENE-3282.patch [ 12485519 ]
        Hide
        Michael McCandless added a comment -

        This looks great Shay!

        What was the use case for subclassing to translate the filter into OBS? Is it a custom filter cache? Makes me nervous because the app really should create & reuse this OBS filter, usually...

        On the Collector: we try to keep our Querys IR-state-free... so it makes me nervous to stick a Collector right on the Query. Can we add a CollectorProvider that the Query invokes when it makes the Weight/Scorer?

        Instead of NoOpCollector can we just check for null?

        Show
        Michael McCandless added a comment - This looks great Shay! What was the use case for subclassing to translate the filter into OBS? Is it a custom filter cache? Makes me nervous because the app really should create & reuse this OBS filter, usually... On the Collector: we try to keep our Querys IR-state-free... so it makes me nervous to stick a Collector right on the Query. Can we add a CollectorProvider that the Query invokes when it makes the Weight/Scorer? Instead of NoOpCollector can we just check for null?
        Hide
        Shay Banon added a comment -

        Heya,

        In my app, I have a wrapper around OBS, that has a common interface that allows to access bits by index (similar to Bits in trunk), so I need to extract from it the OBS.

        Regarding the Collector, I will work on CollectorProvider interface. I liked the NoOpCollector option since then you don't have to check for nulls each time...

        Show
        Shay Banon added a comment - Heya, In my app, I have a wrapper around OBS, that has a common interface that allows to access bits by index (similar to Bits in trunk), so I need to extract from it the OBS. Regarding the Collector, I will work on CollectorProvider interface. I liked the NoOpCollector option since then you don't have to check for nulls each time...
        Hide
        Shay Banon added a comment -

        New version, with CollectorProvider.

        Show
        Shay Banon added a comment - New version, with CollectorProvider.
        Shay Banon made changes -
        Attachment LUCENE-3282.patch [ 12486124 ]
        Hide
        Michael McCandless added a comment -

        Thanks Shay... looking close, but:

        Mulling on this more... I think having the BlockJoinQuery invoke
        the child collector is not the right place, because docs collected
        here don't necessarily match the overall query being executed. And
        also we can miss some docs, eg we don't collect in advance.

        Should we instead move the child collector into BlockJoinCollector?
        It has access to all the scorers for BlockJoinQuery involved in the
        parent query, and to all child docs for each parent doc collected.

        Show
        Michael McCandless added a comment - Thanks Shay... looking close, but: Mulling on this more... I think having the BlockJoinQuery invoke the child collector is not the right place, because docs collected here don't necessarily match the overall query being executed. And also we can miss some docs, eg we don't collect in advance. Should we instead move the child collector into BlockJoinCollector? It has access to all the scorers for BlockJoinQuery involved in the parent query, and to all child docs for each parent doc collected.
        Hide
        Shay Banon added a comment -

        The idea of this is to collect matching child docs regardless of what matches parent wise, and yea, we might miss some depending on the type of query that is actually "wrapping" it, but I think its still useful.

        Show
        Shay Banon added a comment - The idea of this is to collect matching child docs regardless of what matches parent wise, and yea, we might miss some depending on the type of query that is actually "wrapping" it, but I think its still useful.
        Hide
        Michael McCandless added a comment -

        I really don't like that this child collector would be buggy (lose results when involved in a parent query that uses advance); this will cause problems for users, asking why some hits are missing.

        Maybe, instead, we could make a generic wrapper class, taking any Query and a CollectorProvider, producing a Query, so that all hits "visited" by the sub-query are sent to the collector, with clear warnings that this collector will hit false positives (it'll see hits that don't match the top-level Query) and false negatives (it'll miss hits that did match the wrapped Query)?

        How are you using this child collector such that the false positives/negatives aren't a problem? EG do you know the parent query will never use advance?

        Show
        Michael McCandless added a comment - I really don't like that this child collector would be buggy (lose results when involved in a parent query that uses advance); this will cause problems for users, asking why some hits are missing. Maybe, instead, we could make a generic wrapper class, taking any Query and a CollectorProvider, producing a Query, so that all hits "visited" by the sub-query are sent to the collector, with clear warnings that this collector will hit false positives (it'll see hits that don't match the top-level Query) and false negatives (it'll miss hits that did match the wrapped Query)? How are you using this child collector such that the false positives/negatives aren't a problem? EG do you know the parent query will never use advance?
        Hide
        Shay Banon added a comment -

        Hi, sorry for the late response, I the comment.

        Yea, I agree that there will be false positives, but thats the idea of it (sometimes you want to run facets for example on "sub queries"). Btw, I got your point on advance, do you think if a collector exists, then advance should be implemented by iterating over all docs up to the provided doc to advance to.

        Regarding the wrapper, interesting!. I need to have a look at how to generalize it, but it should be simple, I think, I'll try and work on it.

        Show
        Shay Banon added a comment - Hi, sorry for the late response, I the comment. Yea, I agree that there will be false positives, but thats the idea of it (sometimes you want to run facets for example on "sub queries"). Btw, I got your point on advance, do you think if a collector exists, then advance should be implemented by iterating over all docs up to the provided doc to advance to. Regarding the wrapper, interesting!. I need to have a look at how to generalize it, but it should be simple, I think, I'll try and work on it.

          People

          • Assignee:
            Unassigned
            Reporter:
            Shay Banon
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development