Details

    • Type: Task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 7.0, 6.5
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This collector uses the getWeight() and getChildren() methods from the passed in Scorer, which are not always available (eg. disjunctions expose fake scorers) hence the need for a dedicated IndexSearcher (ToParentBlockJoinIndexSearcher). Given that this is the only collector in this case, I would like to remove it.

      1. LUCENE_6959.patch
        83 kB
        Martijn van Groningen
      2. LUCENE_6959.patch
        76 kB
        Martijn van Groningen
      3. LUCENE_6959.patch
        76 kB
        Martijn van Groningen
      4. LUCENE_6959.patch
        76 kB
        Martijn van Groningen
      5. LUCENE_6959.patch
        77 kB
        Martijn van Groningen
      6. LUCENE-6959.patch
        59 kB
        Adrien Grand

        Issue Links

          Activity

          Hide
          mkhludnev Mikhail Khludnev added a comment -

          Here is the recently arrived BlockJoinFacetCollector also calls getChildren(). It seems it works under SolrIndexSearcher, and it's not easy to use a dedicated searcher in Solr.

          Show
          mkhludnev Mikhail Khludnev added a comment - Here is the recently arrived BlockJoinFacetCollector also calls getChildren(). It seems it works under SolrIndexSearcher, and it's not easy to use a dedicated searcher in Solr.
          Hide
          mkhludnev Mikhail Khludnev added a comment -

          Note. SOLR-8643 removed BlockJoinFacetCollector which calls scorer getChildren()

          Show
          mkhludnev Mikhail Khludnev added a comment - Note. SOLR-8643 removed BlockJoinFacetCollector which calls scorer getChildren()
          Hide
          jpountz Adrien Grand added a comment -

          This collector is also blocking improvements to block-join queries since it prevents them from using two-phase iteration: LUCENE-7654.

          Show
          jpountz Adrien Grand added a comment - This collector is also blocking improvements to block-join queries since it prevents them from using two-phase iteration: LUCENE-7654 .
          Hide
          jpountz Adrien Grand added a comment -

          Here is a patch that removes it. The class was exprimental, so I think it is fine to remove in 6.x, but if there are concerns, I am ok with waiting for 7.0.

          Show
          jpountz Adrien Grand added a comment - Here is a patch that removes it. The class was exprimental, so I think it is fine to remove in 6.x, but if there are concerns, I am ok with waiting for 7.0.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          +1 To remove this collector in the master and 6x branches.

          As a follow up issue we can add back to ability to include child docs, but in a different way than is done today. A subsequent search after the main search, that selects child docs for specific parents. For example like is done here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/fetch/subphase/InnerHitsContext.java#L160

          Show
          martijn.v.groningen Martijn van Groningen added a comment - +1 To remove this collector in the master and 6x branches. As a follow up issue we can add back to ability to include child docs, but in a different way than is done today. A subsequent search after the main search, that selects child docs for specific parents. For example like is done here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/fetch/subphase/InnerHitsContext.java#L160
          Hide
          mikemccand Michael McCandless added a comment -

          I agree it's messy that one must use a special IndexSearcher to use this collector, and that it's overly invasive that it needs access to child scorers, and we should find a better way.

          This collector enables you to get the matching child hits when running a joined search against child hits up to parent documents. This is very important for people "joining up" with document blocks, e.g. I use this in http://jirasearch.mikemccandless.com, where child documents are each jira comment, and parent documents are the whole jira issue.

          I'd like to understand a bit better how exactly we can re-implement this functionality once we remove the collector. That ES query class seems to be created for each parent doc that made the top N hits, right? And then you search with that query, which is very quick since it will hone in on just the children of that one parent, and build up your child hits that way ... I think this is indeed a better approach!

          It's sort of weird to remove this class without first making the cleaner implementation available, but since neither Solr nor ES use it, I think it's OK (and it is experimental). Can we at least open an issue to add the cleaner option? I can try to tackle it, certainly when I next upgrade jirasearch Thanks, +1 to remove in 6.x and 7.0.

          Show
          mikemccand Michael McCandless added a comment - I agree it's messy that one must use a special IndexSearcher to use this collector, and that it's overly invasive that it needs access to child scorers, and we should find a better way. This collector enables you to get the matching child hits when running a joined search against child hits up to parent documents. This is very important for people "joining up" with document blocks, e.g. I use this in http://jirasearch.mikemccandless.com , where child documents are each jira comment, and parent documents are the whole jira issue. I'd like to understand a bit better how exactly we can re-implement this functionality once we remove the collector. That ES query class seems to be created for each parent doc that made the top N hits, right? And then you search with that query, which is very quick since it will hone in on just the children of that one parent, and build up your child hits that way ... I think this is indeed a better approach! It's sort of weird to remove this class without first making the cleaner implementation available, but since neither Solr nor ES use it, I think it's OK (and it is experimental). Can we at least open an issue to add the cleaner option? I can try to tackle it, certainly when I next upgrade jirasearch Thanks, +1 to remove in 6.x and 7.0.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment - - edited

          I've modified Adrien's patch and ported the ES query mentioned in my previous comment to Lucene join module.

          I'd like to understand a bit better how exactly we can re-implement this functionality once we remove the collector. That ES query class seems to be created for each parent doc that made the top N hits, right?

          Yes, that is what it is doing.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - - edited I've modified Adrien's patch and ported the ES query mentioned in my previous comment to Lucene join module. I'd like to understand a bit better how exactly we can re-implement this functionality once we remove the collector. That ES query class seems to be created for each parent doc that made the top N hits, right? Yes, that is what it is doing.
          Hide
          mikemccand Michael McCandless added a comment -

          Oooh, thank you Martijn van Groningen! I'll review the patch...

          Show
          mikemccand Michael McCandless added a comment - Oooh, thank you Martijn van Groningen ! I'll review the patch...
          Hide
          mikemccand Michael McCandless added a comment -

          Thanks Martijn van Groningen.

          I think it's dangerous that we hold onto a LeafReader in the new ParentChildrenBlockJoinQuery?

          Can we maybe change the new query to instead hold the parent's docID in the top-level reader's space, and then in the scorer method, check the incoming reader context to see if this is the segment that holds the parent? This would also simplify usage, so users wouldn't have to create their own weights? Then I think you don't need the LeafReader reference.

          Also, the TestBlockJoin tests got a little over decimated I think Can we restore at least some of the places that were verifying children? Or maybe we could make a simple sugar API that returns TopGroups again, and then we wouldn't need to change the tests (except to switch to this sugar API)?

          Show
          mikemccand Michael McCandless added a comment - Thanks Martijn van Groningen . I think it's dangerous that we hold onto a LeafReader in the new ParentChildrenBlockJoinQuery ? Can we maybe change the new query to instead hold the parent's docID in the top-level reader's space, and then in the scorer method, check the incoming reader context to see if this is the segment that holds the parent? This would also simplify usage, so users wouldn't have to create their own weights? Then I think you don't need the LeafReader reference. Also, the TestBlockJoin tests got a little over decimated I think Can we restore at least some of the places that were verifying children? Or maybe we could make a simple sugar API that returns TopGroups again, and then we wouldn't need to change the tests (except to switch to this sugar API)?
          Hide
          jpountz Adrien Grand added a comment -

          I think I'm the one responsible for the decimated tests. Agreed it would be nice to restore them using an API that is able to compute top groups.

          Show
          jpountz Adrien Grand added a comment - I think I'm the one responsible for the decimated tests. Agreed it would be nice to restore them using an API that is able to compute top groups.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Can we maybe change the new query to instead hold the parent's docID in the top-level reader's space, and then in the scorer method, check the incoming reader context to see if this is the segment that holds the parent? This would also simplify usage, so users wouldn't have to create their own weights? Then I think you don't need the LeafReader reference.

          +1 That is much better.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Can we maybe change the new query to instead hold the parent's docID in the top-level reader's space, and then in the scorer method, check the incoming reader context to see if this is the segment that holds the parent? This would also simplify usage, so users wouldn't have to create their own weights? Then I think you don't need the LeafReader reference. +1 That is much better.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          I've changed the ParentChildrenBlockJoinQuery query, so that it no longer requires to accept a LeafReader and it figures out by itself which leaf reader the parent doc is in.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - I've changed the ParentChildrenBlockJoinQuery query, so that it no longer requires to accept a LeafReader and it figures out by itself which leaf reader the parent doc is in.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          I wonder if we should continue using TopGroups in the TestBlockJoin test case? If we stop using it then we can remove the the dependency on the grouping module the join module has.

          The `TopGroups` has logic for merging other groups, but in case for block join there should be no need for that as the parent and its children are always in the same Lucene index. That makes sense, because in grouping arbitrary document can belong to a group.

          Mike, would just using a ScoreDoc with TopDocs be sufficient to represent a jira issue with comments in jirasearch?

          Show
          martijn.v.groningen Martijn van Groningen added a comment - I wonder if we should continue using TopGroups in the TestBlockJoin test case? If we stop using it then we can remove the the dependency on the grouping module the join module has. The `TopGroups` has logic for merging other groups, but in case for block join there should be no need for that as the parent and its children are always in the same Lucene index. That makes sense, because in grouping arbitrary document can belong to a group. Mike, would just using a ScoreDoc with TopDocs be sufficient to represent a jira issue with comments in jirasearch?
          Hide
          mikemccand Michael McCandless added a comment -

          Mike, would just using a ScoreDoc with TopDocs be sufficient to represent a jira issue with comments in jirasearch?

          I think that's fine? I agree it would be nice to remove the dependency.

          I'll have a look at the patch! Thanks for iterating.

          Show
          mikemccand Michael McCandless added a comment - Mike, would just using a ScoreDoc with TopDocs be sufficient to represent a jira issue with comments in jirasearch? I think that's fine? I agree it would be nice to remove the dependency. I'll have a look at the patch! Thanks for iterating.
          Hide
          mikemccand Michael McCandless added a comment -

          The new ParentChildrenBlockJoinQuery looks great! Much easier to use.

          Do we really need the origChildQuery anymore?

          Show
          mikemccand Michael McCandless added a comment - The new ParentChildrenBlockJoinQuery looks great! Much easier to use. Do we really need the origChildQuery anymore?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment - - edited

          Do we really need the origChildQuery anymore?

          In case the query gets rewritten and if the query happens to get cached?

          Show
          martijn.v.groningen Martijn van Groningen added a comment - - edited Do we really need the origChildQuery anymore? In case the query gets rewritten and if the query happens to get cached?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment - - edited

          I was wrong, `origChildQuery` can be removed as rewritten queries are used as cache key in query cache.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - - edited I was wrong, `origChildQuery` can be removed as rewritten queries are used as cache key in query cache.
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Removed `origChildQuery ` field.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Removed `origChildQuery ` field.
          Hide
          jpountz Adrien Grand added a comment -

          I think the new query should take childQuery into account for equals/hashcode? This makes me wonder whether we really need to wrap the child query: we could just provide a query that matches all children for a given parent and then users could build conjunctions using BooleanQuery to decide what particular child documents should match?

          Show
          jpountz Adrien Grand added a comment - I think the new query should take childQuery into account for equals/hashcode? This makes me wonder whether we really need to wrap the child query: we could just provide a query that matches all children for a given parent and then users could build conjunctions using BooleanQuery to decide what particular child documents should match?
          Hide
          jpountz Adrien Grand added a comment -

          Actually scratch that last comment, that would probably make the feature too hard to use, I think a wrapper query like you propose is better. Some questions/comments:

          • should it take the childQuery into account for equals/hashcode?
          • it looks buggy to me that we do not convert parentDocId to parentDocId-context.docBase in the scorer?
          • you use ConstantScoreWeight but then return a Scorer that actually scores, you should extend Weight directly instead.
          // If parentDocId == 0 then we a parent doc doesn't have child docs
          

          Let's remove the "we"?

          Show
          jpountz Adrien Grand added a comment - Actually scratch that last comment, that would probably make the feature too hard to use, I think a wrapper query like you propose is better. Some questions/comments: should it take the childQuery into account for equals/hashcode? it looks buggy to me that we do not convert parentDocId to parentDocId-context.docBase in the scorer? you use ConstantScoreWeight but then return a Scorer that actually scores, you should extend Weight directly instead. // If parentDocId == 0 then we a parent doc doesn't have child docs Let's remove the "we"?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          I've updated the patch. Thanks for reviewing!

          should it take the childQuery into account for equals/hashcode?

          Oops, I forgot to add that back when removing `origChildQuery `.

          it looks buggy to me that we do not convert parentDocId to parentDocId-context.docBase in the scorer?

          Good catch. I didn't catch this in the initially, but after running the provided test in the patch a 100 times it did fail, because the `parentDocId` wasn't converted.

          you use ConstantScoreWeight but then return a Scorer that actually scores, you should extend Weight directly instead.

          Good point, I've changed that.

          Let's remove the "we"?

          Done.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - I've updated the patch. Thanks for reviewing! should it take the childQuery into account for equals/hashcode? Oops, I forgot to add that back when removing `origChildQuery `. it looks buggy to me that we do not convert parentDocId to parentDocId-context.docBase in the scorer? Good catch. I didn't catch this in the initially, but after running the provided test in the patch a 100 times it did fail, because the `parentDocId` wasn't converted. you use ConstantScoreWeight but then return a Scorer that actually scores, you should extend Weight directly instead. Good point, I've changed that. Let's remove the "we"? Done.
          Hide
          mikemccand Michael McCandless added a comment -

          New patch looks great, except for the TestBlockJoin tests ... I can take a crack at putting back some of the child hit checking there, if you all haven't started on that yet?

          Show
          mikemccand Michael McCandless added a comment - New patch looks great, except for the TestBlockJoin tests ... I can take a crack at putting back some of the child hit checking there, if you all haven't started on that yet?
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          I can take a crack at putting back some of the child hit checking there, if you all haven't started on that yet?

          I can add that back.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - I can take a crack at putting back some of the child hit checking there, if you all haven't started on that yet? I can add that back.
          Hide
          mikemccand Michael McCandless added a comment -

          I can add that back.

          Thanks Martijn van Groningen!

          Show
          mikemccand Michael McCandless added a comment - I can add that back. Thanks Martijn van Groningen !
          Hide
          martijn.v.groningen Martijn van Groningen added a comment -

          Michael McCandless I added back the child hit checking in TestBlockJoin test suite.

          Show
          martijn.v.groningen Martijn van Groningen added a comment - Michael McCandless I added back the child hit checking in TestBlockJoin test suite.
          Hide
          jpountz Adrien Grand added a comment -

          +1

          Show
          jpountz Adrien Grand added a comment - +1
          Hide
          mikemccand Michael McCandless added a comment -

          +1, thank you Martijn van Groningen!

          Show
          mikemccand Michael McCandless added a comment - +1, thank you Martijn van Groningen !
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 91a620793731ba1ae50d730df5381867bf2fdf3f in lucene-solr's branch refs/heads/branch_6x from Martijn van Groningen
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=91a6207 ]

          LUCENE-6959: Removed ToParentBlockJoinCollector in favour of ParentChildrenBlockJoinQuery, that can return the matching children documents per parent document.
          This query should be executed for each matching parent document after the main query has been executed.

          Show
          jira-bot ASF subversion and git services added a comment - Commit 91a620793731ba1ae50d730df5381867bf2fdf3f in lucene-solr's branch refs/heads/branch_6x from Martijn van Groningen [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=91a6207 ] LUCENE-6959 : Removed ToParentBlockJoinCollector in favour of ParentChildrenBlockJoinQuery, that can return the matching children documents per parent document. This query should be executed for each matching parent document after the main query has been executed.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit e327efb676e04f72c39e902f08c0d11497b4c57d in lucene-solr's branch refs/heads/master from Martijn van Groningen
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e327efb ]

          LUCENE-6959: Removed ToParentBlockJoinCollector in favour of ParentChildrenBlockJoinQuery, that can return the matching children documents per parent document.
          This query should be executed for each matching parent document after the main query has been executed.

          Show
          jira-bot ASF subversion and git services added a comment - Commit e327efb676e04f72c39e902f08c0d11497b4c57d in lucene-solr's branch refs/heads/master from Martijn van Groningen [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e327efb ] LUCENE-6959 : Removed ToParentBlockJoinCollector in favour of ParentChildrenBlockJoinQuery, that can return the matching children documents per parent document. This query should be executed for each matching parent document after the main query has been executed.

            People

            • Assignee:
              jpountz Adrien Grand
              Reporter:
              jpountz Adrien Grand
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development