Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 6.0
    • Fix Version/s: master (7.0)
    • Component/s: search
    • Labels:
      None

      Description

      Currently it is not possible to use RankQuery [1] and Grouping [2] together (see also [3]). In some situations Grouping can be replaced by Collapse and Expand Results [4] (that supports reranking), but i) collapse cannot guarantee that at least a minimum number of groups will be returned for a query, and ii) in the Solr Cloud setting you will have constraints on how to partition the documents among the shards.

      I'm going to start working on supporting RankQuery in grouping. I'll start attaching a patch with a test that fails because grouping does not support the rank query and then I'll try to fix the problem, starting from the non distributed setting (GroupingSearch).

      My feeling is that since grouping is mostly performed by Lucene, RankQuery should be refactored and moved (or partially moved) there.

      Any feedback is welcome.

      [1] https://cwiki.apache.org/confluence/display/solr/RankQuery+API
      [2] https://cwiki.apache.org/confluence/display/solr/Result+Grouping
      [3] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201507.mbox/%3CCAHM-LpuvsPEsT-Sw63_8a6gt-wOr6dS_T_Nb2rOpe93e4+sTNQ@mail.gmail.com%3E
      [4] https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

        Issue Links

          Activity

          Hide
          joel.bernstein Joel Bernstein added a comment -

          Moving the RankQuery into Lucene is a showstopper I believe because of the MergeStrategy which is very specific to how Solr merges documents from the shards.

          There are portions of the grouping code in Solr. It might be best to first start looking to see if you can get the RankQuery added on the Solr side.

          Show
          joel.bernstein Joel Bernstein added a comment - Moving the RankQuery into Lucene is a showstopper I believe because of the MergeStrategy which is very specific to how Solr merges documents from the shards. There are portions of the grouping code in Solr. It might be best to first start looking to see if you can get the RankQuery added on the Solr side.
          Hide
          diegoceccarelli Diego Ceccarelli added a comment -

          add a unit test that fails since grouping ignores the RankQuery

          Show
          diegoceccarelli Diego Ceccarelli added a comment - add a unit test that fails since grouping ignores the RankQuery
          Hide
          diegoceccarelli Diego Ceccarelli added a comment - - edited

          Joel Bernstein thanks for pointing out about the MergeStrategy. I uploaded a new patch with a first step. I agree that merge strategy must stay there, that's why I wrote "partially moved" as well as there's IndexSearcher and SolrIndexSearcher, I moved RankQuery into Lucene and created SolrRankQuery. The reason is that the RankQuery works by manipulating the collector, through this method:

          public abstract TopDocsCollector getTopDocsCollector(int len, QueryCommand cmd, IndexSearcher searcher) throws IOException;
          

          At the moment in SolrIndexSearcher there's a special case if a query is a RankQuery,

            private TopDocsCollector buildTopDocsCollector(int len, QueryCommand cmd) throws IOException {
          
              Query q = cmd.getQuery();
              if (q instanceof RankQuery) {
                RankQuery rq = (RankQuery) q;
                return rq.getTopDocsCollector(len, cmd, this);
              }
              ..
          

          Instead of creating a top collector using the TopScoreDocCollector.create, we wrap a topScoreCollector into a 'RankQuery collector'.

          Let me remind that grouping works in two separate stages:

          • in the first stage, we iterate on the documents scoring them and keep a map <group -> score> where score is the highest score of a document in the group (the map contains only the TOP-k groups with the highest scores);
          • for each group, the documents in the group are ranked and TOP-n documents for each group are returned.

          This logic is mainly implemented into Abstract(First|Second)PassGroupingCollector (within Lucene).

          We should probably discuss what means reranking for groups: in my opinion we should keep in mind that the idea behind RankQuery is that you don't want to apply the query to all the documents in the collection, so "group-reranking" should:

          • in the first stage, iterate on the documents scoring them as usual and keep a map <group -> score>;
          • for each group, apply RankQuery to the top documents in the group;
          • rerank the groups according to the new scores.

          In this patch, I'm able to perform 2. I had to move RankQuery into Lucene, because what happens in the AbstractSecondPassGroupingCollector is that for each group a collector is created:

           for (SearchGroup<GROUP_VALUE_TYPE> group : groups) {
                //System.out.println("  prep group=" + (group.groupValue == null ? "null" : group.groupValue.utf8ToString()));
                TopDocsCollector<?> collector;
                if (withinGroupSort.equals(Sort.RELEVANCE)) { // optimize to use TopScoreDocCollector
                  // Sort by score
                  collector = TopScoreDocCollector.create(maxDocsPerGroup);
              ...
          

          ... so no way to 'inject' the RankQuery collector from Solr. Moving the RankQuery into lucene I modified the code in:

                  collector = TopScoreDocCollector.create(maxDocsPerGroup);
                  if (query != null && query instanceof RankQuery){
                    collector = ((RankQuery)query).getTopDocsCollector(collector, null, searcher);
                  }
          

          and now documents in groups are reranked based on the RankQuery scores. I'll work now on 3. i.e., reordering the groups based on the new RankQuery score (I added a new test that fails at the moment).
          Happy to discuss about this first change, if you have comments.

          Minor notes:

          • At the moment SolrRankQuery doesn't extend ExtendedQueryBase, I have to check if it is a problem. Otherwise RankQuery could become an interface maybe.
          • I did some changes to the interface of RankQuery.getTopDocsCollector: QueryCommand was in Solr but used only for getting Sort, len was never used. I added in input the previous collector, instead of creating a new TopDocScore collector inside RankQuery.
          • Please keep in mind that, as starting point, I'm trying to solve the issue in the non distributed setting and if we're grouping on a field.
          Show
          diegoceccarelli Diego Ceccarelli added a comment - - edited Joel Bernstein thanks for pointing out about the MergeStrategy . I uploaded a new patch with a first step. I agree that merge strategy must stay there, that's why I wrote "partially moved" as well as there's IndexSearcher and SolrIndexSearcher , I moved RankQuery into Lucene and created SolrRankQuery . The reason is that the RankQuery works by manipulating the collector, through this method: public abstract TopDocsCollector getTopDocsCollector( int len, QueryCommand cmd, IndexSearcher searcher) throws IOException; At the moment in SolrIndexSearcher there's a special case if a query is a RankQuery , private TopDocsCollector buildTopDocsCollector( int len, QueryCommand cmd) throws IOException { Query q = cmd.getQuery(); if (q instanceof RankQuery) { RankQuery rq = (RankQuery) q; return rq.getTopDocsCollector(len, cmd, this ); } .. Instead of creating a top collector using the TopScoreDocCollector.create , we wrap a topScoreCollector into a 'RankQuery collector'. Let me remind that grouping works in two separate stages: in the first stage, we iterate on the documents scoring them and keep a map <group -> score> where score is the highest score of a document in the group (the map contains only the TOP-k groups with the highest scores); for each group, the documents in the group are ranked and TOP-n documents for each group are returned. This logic is mainly implemented into Abstract(First|Second)PassGroupingCollector (within Lucene). We should probably discuss what means reranking for groups: in my opinion we should keep in mind that the idea behind RankQuery is that you don't want to apply the query to all the documents in the collection, so "group-reranking" should: in the first stage, iterate on the documents scoring them as usual and keep a map <group -> score> ; for each group, apply RankQuery to the top documents in the group; rerank the groups according to the new scores. In this patch, I'm able to perform 2. I had to move RankQuery into Lucene, because what happens in the AbstractSecondPassGroupingCollector is that for each group a collector is created: for (SearchGroup<GROUP_VALUE_TYPE> group : groups) { // System .out.println( " prep group=" + (group.groupValue == null ? " null " : group.groupValue.utf8ToString())); TopDocsCollector<?> collector; if (withinGroupSort.equals(Sort.RELEVANCE)) { // optimize to use TopScoreDocCollector // Sort by score collector = TopScoreDocCollector.create(maxDocsPerGroup); ... ... so no way to 'inject' the RankQuery collector from Solr. Moving the RankQuery into lucene I modified the code in: collector = TopScoreDocCollector.create(maxDocsPerGroup); if (query != null && query instanceof RankQuery){ collector = ((RankQuery)query).getTopDocsCollector(collector, null , searcher); } and now documents in groups are reranked based on the RankQuery scores. I'll work now on 3. i.e., reordering the groups based on the new RankQuery score (I added a new test that fails at the moment). Happy to discuss about this first change, if you have comments. Minor notes: At the moment SolrRankQuery doesn't extend ExtendedQueryBase , I have to check if it is a problem. Otherwise RankQuery could become an interface maybe. I did some changes to the interface of RankQuery.getTopDocsCollector : QueryCommand was in Solr but used only for getting Sort , len was never used. I added in input the previous collector, instead of creating a new TopDocScore collector inside RankQuery . Please keep in mind that, as starting point, I'm trying to solve the issue in the non distributed setting and if we're grouping on a field.
          Hide
          diegoceccarelli Diego Ceccarelli added a comment -

          I uploaded a new patch, now groups are reranked according to the reranking max scores, in the finish() method of the grouping CommandField I added:

          if (result != null && query instanceof RankQuery && groupSort == Sort.RELEVANCE){
                  // if we are sorting for relevance and query is a RankQuery, it may be that
                  // the order of the groups changed, we need to reorder
                  GroupDocs[] groups = result.groups;
                  Arrays.sort(groups, new Comparator<GroupDocs>() {
                    @Override
                    public int compare(GroupDocs o1, GroupDocs o2) {
                        if (o1.maxScore > o2.maxScore) return -1;
                        if (o1.maxScore < o2.maxScore) return 1; 
                        return 0;
                    }});
                }
          

          This will reorder the groups if we re-rank the documents with the rank query. The second test succeeds.

          I'm still thinking what it should be the correct semantic to implement reranking + grouping:

          When you apply a query q and then a rank-query rq , you first score all the documents and then rescore top-N documents with the rank-query. The problem with grouping is that in order to get the top-groups you first need to score the collection: you may have a document that scored really low with q but got a high score with rq, but the only way to find it is to rerank the whole collection (impracticable). There are two possible solutions then:

          • if we want to apply rq on the top 1000 documents, we can collect the groups in the top-1000 documents, and they will be the same obtained scoring directly with rq, but in a different order;
          • we can collect more groups than what we need, and then rerank the top documents in each group - I would call this solution: *Group Reranking*.

          In my opinion group reranking is a better solution: imagine we have a group containing the top-1000 documents ranked with q we will rerank them maybe just to return one document. I guess the best would be, assuming that we want to apply rerank query to N documents and return the top K groups you can retrieve top K*y groups and then rerank N/(K*y) documents in each group.

          Show
          diegoceccarelli Diego Ceccarelli added a comment - I uploaded a new patch, now groups are reranked according to the reranking max scores, in the finish() method of the grouping CommandField I added: if (result != null && query instanceof RankQuery && groupSort == Sort.RELEVANCE){ // if we are sorting for relevance and query is a RankQuery, it may be that // the order of the groups changed, we need to reorder GroupDocs[] groups = result.groups; Arrays.sort(groups, new Comparator<GroupDocs>() { @Override public int compare(GroupDocs o1, GroupDocs o2) { if (o1.maxScore > o2.maxScore) return -1; if (o1.maxScore < o2.maxScore) return 1; return 0; }}); } This will reorder the groups if we re-rank the documents with the rank query. The second test succeeds. I'm still thinking what it should be the correct semantic to implement reranking + grouping: When you apply a query q and then a rank-query rq , you first score all the documents and then rescore top-N documents with the rank-query. The problem with grouping is that in order to get the top-groups you first need to score the collection: you may have a document that scored really low with q but got a high score with rq , but the only way to find it is to rerank the whole collection (impracticable). There are two possible solutions then: if we want to apply rq on the top 1000 documents, we can collect the groups in the top-1000 documents, and they will be the same obtained scoring directly with rq , but in a different order; we can collect more groups than what we need, and then rerank the top documents in each group - I would call this solution: * Group Reranking *. In my opinion group reranking is a better solution: imagine we have a group containing the top-1000 documents ranked with q we will rerank them maybe just to return one document. I guess the best would be, assuming that we want to apply rerank query to N documents and return the top K groups you can retrieve top K*y groups and then rerank N/(K*y) documents in each group.
          Hide
          diegoceccarelli Diego Ceccarelli added a comment -

          Update: I found a way to change the behavior of the collectors without moving RankQuery into Lucene. This new patch performs the group reranking without changing Lucene. The only difference is that if the Query is a RankQuery instead of using TermSecondPassGroupingCollector I'll use a RerankTermSecondPassGroupingCollector. The new collector will scan the groups collectors and wrap them in 'rerank collectors':

          
          for (SearchGroup<BytesRef> group : groups) {
          if (query != null) {
                    collector = groupMap.get(group.groupValue).collector;
                    collector = query.getTopDocsCollector(collector, groupSort, searcher);
                    groupMap.put(group.groupValue, new SearchGroupDocs<BytesRef>(group.groupValue, collector));
                  }
          }
          
          Show
          diegoceccarelli Diego Ceccarelli added a comment - Update: I found a way to change the behavior of the collectors without moving RankQuery into Lucene. This new patch performs the group reranking without changing Lucene. The only difference is that if the Query is a RankQuery instead of using TermSecondPassGroupingCollector I'll use a RerankTermSecondPassGroupingCollector . The new collector will scan the groups collectors and wrap them in 'rerank collectors': for (SearchGroup<BytesRef> group : groups) { if (query != null ) { collector = groupMap.get(group.groupValue).collector; collector = query.getTopDocsCollector(collector, groupSort, searcher); groupMap.put(group.groupValue, new SearchGroupDocs<BytesRef>(group.groupValue, collector)); } }
          Hide
          aanilpala Ahmet Anil Pala added a comment -

          Hi,

          Which branch is this patch compatible with? I've tried branch_6x and branch_6_0. Although, the patch was successfully applied, it prevented the source from compiling.

          Show
          aanilpala Ahmet Anil Pala added a comment - Hi, Which branch is this patch compatible with? I've tried branch_6x and branch_6_0. Although, the patch was successfully applied, it prevented the source from compiling.
          Hide
          diegoceccarelli Diego Ceccarelli added a comment - - edited

          Thanks Ahmet Anil Pala, a file was missing in the patch, I just submitted a new patch with the missing file, and I tested it on the latest upstream version (last commit 268da5be4), please do not hesitate to contact me if you have comments

          Show
          diegoceccarelli Diego Ceccarelli added a comment - - edited Thanks Ahmet Anil Pala , a file was missing in the patch, I just submitted a new patch with the missing file, and I tested it on the latest upstream version (last commit 268da5be4), please do not hesitate to contact me if you have comments
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user diegoceccarelli opened a pull request:

          https://github.com/apache/lucene-solr/pull/162

          SOLR-8776: Support RankQuery in grouping

          Update SOLR-8776 to current master

          • Reranking and grouping work together in non-distributed setting when grouping is done by field
          • Still have to fix for distribute setting and for grouping based on the unique values of a function query.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/bloomberg/lucene-solr master-solr-8776

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/lucene-solr/pull/162.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #162


          commit cd33172184c3889dfe95c631e7c30729f1c752a3
          Author: diego <diego.ceccarelli@gmail.com>
          Date: 2017-02-28T10:28:32Z

          SOLR-8776: Support RankQuery in grouping


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user diegoceccarelli opened a pull request: https://github.com/apache/lucene-solr/pull/162 SOLR-8776 : Support RankQuery in grouping Update SOLR-8776 to current master Reranking and grouping work together in non-distributed setting when grouping is done by field Still have to fix for distribute setting and for grouping based on the unique values of a function query. You can merge this pull request into a Git repository by running: $ git pull https://github.com/bloomberg/lucene-solr master-solr-8776 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/162.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #162 commit cd33172184c3889dfe95c631e7c30729f1c752a3 Author: diego <diego.ceccarelli@gmail.com> Date: 2017-02-28T10:28:32Z SOLR-8776 : Support RankQuery in grouping
          Hide
          diegoceccarelli Diego Ceccarelli added a comment -

          Hi all, I updated the PR (https://github.com/apache/lucene-solr/pull/162) now it supports reranking by field and by function query, and reranking by field in distribute setting - I have working tests for all the use cases.
          There are still some details to fix:

          • I had to remove final from GroupDocs::maxScore and GroupDocs::score (I can easily fix this I think)
          • Rerank(Function|Group)SecondPassGroupingCollector have the number of documents to rerank hardcoded in the class because RankQuery doesn't expose that value (I think it should)

          Joel Bernstein, Martijn van Groningen, Alan Woodward any feedback?

          Show
          diegoceccarelli Diego Ceccarelli added a comment - Hi all, I updated the PR ( https://github.com/apache/lucene-solr/pull/162 ) now it supports reranking by field and by function query, and reranking by field in distribute setting - I have working tests for all the use cases. There are still some details to fix: I had to remove final from GroupDocs::maxScore and GroupDocs::score (I can easily fix this I think) Rerank(Function|Group)SecondPassGroupingCollector have the number of documents to rerank hardcoded in the class because RankQuery doesn't expose that value (I think it should) Joel Bernstein , Martijn van Groningen , Alan Woodward any feedback?
          Hide
          diegoceccarelli Diego Ceccarelli added a comment - - edited

          Hi all, I updated the PR (https://github.com/apache/lucene-solr/pull/162), highlights:

          Alan Woodward,Martijn van Groningen now the patch relies on the new grouping code I had to add a new protected constructor to TopGroupsCollector to inject my own GroupReducer. Could you please take a look at let me know if it makes sense? also in SecondPassGroupingCollector#L54

          SecondPassGroupingCollector.java
          public SecondPassGroupingCollector(GroupSelector<T> groupSelector, Collection<SearchGroup<T>> groups, GroupReducer<T, ?> reducer) {
          
              //System.out.println("SP init");
              //Do we want to check if groups is null here? instead of checking at line 62?
              if (groups.isEmpty()) {
                throw new IllegalArgumentException("no groups to collect (groups is empty)");
              }
          
              this.groupSelector = Objects.requireNonNull(groupSelector);
              this.groupSelector.setGroups(groups);
              this.groups = Objects.requireNonNull(groups);
          

          I would check if groups != null before groups.isEmpty().

          2. I changed the logic to rerank groups and not only documents: for example if a user ask to rerank the top 100 documents: q=greetings&rows=10&rq={!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}&rqq=(hi+hello+hey+hiya):

          • the top 100 groups matching greeting are retrieved;
          • top 100 groups are reranked by rqq;
          • finally the top 10 reranked groups are returned;
          • inside each group documents will be reranked as well.
            (it's worth to note that for simplicity, in distribute mode first pass will retrieve the top 100 groups from all the shards, the federator will compute the top 100 groups and send them to the shards to get the reranking scores, and finally the federator will return the top 10)

          IMO the patch is now complete and I've working unit tests. Please, can someone review it?

          Show
          diegoceccarelli Diego Ceccarelli added a comment - - edited Hi all, I updated the PR ( https://github.com/apache/lucene-solr/pull/162 ), highlights: Alan Woodward , Martijn van Groningen now the patch relies on the new grouping code I had to add a new protected constructor to TopGroupsCollector to inject my own GroupReducer . Could you please take a look at let me know if it makes sense? also in SecondPassGroupingCollector#L54 SecondPassGroupingCollector.java public SecondPassGroupingCollector(GroupSelector<T> groupSelector, Collection<SearchGroup<T>> groups, GroupReducer<T, ?> reducer) { // System .out.println( "SP init" ); //Do we want to check if groups is null here? instead of checking at line 62? if (groups.isEmpty()) { throw new IllegalArgumentException( "no groups to collect (groups is empty)" ); } this .groupSelector = Objects.requireNonNull(groupSelector); this .groupSelector.setGroups(groups); this .groups = Objects.requireNonNull(groups); I would check if groups != null before groups.isEmpty() . 2. I changed the logic to rerank groups and not only documents: for example if a user ask to rerank the top 100 documents: q=greetings&rows=10&rq={!rerank reRankQuery=$rqq reRankDocs=100 reRankWeight=3}&rqq=(hi+hello+hey+hiya) : the top 100 groups matching greeting are retrieved; top 100 groups are reranked by rqq ; finally the top 10 reranked groups are returned; inside each group documents will be reranked as well. (it's worth to note that for simplicity, in distribute mode first pass will retrieve the top 100 groups from all the shards, the federator will compute the top 100 groups and send them to the shards to get the reranking scores, and finally the federator will return the top 10) IMO the patch is now complete and I've working unit tests. Please, can someone review it?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user romseygeek commented on a diff in the pull request:

          https://github.com/apache/lucene-solr/pull/162#discussion_r117773779

          — Diff: lucene/grouping/src/java/org/apache/lucene/search/grouping/TopGroupsCollector.java —
          @@ -39,6 +39,14 @@
          private final Sort withinGroupSort;
          private final int maxDocsPerGroup;

          + protected TopGroupsCollector(GroupReducer<T, TopDocsCollector<?>> groupReducer, GroupSelector<T> groupSelector, Collection<SearchGroup<T>> groups, Sort groupSort, Sort withinGroupSort,
          + int maxDocsPerGroup, boolean getScores, boolean getMaxScores, boolean fillSortFields) {
          + super(groupSelector, groups, groupReducer);
          — End diff –

          You can remove a few of these parameters now, I think? They're for the Reducer, and as such don't need to be passed separately.

          Show
          githubbot ASF GitHub Bot added a comment - Github user romseygeek commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/162#discussion_r117773779 — Diff: lucene/grouping/src/java/org/apache/lucene/search/grouping/TopGroupsCollector.java — @@ -39,6 +39,14 @@ private final Sort withinGroupSort; private final int maxDocsPerGroup; + protected TopGroupsCollector(GroupReducer<T, TopDocsCollector<?>> groupReducer, GroupSelector<T> groupSelector, Collection<SearchGroup<T>> groups, Sort groupSort, Sort withinGroupSort, + int maxDocsPerGroup, boolean getScores, boolean getMaxScores, boolean fillSortFields) { + super(groupSelector, groups, groupReducer); — End diff – You can remove a few of these parameters now, I think? They're for the Reducer, and as such don't need to be passed separately.

            People

            • Assignee:
              Unassigned
              Reporter:
              diegoceccarelli Diego Ceccarelli
            • Votes:
              1 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:

                Development