Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Ability to specify functions over group documents when sorting groups. max(score) or avg(popularity), etc.

      1. SOLR-2072-01.patch
        20 kB
        George P. Stathis

        Activity

        Yonik Seeley added a comment -

        "This works when running a single non-distributed instance of Solr, but when starting Solr with numShards >= 1 it behaves differently."

        Ah, so it is an existing feature, but a bug in distributed mode. It should go to a new issue (if there isn't one already).

        Bryan Bende added a comment -

        According to that documentation "sort=popularity desc will cause the groups to be sorted according to the highest popularity doc in each group".

        What I want to do is something like "sort=popularity asc, group.field=label, group.sort=popularity desc, group.limit=1" so essentially the most popular document in each group, but in least popular order across the groups. This works when running a single non-distributed instance of Solr, but when starting Solr with numShards >= 1 it behaves differently.

        I wrote up a description of the problem on the mailing list, but no one provided any feedback:
        http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201306.mbox/%3CCALo_M18WVoLKvepJMu0wXk_x2H8cv3UaX9RQYtEh4-mksQHLBA%40mail.gmail.com%3E
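
The requested semantics can be sketched in plain Java. This is only an in-memory illustration of "most popular document in each group, groups ordered least popular first"; it is not Solr code, and the class, method, and field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; none of these names come from Solr.
final class RepresentativeGroupSort {
    static final class Doc {
        final String id;
        final String label;
        final int popularity;

        Doc(String id, String label, int popularity) {
            this.id = id;
            this.label = label;
            this.popularity = popularity;
        }
    }

    // Mimics: group.field=label, group.sort=popularity desc, group.limit=1,
    // then sort=popularity asc applied to the one document kept per group.
    static List<Doc> topDocPerGroupLeastPopularFirst(List<Doc> docs) {
        Map<String, Doc> best = new LinkedHashMap<>();
        for (Doc d : docs) {
            // Keep the most popular document seen so far for each label.
            best.merge(d.label, d, (a, b) -> b.popularity > a.popularity ? b : a);
        }
        List<Doc> heads = new ArrayList<>(best.values());
        // Order the groups by their representative document, least popular first.
        heads.sort(Comparator.comparingInt((Doc d) -> d.popularity));
        return heads;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("1", "a", 10), new Doc("2", "a", 3),
                new Doc("3", "b", 7), new Doc("4", "b", 1),
                new Doc("5", "c", 2));
        for (Doc d : topDocPerGroupLeastPopularFirst(docs)) {
            System.out.println(d.label + " -> doc " + d.id);
        }
    }
}
```

With the sample data in main, the representatives are doc 1 (a, 10), doc 3 (b, 7) and doc 5 (c, 2), so the groups come out in the order c, b, a.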

        Yonik Seeley added a comment -

        "I would like to be able to use sort and group.sort together such that the group.sort is applied within the group first and the first document of each group is then used as the representative document to perform the overall sorting of groups."

        That is how things currently work:
        http://wiki.apache.org/solr/FieldCollapsing#Request_Parameters

        Bryan Bende added a comment -

        Is there any further work being done on this issue?

        I'm looking for something similar, not necessarily aggregating a score, but I would like to be able to use sort and group.sort together such that the group.sort is applied within the group first and the first document of each group is then used as the representative document to perform the overall sorting of groups.

        dave shor added a comment - - edited

        This would be a much appreciated feature.
        What other workarounds have people tried in the interim?

        Martijn van Groningen added a comment -

        Nice prototype! Good idea to use a FieldComparator/Source that sums the score.

        I think it is better not to extend AbstractFirstPassGroupingCollector, and instead to move the code you need into DocScoresFirstPassGroupingCollector. The AbstractFirstPassGroupingCollector and its other subclasses only keep the top N groups. That is an important difference compared to DocScoresFirstPassGroupingCollector.

        Another implementation idea that I like is to create a collector that is responsible for filling the FieldComparator/Source (in this case with the sums of score per group, but it might be something different). This collector would create the Sort that the first pass grouping collector uses. With this approach existing code doesn't have to be changed; the logic is in the FieldComparator. The downside is that when a user wants to sort by sum(score), three searches are executed per request.
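
A rough sketch of that two-stage idea, in plain Java rather than against the Lucene collector or FieldComparator API: one stage visits every hit and accumulates a score sum per group, and a second stage builds the group ordering from those sums. All names here are hypothetical, and the parallel lists stand in for whatever the real collector would see.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain Java, not the Lucene grouping API.
final class SumScoreGroupOrder {
    // Stage 1: visit every hit once and accumulate the score sum for its group.
    static Map<String, Double> sumScores(List<String> groupOfHit, List<Double> scoreOfHit) {
        Map<String, Double> sums = new HashMap<>();
        for (int i = 0; i < groupOfHit.size(); i++) {
            sums.merge(groupOfHit.get(i), scoreOfHit.get(i), Double::sum);
        }
        return sums;
    }

    // Stage 2: order groups by the precomputed sums, highest first,
    // breaking ties by group name so the result is deterministic.
    static Comparator<String> bySumDescending(Map<String, Double> sums) {
        return Comparator.comparingDouble((String g) -> sums.get(g)).reversed()
                .thenComparing(Comparator.<String>naturalOrder());
    }

    // Returns the names of the top n groups by summed score.
    static List<String> topGroups(List<String> groupOfHit, List<Double> scoreOfHit, int n) {
        Map<String, Double> sums = sumScores(groupOfHit, scoreOfHit);
        List<String> groups = new ArrayList<>(sums.keySet());
        groups.sort(bySumDescending(sums));
        return new ArrayList<>(groups.subList(0, Math.min(n, groups.size())));
    }
}
```

Note that the sum map holds an entry for every group, not just the top N, which is the extra cost this approach carries.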

        George P. Stathis added a comment -

        Attaching a first prototype pass with a unit test to get the conversation going. I'm pretty sure this is not going to win any performance contests but it's still worth sparking some discussion around it. The prototype shows an implementation of a sum(scores) group sort. The approach basically relaxes the visibility of some AbstractFirstPassGroupingCollector properties so that sub-classes can do more with the state encapsulated in the abstract class. A new collector is introduced based on TermFirstPassGroupingCollector but implementing its own collect method. A specialized comparator is used to sum up the document scores for each group.

        Looking forward to some feedback.

        George P. Stathis added a comment -

        My progress so far has been in terms of climbing the learning curve. I had to do some reading on the Lucene basics first (Lucene in Action is a great book BTW). I came to the same conclusion as Martijn, which is to create a specialized implementation of an AbstractFirstPassGroupingCollector subclass. I had to pause my efforts to address some other internal sprint work but I should be tackling this again this week. My first goal is to hack a collector that can sort by the sum of the document scores in a group and learn from it for the more generalized cases.

        Martijn van Groningen added a comment -

        Currently the AbstractFirstPassGroupingCollector uses the most relevant document of a group to determine the order of a group in the result set. The values for the most relevant document are either fetched from the score or field cache (and possibly also doc values in 4.x).

        I think it is best to add more first pass grouping collectors that use different ways to order groups. Implementations could do this based on the average group score or number of hits per group. I do think that these implementations will be more expensive. Sorting by hit count or avg score requires the collector to keep track of all groups. Currently AbstractFirstPassGroupingCollector only keeps track of the top N groups.

        yuval dotan added a comment -

        Hi George
        did you make any progress with the group sort by sum or avg?
        Thanks

        George P. Stathis added a comment -

        Still not clear on where to start and probably stating the obvious for folks familiar with the code: it looks like the assumption is made within AbstractFirstPassGroupingCollector and AbstractSecondPassGroupingCollector about using the highest scoring doc within each group. Perhaps using a different TopScoreDocCollector and augmenting TopDocs to account for scoring stats (e.g. sum of scores etc) might be a place to start. Am I getting warmer here or completely off?

        George P. Stathis added a comment -

        Thanks Erick. That's a good enough start for me. I'll also look at the patches attached to SOLR-1297 since it's referenced in this ticket.

        Erick Erickson added a comment -

        George:

        This isn't going to be all that much help, but you might try stepping through the test cases to get a feel for how grouping works, see: TestGroupingSearch for instance. And the files referenced in some of the patches might also be useful, e.g. SOLR-2072.

        Warning: I know very little to nothing about the code in question, but that's how I'd start to get a feel for it, hopefully enough to propose an approach...

        George P. Stathis added a comment - - edited

        Is this still on the radar? I'm interested in this, so I'd love to talk to someone about contributing. I'm not familiar with the code base yet, so any pointers on where to look in the code would be appreciated.

        Koji Sekiguchi added a comment -

        I see. I read that "group sort" was group.sort.

        Do we need a new parameter other than sort, to trigger the feature?

        Yonik Seeley added a comment -

        Right now, the sort value for a group (that governs how whole groups sort relative to each other) depends only on the top document in that group. This issue is about a different type of function that can derive a group sort value from something else. One example is being able to sort groups based on the average score in that group rather than just the top score. Another example is being able to sort by the number of hits in each group.
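
To make the difference concrete, here is a small plain-Java sketch that derives three possible group sort values (top score, average score, hit count) from the same hits and orders the groups by each. It is only an illustration under assumed names; it does not use or describe the Solr or Lucene grouping classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the class and method names are not part of Solr.
final class GroupSortValues {
    static final class Stats {
        final String group;
        int hits;
        double sum;
        double max = Double.NEGATIVE_INFINITY;

        Stats(String group) { this.group = group; }

        void add(double score) {
            hits++;
            sum += score;
            max = Math.max(max, score);
        }

        double avg() { return sum / hits; }
    }

    static final Comparator<Stats> BY_TOP_SCORE = Comparator.comparingDouble((Stats s) -> s.max);
    static final Comparator<Stats> BY_AVG_SCORE = Comparator.comparingDouble(Stats::avg);
    static final Comparator<Stats> BY_HIT_COUNT = Comparator.comparingInt((Stats s) -> s.hits);

    // Accumulate per-group statistics; every group is kept, not just the top N.
    static Map<String, Stats> collect(String[] groups, double[] scores) {
        Map<String, Stats> byGroup = new HashMap<>();
        for (int i = 0; i < groups.length; i++) {
            byGroup.computeIfAbsent(groups[i], Stats::new).add(scores[i]);
        }
        return byGroup;
    }

    // Order group names by the chosen sort value, highest first, ties by name.
    static List<String> order(Map<String, Stats> byGroup, Comparator<Stats> sortValue) {
        List<Stats> all = new ArrayList<>(byGroup.values());
        all.sort(sortValue.reversed().thenComparing((Stats s) -> s.group));
        List<String> names = new ArrayList<>();
        for (Stats s : all) {
            names.add(s.group);
        }
        return names;
    }
}
```

For hits a: 0.875, 0.125; b: 0.625, 0.625; c: 0.25, 0.25, 0.25, the top-score order is a, b, c, the average-score order is b, a, c, and the hit-count order is c, a, b.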

        Koji Sekiguchi added a comment -

        Uh, I see SOLR-1297 is reopened. Do you mean this issue depends on SOLR-1297?

        Koji Sekiguchi added a comment -

        I don't understand this... I think you have already implemented the ability?

        String groupSortStr = params.get(GroupParams.GROUP_SORT);
        Sort groupSort = groupSortStr != null ? QueryParsing.parseSort(groupSortStr, req.getSchema()) : null;
        

        Doesn't QueryParsing.parseSort() support sort by function?


          People

          • Assignee:
            Unassigned
            Reporter:
            Yonik Seeley
          • Votes:
            11
            Watchers:
            16
