SOLR-3109

group=true requests result in numerous redundant shard requests

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.5, 4.0-ALPHA
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: search
    • Labels:
    • Environment:

      64-bit Linux, sharded environment

      Description

      During the second phase of a group query, the collator sends a query to each of the shards. The purpose of this query is for shards to respond with the doc ids that match the set of group ids returned from the first phase. The problem is that it sends this second query to each shard multiple times. Specifically, in an environment with n shards, each shard will be hit with an identical query n times during the second phase of query processing, resulting in O(n²) performance where n is the number of shards.

      I have traced this bug down to a single line in TopGroupsShardRequestFactory.java, and I am attaching a patch.
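The O(n²) fan-out described above can be illustrated with a small, self-contained sketch. The class and field names below are simplified stand-ins for Solr's ShardRequest dispatch, not the actual types: a per-shard request whose target-shard list is left null gets broadcast to every shard, so n requests become n × n dispatches.

```java
// Hypothetical model of the dispatch logic (illustrative, not Solr source).
class FanoutSketch {
    static class ShardRequest {
        String[] shards; // null means "all shards" to the dispatcher
    }

    // Build one second-phase request per shard; restrictToShard mimics the fix.
    static int countDispatches(String[] allShards, boolean restrictToShard) {
        int dispatched = 0;
        for (String shard : allShards) {
            ShardRequest sreq = new ShardRequest();
            if (restrictToShard) {
                sreq.shards = new String[] { shard }; // target only this shard
            }
            // Dispatcher fallback: a null target list means every shard.
            String[] targets = (sreq.shards == null) ? allShards : sreq.shards;
            dispatched += targets.length;
        }
        return dispatched;
    }

    public static void main(String[] args) {
        String[] shards = { "shard1", "shard2", "shard3", "shard4" };
        System.out.println(countDispatches(shards, false)); // buggy: 16 = n^2
        System.out.println(countDispatches(shards, true));  // fixed: 4 = n
    }
}
```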

      1. SOLR-3109-lucene_solr_3_5.patch
        10 kB
        Russell Black
      2. SOLR-3109-Backport-of-grouping-performace-fix-to-3.x.patch
        12 kB
        Greg Bowyer
      3. SOLR-3109.patch
        1 kB
        Russell Black
      4. SOLR-3109.patch
        8 kB
        Martijn van Groningen
      5. SOLR-3109.patch
        11 kB
        Russell Black

        Activity

        Russell Black added a comment -

        The patch changes this line of code in TopGroupsShardRequestFactory.java:

        sreq.actualShards = new String[] {shard};
        

        becomes

        sreq.shards = new String[] {shard};
        

        To see why this was a problem, look at SearchHandler.java line 249:

        sreq.actualShards = sreq.shards; // sets actualShards to null
        if (sreq.actualShards == ShardRequest.ALL_SHARDS) { // ALL_SHARDS is null
          sreq.actualShards = rb.shards; // every shard!
        }
        
        

        This sets actualShards to null, which means send the request to every shard.

        Martijn van Groningen added a comment -

        That doesn't look good... Thanks for bringing this up!

        Martijn van Groningen added a comment - - edited

        I noticed that the distributed test failed with this patch. After some digging I found out that the TopGroupsShardResponseProcessor can't really deal with multiple ShardRequests... I've updated the patch so that only one ShardRequest is created by the TopGroupsShardRequestFactory. Test passes now and I don't see any redundant real http requests being generated.

        Russell can you confirm this as well?

        Russell Black added a comment - - edited

        Martijn, I also noticed that TopGroupsShardResponseProcessor can't deal with multiple ShardRequests (although it looks like it wouldn't be too hard to add this ability). At any rate, your approach of returning a single ShardRequest containing all relevant shards sounds like the right one. I went one step further and refactored TopGroupsShardRequestFactory.java because there was significant code duplication in the class's two primary methods.

        In my testing I also discovered a closely related problem. The bug is in the data structure used to map search groups to the shards which contain them. ResponseBuilder.searchGroupToShard assumes that a given search group only resides on one shard. I could not find this assumption documented anywhere, nor can I find a reason such a restriction need be imposed. This structure is populated by SearchGroupShardResponseProcessor. There is a race condition there, wherein the last shard to report a search group will be assumed to be the only shard containing the search group. This data structure is used in TopGroupsShardRequestFactory.createRequestForSpecificShards() to know which shards to query. This means you can get a different set of shards to query depending on shard query order.

        I have changed the structure to allow a search group to be present in multiple shards.

        Patch to follow.
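The data-structure change described above can be sketched with hypothetical simplified types (illustrative only, not Solr's actual classes): the original group-to-shard map silently drops earlier shards when a later response reports the same group (last writer wins), while a group-to-set-of-shards map retains every shard that reported the group.

```java
import java.util.*;

// Hypothetical sketch of the mapping change; names are illustrative.
class GroupToShardSketch {
    // Old shape: one shard per group; later responses overwrite earlier ones.
    static Map<String, String> lastWriterWins(String[][] responses) {
        Map<String, String> m = new HashMap<>();
        for (String[] r : responses) m.put(r[1], r[0]); // r = {shard, group}
        return m;
    }

    // New shape: every shard that reported the group is retained.
    static Map<String, Set<String>> allShards(String[][] responses) {
        Map<String, Set<String>> m = new HashMap<>();
        for (String[] r : responses)
            m.computeIfAbsent(r[1], k -> new TreeSet<>()).add(r[0]);
        return m;
    }

    public static void main(String[] args) {
        // Both shards contain group "g1"; shard2's response arrives last.
        String[][] responses = { { "shard1", "g1" }, { "shard2", "g1" } };
        System.out.println(lastWriterWins(responses).get("g1")); // shard2
        System.out.println(allShards(responses).get("g1"));      // [shard1, shard2]
    }
}
```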

        Martijn van Groningen added a comment - - edited

        Thanks for the refactoring!

        The bug is in the data structure used to map search groups to the shards which contain them. ResponseBuilder.searchGroupToShard assumes that a given search group only resides on one shard. I could not find this assumption documented anywhere, nor can I find a reason such a restriction need be imposed.

        There is no such restriction. A search group can reside on more than one shard. I wonder why this issue didn't result in test failures / bugs from the beginning. I guess that because of the redundant requests all shards were queried, and that way the end result was still correct. At least the latest patch I added should have resulted in a test failure, but it didn't. Can you share how you did this testing? This can then be added to the TestDistributedGrouping test class.

        Russell Black added a comment - - edited

        The current TestDistributedGrouping test case is constructed in such a way that each record has a unique value for its search group field (i1), so that there is never more than one record in any given search group. This style of indexing conforms to the restriction discussed earlier. This is likely the reason there were no test failures.

        Martijn van Groningen added a comment -

        Yes, that might be the reason. The TestDistributedGrouping needs to be changed, so that a search group contains multiple records.

        Russell Black added a comment -

        In TestDistributedGrouping you wrote the following comment:

         // In order to validate this we need to make sure that during indexing that all documents of one group only occur on the same shard
        

        I wanted to understand the reason for that comment before making any changes to the test case. (Assuming you wanted me to update the test case – if not, I'll leave the test case in your hands)

        Martijn van Groningen added a comment -

        No worries, I didn't mean to push this work onto anyone. Just wanted to say that the test needs to be updated.

        I put that comment there because in the following three lines the group.ngroups and group.truncate features are tested. These features only work properly if documents belonging to a group reside in the same shard. If documents belonging to a group occur in more than one shard, the results are very likely incorrect.

        Tomorrow I will update the test case and get this patch committed. If you want to update the test case and have time for that, that would be great!
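The co-location requirement mentioned above can be sketched as a hypothetical routing rule (illustrative only, not Solr's actual indexing code): if each document is sent to a shard chosen by hashing its group field, all documents of one group are guaranteed to land on the same shard, which is the precondition for group.ngroups and group.truncate to be accurate.

```java
// Hedged sketch of hash-based co-location routing; names are illustrative.
class GroupRoutingSketch {
    static int shardFor(String groupValue, int numShards) {
        // Math.floorMod keeps the index non-negative for negative hash codes.
        return Math.floorMod(groupValue.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int n = 4;
        // Two documents sharing a group value always map to the same shard.
        System.out.println(shardFor("groupA", n) == shardFor("groupA", n)); // true
    }
}
```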

        Russell Black added a comment -

        I'll let you do the test case, as I don't have a lot of time to spend on this. If there is a possibility of another 3.x release, I would like to backport the patch to the 3x branch as well. Let me know, and I can create the 3x backport once you have updated the test case and have your final 4.0 patch. I have already created a 3_5 backport that we will be using internally until the next release.

        Martijn van Groningen added a comment -

        You can share your 3.5 patch as you have it now. I think that applying the test changes to 3x branch isn't much effort.
        As far as I know there will be a 3.6 release, so this bug fix will also be included in that release.

        Martijn van Groningen added a comment -

        Russell, when attaching a patch can you click on the option box with the label:
        Grant license to ASF for inclusion in ASF works

        Assuming that you want to include your bug fixes in Solr.

        Cody Young added a comment -

        The bug is in the data structure used to map search groups to the shards which contain them. ResponseBuilder.searchGroupToShard assumes that a given search group only resides on one shard. I could not find this assumption documented anywhere, nor can I find a reason such a restriction need be imposed.

        Perhaps this could be left in as an advanced option. It would be a performance boost for anyone who can guarantee that a group will reside wholly on a single shard.

        group.distributeGroupCollation=true|false (defaults to true)

        Russell Black added a comment - - edited

        Perhaps this could be left in as an advanced option. It would be a performance boost for anyone who can guarantee that a group will reside wholly on a single shard.

        group.distributeGroupCollation=true|false defaults to true

        As the patch currently stands, someone who can guarantee that a group will reside wholly on a single shard will benefit already, because it will only send the query to the shard that contains the group of interest. There would be no need for a separate advanced option. I simply made the data structure allow for the possibility of having multiple shards per group, but there is no additional overhead for the single-shard case.

        Greg Bowyer added a comment -

        Since I need to test this to see if it is responsible for my large profiler costs spent in scoring and grouping, I also backported this.

        Patch attached that does a backport to Solr 3.5+.

        If my patch is terrifying please scream at me and replace it with a better one, but I figure it will be much like the one already commented on.

        Russell Black added a comment - - edited

        Greg, I had some trouble applying your patch to my code base, although visually it looks like the right changes. Is your patch intended for the 3_5 branch or 3x branch? I have attached my own version of the patch. It is a patch against the 3.5 branch (http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_5/).

        Greg Bowyer added a comment -

        It's for the 3.5 branch, but like I said I was jumping the gun; if your patch applies, forget mine and go with that.

        Russell Black added a comment -

        Re-uploaded the same 4.0 patch as before, this time with "Grant license to ASF" checked.

        Martijn van Groningen added a comment -

        The current TestDistributedGrouping test case is constructed in such a way that each record has a unique value for its search group field (i1), so that there is never more than one record in any given search group. This style of indexing conforms to the restriction discussed earlier. This is likely the reason there were no test failures.

        I think this issue doesn't exist in the released versions of Solr / 4.0-dev. Due to the bug, all shards were queried for each ShardRequest instance, so all the matching top search groups still arrived at the right shard. Only after applying the changes to TopGroupsShardRequestFactory could I make the distributed grouping test fail.

        Martijn van Groningen added a comment -

        Committed to branch3x and trunk.
        Thanks Russell and Greg for reporting and fixing this issue!

        Russell Black added a comment -

        Thanks for the quick turnaround on this! It was fun to contribute.


          People

          • Assignee:
            Martijn van Groningen
            Reporter:
            Russell Black
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue
