Solr
  1. Solr
  2. SOLR-7566

Search requests should return the shard name that is down

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: 5.1
    • Fix Version/s: 5.3, 6.0
    • Component/s: search, SolrCloud
    • Labels:
      None

      Description

      If no replicas of a shard are up and running, a search request gives the following response:

      {
        "responseHeader": {
          "status": 503,
          "QTime": 2,
          "params": {
            "q": "*:*",
            "indent": "true",
            "wt": "json",
            "_": "1432048084930"
          }
        },
        "error": {
          "msg": "no servers hosting shard: ",
          "code": 503
        }
      }
      

      The message should mention the shard which is down/unreachable.

      1. SOLR-7566.patch
        4 kB
        Shalin Shekhar Mangar
      2. SOLR-7566.patch
        1 kB
        Marius Grama
      3. SOLR-7566.patch
        0.9 kB
        Marius Grama
      4. SOLR-7566-test-code-cosmetics.patch
        1 kB
        Marius Grama

        Issue Links

          Activity

          Hide
          Marius Grama added a comment -

          Shalin Shekhar Mangar could you please give me some tips on how to reproduce this issue?

          Show
          Marius Grama added a comment - Shalin Shekhar Mangar could you please give me some tips on how to reproduce this issue?
          Hide
          Shalin Shekhar Mangar added a comment -
          1. Create a two node solrcloud cluster
          2. create a collection with numShards=2&replication factor=1&maxShardsPerNode=1
          3. shutdown one node
          4. run a search query using the collection name on the running node

          The request will fail with the error in the description.

          Show
          Shalin Shekhar Mangar added a comment - Create a two node solrcloud cluster create a collection with numShards=2&replication factor=1&maxShardsPerNode=1 shutdown one node run a search query using the collection name on the running node The request will fail with the error in the description.
          Hide
          Marius Grama added a comment - - edited

          Shalin Shekhar Mangar thank you. I could reproduce the issue and I have also found the cause of it.
          When doing a distributed search, the available shards are taken from the cluster state and are joined together (HttpShardHandler#checkDistributed(ResponseBuilder) method)

          HttpShardHandler#checkDistributed
          // ...
                      StringBuilder sliceShardsStr = new StringBuilder();
                      for (Replica replica : sliceShards.values()) {
                        if (!clusterState.liveNodesContain(replica.getNodeName())
                            || replica.getState() != Replica.State.ACTIVE) {
                          continue;
                        }
                        if (first) {
                          first = false;
                        } else {
                          sliceShardsStr.append('|');
                        }
                        String url = ZkCoreNodeProps.getCoreUrl(replica);
                        sliceShardsStr.append(url);
                      }
          
                      rb.shards[i] = sliceShardsStr.toString();
          

          In the case when the replicas for a shard are not available, the string corresponding to the shard addresses will remain empty.

          In the SearchHandler#handleRequestBody method, the empty shard will be simply forwarded to the HttpShardHandler to be evaluated asynchronously :
          SearchHandler.java line 352

          shardHandler1.submit(sreq, shard, params);
          

          and in the HttpShardHandler#submit() method will be thrown the exception with an inconsistent message because the shard is empty.

          HttpShardHandler#submit
                    // if there are no shards available for a slice, urls.size()==0
                    if (urls.size()==0) {
                      // TODO: what's the right error code here? We should use the same thing when
                      // all of the servers for a shard are down.
                      throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "no servers hosting shard: " + shard);
                    }
          

          One solution would be to throw the SolrException within the code of HttpShardHandler#checkDistributed method when the sliceShardsStr StringBuilder is empty. This seems to me the easy way to handle this situation.

          Can somebody give me feedback whether I am on the right track here? Thanks in advance.

          Show
          Marius Grama added a comment - - edited Shalin Shekhar Mangar thank you. I could reproduce the issue and I have also found the cause of it. When doing a distributed search, the available shards are taken from the cluster state and are joined together (HttpShardHandler#checkDistributed(ResponseBuilder) method) HttpShardHandler#checkDistributed // ... StringBuilder sliceShardsStr = new StringBuilder(); for (Replica replica : sliceShards.values()) { if (!clusterState.liveNodesContain(replica.getNodeName()) || replica.getState() != Replica.State.ACTIVE) { continue ; } if (first) { first = false ; } else { sliceShardsStr.append('|'); } String url = ZkCoreNodeProps.getCoreUrl(replica); sliceShardsStr.append(url); } rb.shards[i] = sliceShardsStr.toString(); In the case when the replicas for a shard are not available, the string corresponding to the shard addresses will remain empty. In the SearchHandler#handleRequestBody method, the empty shard will be simply forwarded to the HttpShardHandler to be evaluated asynchronously : SearchHandler.java line 352 shardHandler1.submit(sreq, shard, params); and in the HttpShardHandler#submit() method will be thrown the exception with an inconsistent message because the shard is empty. HttpShardHandler#submit // if there are no shards available for a slice, urls.size()==0 if (urls.size()==0) { // TODO: what's the right error code here? We should use the same thing when // all of the servers for a shard are down. throw new SolrException(SolrException.ErrorCode.SERVICE_UNAVAILABLE, "no servers hosting shard: " + shard); } One solution would be to throw the SolrException within the code of HttpShardHandler#checkDistributed method when the sliceShardsStr StringBuilder is empty. This seems to me the easy way to handle this situation. Can somebody give me feedback whether I am on the right track here? Thanks in advance.
          Hide
          Marius Grama added a comment - - edited

          Attached a small patch that could possibly be used as solution for this issue.

          I think throwing of the exception from HttpShardHandler#submit() method (mentioned in the above comment) should not be kept anymore in its current form. When there are no replica urls available there's no more need to submit the callable function.

          HttpShardHandler#submit
          final List<String> urls = getURLs(sreq, shard);
          if (urls.size()==0) {
              throw new IllegalArgumentException("The shard argument doesn't contain any valid URLs. got " + shard);
          }
          
          Show
          Marius Grama added a comment - - edited Attached a small patch that could possibly be used as solution for this issue. I think throwing of the exception from HttpShardHandler#submit() method (mentioned in the above comment) should not be kept anymore in its current form. When there are no replica urls available there's no more need to submit the callable function. HttpShardHandler#submit final List< String > urls = getURLs(sreq, shard); if (urls.size()==0) { throw new IllegalArgumentException( "The shard argument doesn't contain any valid URLs. got " + shard); }
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Marius. Considering that urls can be non null if at least one slice has active replicas, we should detect this situation and throw an exception inside the HttpShardHandler.checkDistributed() method itself.

          Show
          Shalin Shekhar Mangar added a comment - Thanks Marius. Considering that urls can be non null if at least one slice has active replicas, we should detect this situation and throw an exception inside the HttpShardHandler.checkDistributed() method itself.
          Hide
          Marius Grama added a comment -

          Shalin Shekhar Mangar I don't understand what you imply in your last statement. The patch attached to this issue checks already within HttpShardHandler.checkDistributed method whether a shard doesn't have any active replicas.

          Show
          Marius Grama added a comment - Shalin Shekhar Mangar I don't understand what you imply in your last statement. The patch attached to this issue checks already within HttpShardHandler.checkDistributed method whether a shard doesn't have any active replicas.
          Hide
          Shalin Shekhar Mangar added a comment -

          The patch attached to this issue checks already within HttpShardHandler.checkDistributed method whether a shard doesn't have any active replicas.

          I am sorry, I misread your earlier patch. I thought that you were checking the complete URL string instead of one slice's URL string. You should also handle the shards.tolerant parameter (see ShardParams.SHARDS_TOLERANT). If this param is true then we cannot throw an exception in HttpShardHandler.checkDistributed and we should continue with the empty string.

          Show
          Shalin Shekhar Mangar added a comment - The patch attached to this issue checks already within HttpShardHandler.checkDistributed method whether a shard doesn't have any active replicas. I am sorry, I misread your earlier patch. I thought that you were checking the complete URL string instead of one slice's URL string. You should also handle the shards.tolerant parameter (see ShardParams.SHARDS_TOLERANT). If this param is true then we cannot throw an exception in HttpShardHandler.checkDistributed and we should continue with the empty string.
          Hide
          Marius Grama added a comment -

          I've added the modifications that you've requested on this new patch.
          I didn't know about the shards.tolerant parameter handling before.

          Do you see a way to unit test the changes from this patch?

          Show
          Marius Grama added a comment - I've added the modifications that you've requested on this new patch. I didn't know about the shards.tolerant parameter handling before. Do you see a way to unit test the changes from this patch?
          Hide
          Shalin Shekhar Mangar added a comment -

          There was no test which tested shards.tolerant with a cloud setup so I wrote one. This is ready. I'll commit shortly.

          Show
          Shalin Shekhar Mangar added a comment - There was no test which tested shards.tolerant with a cloud setup so I wrote one. This is ready. I'll commit shortly.
          Hide
          ASF subversion and git services added a comment -

          Commit 1685153 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1685153 ]

          SOLR-7566: Search requests should return the shard name that is down

          Show
          ASF subversion and git services added a comment - Commit 1685153 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1685153 ] SOLR-7566 : Search requests should return the shard name that is down
          Hide
          ASF subversion and git services added a comment -

          Commit 1685154 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1685154 ]

          SOLR-7566: Search requests should return the shard name that is down

          Show
          ASF subversion and git services added a comment - Commit 1685154 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1685154 ] SOLR-7566 : Search requests should return the shard name that is down
          Hide
          Shalin Shekhar Mangar added a comment -

          Thanks Marius!

          Show
          Shalin Shekhar Mangar added a comment - Thanks Marius!
          Hide
          Marius Grama added a comment -

          Shalin Shekhar Mangar thank you for setting up the test case. It looks very good and is straightforward to read. I've taken the liberty of doing some code cosmetics on it.It's up to you whether you include them in SVN.

          Show
          Marius Grama added a comment - Shalin Shekhar Mangar thank you for setting up the test case. It looks very good and is straightforward to read. I've taken the liberty of doing some code cosmetics on it.It's up to you whether you include them in SVN.
          Hide
          ASF subversion and git services added a comment -

          Commit 1685179 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1685179 ]

          SOLR-7566: Code cosmetics

          Show
          ASF subversion and git services added a comment - Commit 1685179 from shalin@apache.org in branch 'dev/trunk' [ https://svn.apache.org/r1685179 ] SOLR-7566 : Code cosmetics
          Hide
          ASF subversion and git services added a comment -

          Commit 1685180 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1685180 ]

          SOLR-7566: Code cosmetics

          Show
          ASF subversion and git services added a comment - Commit 1685180 from shalin@apache.org in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1685180 ] SOLR-7566 : Code cosmetics
          Hide
          Shalin Shekhar Mangar added a comment -

          Committed, thanks Marius Grama!

          Show
          Shalin Shekhar Mangar added a comment - Committed, thanks Marius Grama !
          Hide
          Shalin Shekhar Mangar added a comment -

          Bulk close for 5.3.0 release

          Show
          Shalin Shekhar Mangar added a comment - Bulk close for 5.3.0 release

            People

            • Assignee:
              Shalin Shekhar Mangar
              Reporter:
              Shalin Shekhar Mangar
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development