Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-8461

CloudSolrStream and ParallelStream can choose replicas that are not active

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      Currently CloudSolrStream and ParallelStream don't check the state of the replicas they route requests to. This can result in replicas that are not active receiving request.

      1. SOLR-8461.patch
        3 kB
        Joel Bernstein
      2. SOLR-8461.patch
        7 kB
        Cao Manh Dat
      3. SOLR-8461.patch
        6 kB
        Cao Manh Dat
      4. SOLR-8461.patch
        5 kB
        Cao Manh Dat
      5. SOLR-8461.patch
        2 kB
        Cao Manh Dat
      6. SOLR-8461.patch
        5 kB
        Cao Manh Dat

        Issue Links

          Activity

          Hide
          caomanhdat Cao Manh Dat added a comment -

          Joel Bernstein I will try to working on this issue. Can you assign it to me?

          Show
          caomanhdat Cao Manh Dat added a comment - Joel Bernstein I will try to working on this issue. Can you assign it to me?
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I'm not sure if I can assign it to you. I believe a committer is typically the assignee. But feel free to post a patch and I'll review and commit when it's ready. You'll be referenced in CHANGES.txt.

          Show
          joel.bernstein Joel Bernstein added a comment - I'm not sure if I can assign it to you. I believe a committer is typically the assignee. But feel free to post a patch and I'll review and commit when it's ready. You'll be referenced in CHANGES.txt.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          First attempt,
          I just dont know how to write unit test for this issue. Any hint?

          Show
          caomanhdat Cao Manh Dat added a comment - First attempt, I just dont know how to write unit test for this issue. Any hint?
          Hide
          joel.bernstein Joel Bernstein added a comment -

          I think all needs to be done here is to check the state of replica before adding to the list of replicas to choose from.

          As long as all existing tests pass after this change then it should be committed.

          Show
          joel.bernstein Joel Bernstein added a comment - I think all needs to be done here is to check the state of replica before adding to the list of replicas to choose from. As long as all existing tests pass after this change then it should be committed.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          It quite trivial patch if you said so. I just afraid of replica being down after create SolrStream and before open it.

          Show
          caomanhdat Cao Manh Dat added a comment - It quite trivial patch if you said so. I just afraid of replica being down after create SolrStream and before open it.
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Yes, this is a quick fix just to resolve the choosing of inactive replicas.

          From a testing standpoint we can investigate the Chaos Monkey behavior on the streaming tests.

          Show
          joel.bernstein Joel Bernstein added a comment - Yes, this is a quick fix just to resolve the choosing of inactive replicas. From a testing standpoint we can investigate the Chaos Monkey behavior on the streaming tests.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          Added unit test.

          Show
          caomanhdat Cao Manh Dat added a comment - Added unit test.
          Hide
          varunthacker Varun Thacker added a comment -

          Hi Cao Manh Dat ,

          Patch looks great! Couple of comments -
          1. With the patch we check if the replica is active, we should also also check if the replica's nodeName belongs in the live nodes list. This is required for a scenario like this - Someone kills the node using "kill -9 <pid>" / OOM crash . In the cluster state that replica will still show us "active". HttpSolrCall#getCoreByCollection does the same thing
          2. Maybe move the code into a function - say "getActiveReplicasForCollection" ? The both CloudSolrStream and ParallelSteam can reuse it.

          Show
          varunthacker Varun Thacker added a comment - Hi Cao Manh Dat , Patch looks great! Couple of comments - 1. With the patch we check if the replica is active, we should also also check if the replica's nodeName belongs in the live nodes list. This is required for a scenario like this - Someone kills the node using "kill -9 <pid>" / OOM crash . In the cluster state that replica will still show us "active". HttpSolrCall#getCoreByCollection does the same thing 2. Maybe move the code into a function - say "getActiveReplicasForCollection" ? The both CloudSolrStream and ParallelSteam can reuse it.
          Hide
          caomanhdat Cao Manh Dat added a comment - - edited

          1. With the patch we check if the replica is active, we should also also check if the replica's nodeName belongs in the live nodes list. This is required for a scenario like this - Someone kills the node using "kill -9 <pid>" / OOM crash . In the cluster state that replica will still show us "active". HttpSolrCall#getCoreByCollection does the same thing

          Very nice suggestion, I Will update it now.

          2. Maybe move the code into a function - say "getActiveReplicasForCollection" ? The both CloudSolrStream and ParallelSteam can reuse it.

          Cant do that. In ParallelStream, we just wanna find enough workers, dont care about what collection/shard that replica belong to. In CloudSolrStream, for each collection we will randomly find a replica per shard.

          Show
          caomanhdat Cao Manh Dat added a comment - - edited 1. With the patch we check if the replica is active, we should also also check if the replica's nodeName belongs in the live nodes list. This is required for a scenario like this - Someone kills the node using "kill -9 <pid>" / OOM crash . In the cluster state that replica will still show us "active". HttpSolrCall#getCoreByCollection does the same thing Very nice suggestion, I Will update it now. 2. Maybe move the code into a function - say "getActiveReplicasForCollection" ? The both CloudSolrStream and ParallelSteam can reuse it. Cant do that. In ParallelStream, we just wanna find enough workers, dont care about what collection/shard that replica belong to. In CloudSolrStream, for each collection we will randomly find a replica per shard.
          Hide
          caomanhdat Cao Manh Dat added a comment -

          New patch based on Varun Thacker suggestion.

          Show
          caomanhdat Cao Manh Dat added a comment - New patch based on Varun Thacker suggestion.
          Hide
          varunthacker Varun Thacker added a comment -

          How about we add this method to ClusterState.java and make use of this?

          public List<Replica> getActiveReplicas(String collection, String sliceName) {
              Slice slice = getSlice(collection, sliceName);
              if (slice == null) return null;
              List<Replica> activeReplicas = new ArrayList<>();
              for (Replica replica : slice.getReplicas()) {
                if (liveNodes.contains(replica.getNodeName()) && replica.getState() == Replica.State.ACTIVE) {
                  activeReplicas.add(replica);
                }
              }
              return activeReplicas;
            }
          
          Show
          varunthacker Varun Thacker added a comment - How about we add this method to ClusterState.java and make use of this? public List<Replica> getActiveReplicas( String collection, String sliceName) { Slice slice = getSlice(collection, sliceName); if (slice == null ) return null ; List<Replica> activeReplicas = new ArrayList<>(); for (Replica replica : slice.getReplicas()) { if (liveNodes.contains(replica.getNodeName()) && replica.getState() == Replica.State.ACTIVE) { activeReplicas.add(replica); } } return activeReplicas; }
          Hide
          caomanhdat Cao Manh Dat added a comment -

          New patch based on the snippet of Varun Thacker

          Show
          caomanhdat Cao Manh Dat added a comment - New patch based on the snippet of Varun Thacker
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          Thanks for the patch Cao Manh Dat. I'll start reviewing the patch shortly. I'm also considering adding an Executor to the StreamContext so each stream doesn't create an executer of it's own for connecting to the replicas. But I'll start from the patch provided by Cao Manh Dat and expand from there.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited Thanks for the patch Cao Manh Dat . I'll start reviewing the patch shortly. I'm also considering adding an Executor to the StreamContext so each stream doesn't create an executer of it's own for connecting to the replicas. But I'll start from the patch provided by Cao Manh Dat and expand from there.
          Hide
          stephlag Stephan Lagraulet added a comment -

          I'm trying to gather all issues related to SolrCloud that affects Solr 5.4. Can you affect SolrCloud component to this issue ?

          Show
          stephlag Stephan Lagraulet added a comment - I'm trying to gather all issues related to SolrCloud that affects Solr 5.4. Can you affect SolrCloud component to this issue ?
          Hide
          joel.bernstein Joel Bernstein added a comment -

          This bug is only in trunk so it only effects the 6.0 release

          Show
          joel.bernstein Joel Bernstein added a comment - This bug is only in trunk so it only effects the 6.0 release
          Hide
          joel.bernstein Joel Bernstein added a comment -

          Actually I'm wrong. These classes are in 5x. I'll update the component.

          Show
          joel.bernstein Joel Bernstein added a comment - Actually I'm wrong. These classes are in 5x. I'll update the component.
          Hide
          joel.bernstein Joel Bernstein added a comment - - edited

          Patch with minimal changes to address the issue. With Solr 6.0 coming very soon, I felt it made sense to make this patch as small as possible.

          I will open another ticket to explore some of the ideas in this ticket around the resiliency of streaming requests.

          Show
          joel.bernstein Joel Bernstein added a comment - - edited Patch with minimal changes to address the issue. With Solr 6.0 coming very soon, I felt it made sense to make this patch as small as possible. I will open another ticket to explore some of the ideas in this ticket around the resiliency of streaming requests.
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 8eb58cc00015f65d120f4ebd921cc4be2ee4c30d in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8eb58cc ]

          SOLR-8461: CloudSolrStream and ParallelStream can choose replicas that are not active

          Show
          jira-bot ASF subversion and git services added a comment - Commit 8eb58cc00015f65d120f4ebd921cc4be2ee4c30d in lucene-solr's branch refs/heads/master from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=8eb58cc ] SOLR-8461 : CloudSolrStream and ParallelStream can choose replicas that are not active
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit 35337e8cf278ab445c3a8d1b5d256d80fb23aa7e in lucene-solr's branch refs/heads/master from jbernste
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=35337e8 ]

          SOLR-8461: Update CHANGES.txt

          Show
          jira-bot ASF subversion and git services added a comment - Commit 35337e8cf278ab445c3a8d1b5d256d80fb23aa7e in lucene-solr's branch refs/heads/master from jbernste [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=35337e8 ] SOLR-8461 : Update CHANGES.txt
          Hide
          joel.bernstein Joel Bernstein added a comment -
          Show
          joel.bernstein Joel Bernstein added a comment - Thanks Cao Manh Dat and Varun Thacker !

            People

            • Assignee:
              joel.bernstein Joel Bernstein
              Reporter:
              joel.bernstein Joel Bernstein
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development