Solr / SOLR-8034

If minRF is not satisfied, leader should not put replicas in recovery

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: SolrCloud
    • Flags: Patch

      Description

      If the minimum replication factor parameter (minRf) in a Solr update request is not satisfied – i.e. if the update was not successfully applied on at least n replicas where n >= minRf – the shard leader should not put the failed replicas into "leader initiated recovery"; the client should retry the update instead.

      This is so that, in the scenario where minRf is not satisfied, the failed replicas remain eligible to become the leader in case of leader failure, since from the client's perspective the update did not succeed.

      This came up from a network partition scenario where the leader became sectioned off from its two followers, but all of them could still talk to ZooKeeper. The partitioned leader put its two followers into leader initiated recovery, so we couldn't just kill off the partitioned node and have a follower take over leadership. For the minRf=1 case this is the correct behavior, because the partitioned leader would have accepted updates that the followers don't have, and therefore we can't switch leadership or we'd lose those updates. However, in the minRf=2 case, Solr never accepted any update from the client's point of view, so the partitioned leader doesn't have any accepted update that the followers lack, and therefore the followers should be eligible to become leaders. Thus, I'm proposing modifying the leader initiated recovery logic to not put the followers in recovery if the minRf parameter is present and is not satisfied.
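      To make the client-side contract concrete, the following is a minimal SolrJ sketch of an update sent with a minimum replication factor and retried when it is not achieved. It is illustrative only: the "min_rf" request parameter and the getMinAchievedReplicationFactor helper are used as generally understood for this era of Solr and are not taken from this patch, and the collection name, ZooKeeper address, and retry policy are assumptions.

        import org.apache.solr.client.solrj.impl.CloudSolrClient;
        import org.apache.solr.client.solrj.request.UpdateRequest;
        import org.apache.solr.client.solrj.response.UpdateResponse;
        import org.apache.solr.common.SolrInputDocument;

        public class MinRfRetryExample {
          public static void main(String[] args) throws Exception {
            final int minRf = 2;                              // require the update on at least 2 replicas
            final String collection = "test_collection";      // hypothetical collection name
            try (CloudSolrClient client = new CloudSolrClient("localhost:2181")) {  // hypothetical ZK address
              client.setDefaultCollection(collection);

              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "1");

              UpdateRequest req = new UpdateRequest();
              req.add(doc);
              req.setParam("min_rf", String.valueOf(minRf));  // ask Solr to report the achieved replication factor

              for (int attempt = 0; attempt < 3; attempt++) {
                UpdateResponse rsp = req.process(client);
                // Minimum replication factor achieved across the shards this update touched.
                int achievedRf = client.getMinAchievedReplicationFactor(collection, rsp.getResponse());
                if (achievedRf >= minRf) {
                  return;                                     // enough replicas applied the update
                }
                Thread.sleep(1000);                           // back off, then retry the same update
              }
              throw new RuntimeException("min_rf=" + minRf + " was not achieved after retries");
            }
          }
        }

      With the change proposed here, a follower that misses such an update is no longer put into leader initiated recovery when minRf is not achieved, so it stays eligible for leader election and a retry like the one above can still succeed after leadership moves.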

        Attachments

      1. SOLR-8034.patch
        9 kB
        Jessica Cheng Mallet
      2. SOLR-8034.patch
        9 kB
        Jessica Cheng Mallet


          Activity

          Jessica Cheng Mallet added a comment -

          Tim Potter This is what we discussed a while ago. Will you please give it a look? Thanks!

          Anshum Gupta added a comment - edited

          I'm kind of split on this, as the replica here would be out of sync with the leader and would never know about it, increasing the odds of inconsistency when the client doesn't handle it the right way; i.e. it kind of self-heals at this point, and that would stop happening.

          At the same time, I kind of like the idea as it allows for failover in the case you've mentioned above.

          For the patch, LGTM. Can you also test that the replica is down, right after the first network partition, but before you send the document? That would ensure that the replication factor is 2, because of the partition and not due to another variable.

          Also, can you fix the assertion-failure message below? You're expecting 2 non-leader replicas, not 2 replicas:

          +    List<Replica> notLeaders =
          +        ensureAllReplicasAreActive(testCollectionName, "shard1", 1, 3, maxWaitSecsToSeeAllActive);
          +    assertTrue("Expected 2 replicas for collection " + testCollectionName
          +            + " but found " + notLeaders.size() + "; clusterState: "
          +            + printClusterStateInfo(testCollectionName),
          +        notLeaders.size() == 2);
          
          Tim Potter added a comment -

          Hi Jessica Cheng Mallet. I don't recall talking with you - perhaps you meant someone else?

          Jessica Cheng Mallet added a comment -

          Anshum Gupta, I fixed the comment for the assertion, but I didn't add the test that the replica is down after the first network partition, because the point is that the replica will not realize on its own that it's down, since the partition is between the leader and the replica, not between the replica and ZooKeeper – so it won't be set to "down" until the leader tries to forward the document to it, fails, and then puts it in leader-initiated recovery.

          Tim Potter, we discussed this in ticket 4072.

          Jessica Cheng Mallet added a comment -

          Ah, and regarding

          I'm kind of split on this as the replica here would be out of sync from the leader and would never know about it, increasing the odds of inconsistency when the client doesn't handle it the right way i.e. it kind of self-heals at this point, and that would stop happening.

          I'd hope that if the user is explicitly using minRf, they handle it the right way (i.e. retry if minRf isn't achieved). The contract would be that if the request fails, it needs to be retried, or we can possibly see inconsistent state. I think this is already true today for a normal update: if the forwarded parallel update to the followers succeeds but it somehow fails on the leader, a failure would be returned to the user even though the update could be present on the followers.

          Timothy Potter added a comment -

          Tim Potter oops! Jessica was actually pinging me, thelabdude (same name, different handle).

          Jessica Cheng Mallet this looks good to me ... nice test case! Also, I agree that if the client is using minRf, then it is their responsibility to handle the response correctly. Previously, we talked about throwing an exception instead of just returning the value for the client to interpret; maybe that makes it more explicit that clients MUST handle minRf not being achieved. We should handle that in another ticket, though.

          Anshum Gupta added a comment -

          Thanks for fixing the assert.

          replica will not realize it's down on its own since the partition is between the leader and the replica, not between the replica and zookeeper – so it won't be set to down until the leader tries to forward the document to it and fails

          Right, should've realized that.

          Also, about my opinion being split: I wasn't completely sold on this at first, but I thought about it more and it makes more sense to go with this.

          Thanks Jessica Cheng Mallet. LGTM overall, I'll commit this.

          Jessica Cheng Mallet added a comment -

          Oops, sorry! Didn't know there's another Tim Potter.

          ASF subversion and git services added a comment -

          Commit 1703289 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1703289 ]

          SOLR-8034: Leader no longer puts replicas in recovery in case of a failed update, when minRF isn't achieved.

          ASF subversion and git services added a comment -

          Commit 1703302 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1703302 ]

          SOLR-8034: Leader no longer puts replicas in recovery in case of a failed update, when minRF isn't achieved. (merge from trunk)

          Mark Miller added a comment -

          Previously, we talked about throwing an exception instead of just returning the value for the client to interpret; maybe that makes it more explicit that clients MUST handle minRf not being achieved.

          I still think that is how this should work.

          Timothy Potter added a comment -

          Cool - I created SOLR-8062 to handle that change.

          Anshum Gupta added a comment -

          Thanks Jessica Cheng Mallet.

            People

            • Assignee: Anshum Gupta
            • Reporter: Jessica Cheng Mallet
            • Votes: 0
            • Watchers: 5
