[SOLR-5593] shard leader loss due to ZK session expiry

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.7, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      The problem we saw was that the shard leader ceased to be shard leader (in our case due to its ZooKeeper session expiring). The followers thus rejected update requests (DistributedUpdateProcessor setupRequest's call to ZkStateReader getLeaderRetry) and the leader asked them to recover (DistributedUpdateProcessor doFinish). The followers published themselves as recovering (CoreAdminHandler handleRequestRecoveryAction) and the shard leader loss triggered an election in which none of the followers became the leader due to their recovering state (ShardLeaderElectionContext shouldIBeLeader). The former shard leader also did not become shard leader because its new seq number placed it after the existing replicas (LeaderElector checkIfIamLeader seq <= intSeqs.get(0)).
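
      For illustration, a small self-contained toy (not the actual Solr source; the class and method names below are made up for this sketch) of the ordering rule behind the LeaderElector checkIfIamLeader check mentioned above: only the election node with the lowest ZooKeeper sequence number may attempt to become leader, and a re-registered former leader always receives a new, higher number.

        import java.util.Arrays;
        import java.util.Collections;
        import java.util.List;

        // Toy model of the "seq <= intSeqs.get(0)" election check (illustrative only,
        // not Solr source).
        public class ElectionOrderDemo {
            static boolean mayAttemptLeadership(int mySeq, List<Integer> allSeqs) {
                List<Integer> sorted = new java.util.ArrayList<>(allSeqs);
                Collections.sort(sorted);
                // Only the node holding the lowest sequence number is allowed to try.
                return !sorted.isEmpty() && mySeq <= sorted.get(0);
            }

            public static void main(String[] args) {
                // Followers registered earlier hold seqs 1 and 2; the former leader's session
                // expired, so it re-registered and was assigned the new, higher seq 3.
                List<Integer> seqs = Arrays.asList(1, 2, 3);
                System.out.println(mayAttemptLeadership(3, seqs)); // false: former leader waits in line
                System.out.println(mayAttemptLeadership(1, seqs)); // true: but seq 1 is stuck in "recovering"
            }
        }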

      Attachments

      1. CoreAdminHandler.patch (2 kB) - Christine Poerschke


          Activity

          Christine Poerschke added a comment -

          Attaching one potential solution (we are investigating others):

          As part of the recovery process, state=recovering publishing already happens (RecoveryStrategy doRecovery), but only after a shard leader to recover from has been found. If the CoreAdminHandler handleRequestRecoveryAction publish had not happened, then one of the followers should have been elected shard leader.
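
          To make the effect concrete, here is a minimal toy model (illustrative only, not the attached patch; the real check lives in ShardLeaderElectionContext shouldIBeLeader): a replica whose last published state is not active refuses leadership, so publishing state=recovering before the election runs leaves the shard leaderless, whereas deferring the publish lets a follower win.

            import java.util.LinkedHashMap;
            import java.util.Map;

            // Toy model (illustrative only, not Solr source) of the "last published state"
            // check that blocks recovering replicas from winning the leader election.
            public class RecoveringStateDemo {
                static boolean shouldIBeLeader(String lastPublishedState) {
                    return "active".equals(lastPublishedState); // simplified version of the real check
                }

                public static void main(String[] args) {
                    Map<String, String> withEarlyPublish = new LinkedHashMap<>();
                    withEarlyPublish.put("replica1", "recovering"); // published by handleRequestRecoveryAction
                    withEarlyPublish.put("replica2", "recovering");

                    Map<String, String> withoutEarlyPublish = new LinkedHashMap<>();
                    withoutEarlyPublish.put("replica1", "active");  // publish deferred to RecoveryStrategy doRecovery
                    withoutEarlyPublish.put("replica2", "active");

                    withEarlyPublish.forEach((r, s) -> System.out.println(r + " may lead: " + shouldIBeLeader(s)));    // all false
                    withoutEarlyPublish.forEach((r, s) -> System.out.println(r + " may lead: " + shouldIBeLeader(s))); // all true
                }
            }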

          Mark Miller added a comment -

          Great find!

          The former shard leader also did not become shard leader because its new seq number placed it after the existing replicas (LeaderElector checkIfIamLeader seq <= intSeqs.get(0)).

          This I'm surprised to see - when someone cannot be the leader, for instance if they published a non-active state last, they should get back in line - the original leader should have its chance again...

          Mark Miller added a comment -

          I think your change is probably good in general. Let's see what else we can do here.

          One thing that seems kind of silly is that those replicas reject the updates at all. It seems like perhaps we should relax things a bit so that they would be accepted.

          Christine Poerschke added a comment -

          One thing that seems kind of silly is that those replicas reject the updates at all. It seems like perhaps we should relax things a bit so that they would be accepted.

          Yes, we are working on changes to DistributedUpdateProcessor to relax the requirement for the getLeaderRetry to succeed within setupRequest (if phase is DistribPhase.FROMLEADER and the shard state shows it could not be subShardLeader then getLeaderRetry success should be optional).
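
          Sketching the shape of that relaxation (illustrative only; getLeaderRetry below is a stand-in that simulates the leaderless shard, not the real ZkStateReader API): attempt the leader lookup as today, but if it fails and the update was forwarded from the leader to a core that cannot be a sub-shard leader, carry on without a resolved leader instead of rejecting the update.

            // Illustrative sketch only, not the eventual DistributedUpdateProcessor change.
            public class OptionalLeaderLookupDemo {
                static String lookupLeaderOrNull(boolean phaseIsFromLeader, boolean couldBeSubShardLeader) {
                    try {
                        return getLeaderRetry();
                    } catch (RuntimeException noLeaderYet) {
                        if (phaseIsFromLeader && !couldBeSubShardLeader) {
                            return null; // relaxed: accept the forwarded update without a resolved leader
                        }
                        throw noLeaderYet; // otherwise keep the old behaviour and reject the update
                    }
                }

                static String getLeaderRetry() { // stand-in for ZkStateReader getLeaderRetry
                    throw new RuntimeException("no leader elected (simulating the leaderless shard)");
                }

                public static void main(String[] args) {
                    System.out.println(lookupLeaderOrNull(true, false)); // null: the update would be accepted
                }
            }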

          Steve Rowe added a comment -

          I didn't mean to reassign - I was typing in another window, accidentally hit the mouse button with my elbow, which focused the browser window with this issue up, and then I guess JIRA interpreted my random typing as keyboard shortcuts ....

          Mark Miller added a comment -

          Yes, we are working on changes to DistributedUpdateProcessor to relax the requirement for the getLeaderRetry to succeed within setupRequest (if phase is DistribPhase.FROMLEADER and the shard state shows it could not be subShardLeader then getLeaderRetry success should be optional).

          Yeah, on some thought, this is the right approach I think. Removing the publish is probably not a good idea. It actually protects us from losing data - we don't want a replica that was asked to recover to become the leader, because it may be missing updates that were accepted and that it is expected to have. If the previous leader died before one of the replicas became the leader, that leader might have been ahead. In this case, we don't choose a new leader, because you should really restart the whole shard with all the replicas you can, to avoid any possible data loss.

          Mark Miller added a comment -

          within setupRequest

          The tough case seems to be nailing delete by query - I've been peeking a bit at it.

          Mark Miller added a comment -

          Yes, we are working on changes to DistributedUpdateProcessor to relax

          Any progress? I'll likely look at attacking this soon.

          Christine Poerschke added a comment -

          Uploaded https://github.com/apache/lucene-solr/pull/27 which, rather than relaxing the error handling for the getLeaderRetry call, tries to avoid the call entirely in the first place (if circumstances seem to permit it, i.e. the request said it came from the leader, we don't think we are the leader, and we could not be a sub-shard leader).
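
          Roughly, the decision the pull request is described as making (a simplification with made-up flag names, not the PR itself): skip the getLeaderRetry lookup in setupRequest altogether when all three of those conditions hold.

            // Illustrative simplification of the "skip the leader lookup" condition described
            // above; the flag names are made up for this sketch and do not appear in the PR.
            public class SkipLeaderLookupDemo {
                static boolean needLeaderLookup(boolean requestSaysFromLeader,
                                                boolean weThinkWeAreLeader,
                                                boolean couldBeSubShardLeader) {
                    return !(requestSaysFromLeader && !weThinkWeAreLeader && !couldBeSubShardLeader);
                }

                public static void main(String[] args) {
                    // A follower receiving a forwarded update from a leader that just lost its ZK session:
                    System.out.println(needLeaderLookup(true, false, false));  // false: apply the update, no lookup
                    // The normal path, e.g. a request that was not forwarded by the leader:
                    System.out.println(needLeaderLookup(false, false, false)); // true: resolve the leader as before
                }
            }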

          Mark Miller added a comment -

          Great, thanks Christine! Patch looks good on first glance.

          ASF subversion and git services added a comment -

          Commit 1565049 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1565049 ]

          SOLR-5593: Replicas should accept the last updates from a leader that has just lost its connection to ZooKeeper.

          ASF subversion and git services added a comment -

          Commit 1565053 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1565053 ]

          SOLR-5593: Replicas should accept the last updates from a leader that has just lost its connection to ZooKeeper.

          Mark Miller added a comment -

          Thanks Christine!

          ASF GitHub Bot added a comment -

          Github user cpoerschke closed the pull request at:

          https://github.com/apache/lucene-solr/pull/27


            People

            • Assignee: Mark Miller
            • Reporter: Christine Poerschke
            • Votes: 0
            • Watchers: 10
