Solr / SOLR-3813

When a new leader syncs, we need to ask all shards to sync back, not just those that are active.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      If the new leader asks only active shards to sync back, there is a race in which a shard can complete recovery against the old leader and publish itself as active while missing the sync stage with the new leader, resulting in possible lost updates and shard inconsistency.
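
A minimal sketch of the selection change this issue asks for, written in Java. The types `ReplicaState`, `Replica`, and `SyncTargetSketch` are hypothetical and exist only for illustration; they are not Solr's actual classes or API. The sketch contrasts asking only replicas whose last published state is active with asking every other replica in the shard.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; these are not Solr classes.
enum ReplicaState { ACTIVE, RECOVERING, DOWN }

final class Replica {
    final String name;
    final ReplicaState publishedState; // last state seen in (possibly stale) cluster state

    Replica(String name, ReplicaState publishedState) {
        this.name = name;
        this.publishedState = publishedState;
    }
}

final class SyncTargetSketch {
    // Selection that filters on state: a replica that is still recovering
    // against the old leader is skipped and may later publish as active
    // without ever syncing with the new leader.
    static List<String> activeOnly(List<Replica> shardReplicas, String leaderName) {
        List<String> targets = new ArrayList<>();
        for (Replica r : shardReplicas) {
            if (!r.name.equals(leaderName) && r.publishedState == ReplicaState.ACTIVE) {
                targets.add(r.name);
            }
        }
        return targets;
    }

    // Selection described in this issue: ask every other replica in the
    // shard to sync back, regardless of its published state.
    static List<String> allReplicas(List<Replica> shardReplicas, String leaderName) {
        List<String> targets = new ArrayList<>();
        for (Replica r : shardReplicas) {
            if (!r.name.equals(leaderName)) {
                targets.add(r.name);
            }
        }
        return targets;
    }
}
```

With a leader `core1`, an active `core2`, and a recovering `core3`, the first method would ask only `core2`, while the second would ask both `core2` and `core3`.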

        Activity

        Mark Miller added a comment -

        We should also ask all shards to sync to us initially, not just active shards - better that than relying on cluster state, which can be slightly stale.

        Commit Tag Bot added a comment -

        [branch_4x commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1384937

        SOLR-3833: When an election is started because a leader went down, the new leader candidate should decline if the last state they published was not active.

        SOLR-3836: When doing peer sync, we should only count sync attempts that cannot reach the given host as success when the candidate leader is syncing with the replicas - not when replicas are syncing to the leader.

        SOLR-3835: In our leader election algorithm, if on connection loss we found we did not create our election node, we should retry, not throw an exception.

        SOLR-3834: A new leader on cluster startup should also run the leader sync process in case there was a bad cluster shutdown.

        SOLR-3772: On cluster startup, we should wait until we see all registered replicas before running the leader process - or, if they do not all come up, for N amount of time.

        SOLR-3756: If we are elected the leader of a shard, but we fail to publish this for any reason, we should clean up and re-trigger a leader election.

        SOLR-3812: ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.

        SOLR-3813: When a new leader syncs, we need to ask all shards to sync back, not just those that are active.

        SOLR-3807: Currently during recovery we pause for a number of seconds after waiting for the leader to see a recovering state so that any previous updates will have finished before our commit on the leader - we don't need this wait for peersync.

        SOLR-3837: When a leader is elected and asks replicas to sync back to him and that fails, we should ask those nodes to recover asynchronously rather than synchronously.
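
The SOLR-3836 entry above can be read as a small decision rule. The following Java sketch shows one way to express it; `SyncDirection`, `AttemptResult`, and `PeerSyncOutcomeSketch` are hypothetical names chosen for illustration and do not reflect Solr's actual PeerSync code.

```java
// Hypothetical types for illustration only; these are not Solr classes.
enum SyncDirection { CANDIDATE_LEADER_WITH_REPLICAS, REPLICA_WITH_LEADER }

enum AttemptResult { SYNCED, FAILED, UNREACHABLE }

final class PeerSyncOutcomeSketch {
    // An attempt that could not reach the given host counts as success only
    // when the candidate leader is syncing with its replicas. When a replica
    // is syncing to the leader, an unreachable host is not a success.
    static boolean countsAsSuccess(AttemptResult result, SyncDirection direction) {
        if (result == AttemptResult.SYNCED) {
            return true;
        }
        if (result == AttemptResult.UNREACHABLE) {
            return direction == SyncDirection.CANDIDATE_LEADER_WITH_REPLICAS;
        }
        return false;
    }
}
```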

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee:
            Mark Miller
            Reporter:
            Mark Miller
          • Votes:
            0
            Watchers:
            3