SOLR-3812

ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      When we lose our connection to ZooKeeper due to connection loss (one that does not lead to session expiration), we can drop updates when replaying buffered updates and still think we have successfully recovered.

      We need to detect this and fail recovery when it happens. We should also increase how long we wait for reconnection when an update arrives and we have lost our connection to ZooKeeper (up to the session timeout).
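The proposed wait can be sketched as a bounded poll: before processing the update, block until the client reports a live ZooKeeper connection or the session timeout elapses, and fail recovery on timeout. This is a minimal, hypothetical sketch — Solr's actual code path goes through its own connection manager; the names and polling approach below are illustrative only:

```java
import java.util.function.BooleanSupplier;

public class ReconnectWait {

    /**
     * Poll until isConnected reports true or sessionTimeoutMs elapses.
     * Returns true if the connection came back in time, false on timeout
     * (in which case the caller should fail the update/recovery rather
     * than silently dropping it).
     */
    public static boolean waitForConnection(BooleanSupplier isConnected,
                                            long sessionTimeoutMs,
                                            long pollMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + sessionTimeoutMs;
        while (!isConnected.getAsBoolean()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // waited up to the session timeout; give up
            }
            Thread.sleep(Math.min(pollMs, remaining));
        }
        return true;
    }
}
```

Capping the wait at the session timeout matters: once the session expires, ZooKeeper will treat the node as gone anyway, so waiting longer cannot make the buffered update safe to apply.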

        Activity

        Commit Tag Bot added a comment -

        [branch_4x commit] Mark Robert Miller
        http://svn.apache.org/viewvc?view=revision&revision=1384937

        SOLR-3833: When an election is started because a leader went down, the new leader candidate should decline if the last state they published was not active.

        SOLR-3836: When doing peer sync, we should only count sync attempts that cannot reach the given host as success when the candidate leader is syncing with the replicas - not when replicas are syncing to the leader.

        SOLR-3835: In our leader election algorithm, if on connection loss we found we did not create our election node, we should retry, not throw an exception.

        SOLR-3834: A new leader on cluster startup should also run the leader sync process in case there was a bad cluster shutdown.

        SOLR-3772: On cluster startup, we should wait until we see all registered replicas before running the leader process - or if they all do not come up, N amount of time.

        SOLR-3756: If we are elected the leader of a shard, but we fail to publish this for any reason, we should clean up and re-trigger a leader election.

        SOLR-3812: ConnectionLoss during recovery can cause lost updates, leading to shard inconsistency.

        SOLR-3813: When a new leader syncs, we need to ask all shards to sync back, not just those that are active.

        SOLR-3807: Currently during recovery we pause for a number of seconds after waiting for the leader to see a recovering state so that any previous updates will have finished before our commit on the leader - we don't need this wait for peersync.

        SOLR-3837: When a leader is elected and asks replicas to sync back to it and that fails, we should ask those nodes to recover asynchronously rather than synchronously.
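The SOLR-3835 fix in the list above amounts to treating connection loss during election-node creation as transient: re-attempt the ZooKeeper operation instead of propagating the exception. A hypothetical sketch under that assumption — the `ZkOp` interface and the stand-in `ConnectionLossException` below are illustrative, not Solr's or ZooKeeper's actual API:

```java
public class RetryOnConnectionLoss {

    /** Illustrative stand-in for a transient ZooKeeper connection-loss error. */
    public static class ConnectionLossException extends Exception {}

    /** A ZooKeeper operation (e.g. creating the election node) that may hit connection loss. */
    @FunctionalInterface
    public interface ZkOp<T> {
        T run() throws ConnectionLossException;
    }

    /**
     * Run op, retrying up to maxRetries times on connection loss rather than
     * throwing immediately; rethrow only once the retry budget is exhausted.
     */
    public static <T> T withRetries(ZkOp<T> op, int maxRetries) throws ConnectionLossException {
        ConnectionLossException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.run();
            } catch (ConnectionLossException e) {
                last = e; // transient: retry instead of failing the election
            }
        }
        throw last;
    }
}
```

The point of the retry is that connection loss (unlike session expiration) does not invalidate work already done on the server, so the safe response is to re-check and re-attempt rather than abort the election.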

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Mark Miller
          • Reporter: Mark Miller
          • Votes: 0
          • Watchers: 2
