[SOLR-5552] Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered. - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.6.1, 4.7, 6.0
Component/s: SolrCloud
Labels:
- leader
- recovery

Description

This is one particular issue that leads to out-of-sync shards; it is related to SOLR-4260.

Here's what I know so far, which admittedly isn't much:
As cloud85 (replica before it crashed) is initializing, it enters the wait process in ShardLeaderElectionContext#waitForReplicasToComeUp; this is expected and a good thing.
A short time later, cloud84 (the leader before it crashed) begins initializing and reaches the point where it adds itself as a possible leader for the shard by creating a znode under /collections/cloud/leaders_elect/shard1/election. This allows cloud85 to return from waitForReplicasToComeUp and try to determine who should be the leader.
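The wait-and-return behavior described above can be modelled roughly as follows. This is a simplified sketch, not Solr's actual waitForReplicasToComeUp code: the class name, the method `waitForCandidates`, and its parameters are hypothetical, and the `IntSupplier` stands in for counting the znodes under the election path. It illustrates that the wait ends as soon as a peer has registered in ZooKeeper, which says nothing about whether that peer can answer HTTP requests yet.

```java
import java.util.function.IntSupplier;

public class WaitForReplicasSketch {

    /**
     * Polls until at least `expected` leader candidates have registered,
     * or until the timeout elapses. Returns true if the candidates showed up.
     * (Hypothetical helper; registeredCandidates stands in for a ZooKeeper lookup.)
     */
    public static boolean waitForCandidates(IntSupplier registeredCandidates,
                                            int expected,
                                            long timeoutMs,
                                            long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (registeredCandidates.getAsInt() >= expected) {
                // Registration only proves the peer reached this point in startup,
                // not that its HTTP listener is accepting connections.
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out waiting for the other replicas
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Both nodes registered: the wait ends immediately.
        System.out.println(waitForCandidates(() -> 2, 2, 200, 10));
        // Only one node registered: the wait times out.
        System.out.println(waitForCandidates(() -> 1, 2, 200, 10));
    }
}
```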
cloud85 then tries to run the SyncStrategy, which can never work in this scenario because the Jetty HTTP listener is not yet active on either node, so all replication work that uses HTTP requests fails on both nodes. PeerSync treats these failures as an indication that the other replicas in the shard are unavailable and counts the sync as a success. Here's the log message:
2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success
The Jetty HTTP listener doesn't start accepting connections until long after this process has completed and already selected the wrong leader.
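To make the failure mode concrete, here is a minimal sketch of the decision the log message implies, next to a stricter alternative. It is not PeerSync's real code: the `PeerResult` enum and both method names are hypothetical, and the stricter policy is only an illustration, not the change made in the attached patches.

```java
import java.util.List;

public class PeerSyncSketch {

    /** Hypothetical outcomes of asking one peer for its recent updates. */
    public enum PeerResult { IN_SYNC, SYNC_FAILED, CONNECT_FAILED }

    /**
     * Lenient policy, modelled on the log message: a peer we could not
     * connect to is counted as a success.
     */
    public static boolean lenientSyncSucceeded(List<PeerResult> results) {
        for (PeerResult r : results) {
            if (r == PeerResult.SYNC_FAILED) {
                return false;
            }
            // CONNECT_FAILED falls through: "couldn't connect ..., counting as success"
        }
        return true;
    }

    /**
     * Stricter policy (an assumption for illustration): an unreachable peer
     * means we cannot tell whether we are missing updates, so sync is not
     * declared successful.
     */
    public static boolean strictSyncSucceeded(List<PeerResult> results) {
        for (PeerResult r : results) {
            if (r != PeerResult.IN_SYNC) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The scenario from the report: the only peer is not listening yet.
        List<PeerResult> restartScenario = List.of(PeerResult.CONNECT_FAILED);
        System.out.println("lenient: " + lenientSyncSucceeded(restartScenario));
        System.out.println("strict:  " + strictSyncSucceeded(restartScenario));
    }
}
```

Under the lenient policy, a replica whose peers are all still starting up passes the sync step without comparing a single update, which is how the stale replica in the scenario above can go on to become leader.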
From what I can see, we have a leader recovery process that is based partly on HTTP requests to the other nodes, but the HTTP listener on those nodes isn't active yet. We need a leader recovery process that doesn't rely on HTTP requests. Perhaps leader recovery for a shard without a current leader needs to work differently from leader election in a shard whose replicas can respond to HTTP requests? Everything I'm seeing makes perfect sense for leader election when there are active replicas and the current leader fails.
All this aside, I'm not asserting that this is the only cause for the out-of-sync issues reported in this ticket, but it definitely seems like it could happen in a real cluster.

Attachments

- SOLR-5552.patch, 22/Dec/13 20:31, 13 kB, Mark Miller
- SOLR-5552.patch, 13/Dec/13 16:21, 4 kB, Timothy Potter

Issue Links

is cloned by
- SOLR-8173 CLONE - Leader recovery process can select the wrong leader if all replicas for a shard are down and trying to recover as well as lose updates that should have been recovered. (Resolved)

is related to
- SOLR-4260 Inconsistent numDocs between leader and replica (Resolved)
- SOLR-5569 A replica should not try and recover from a leader until it has published that it is ACTIVE. (Closed)
- SOLR-5568 A SolrCore cannot decide to be the leader just because the cluster state says no other SolrCore's are active. (Closed)

People

Assignee: Mark Miller
Reporter: Timothy Potter
Votes: 0
Watchers: 10

Dates

Created: 12/Dec/13 22:09
Updated: 09/May/16 18:55
Resolved: 05/Jan/14 20:59