[SOLR-6763] Shard leader election thread can persist across session expiry - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 4.10.4, 5.0
Component/s: None
Labels: None

Description

A ZK connection loss during a call to ElectionContext.waitForReplicasToComeUp() will result in two leader election processes for the shard running within a single node - the initial election that was waiting, and another spawned by the ReconnectStrategy. After the function returns, the first election will create an ephemeral leader node. The second election will then also attempt to create this node, fail, and try to put itself into recovery. It will also set the 'isLeader' value in its CloudDescriptor to false.

The first election, meanwhile, is happily maintaining the ephemeral leader node. But any updates that are sent to the shard will cause an exception due to the mismatch between the cloudstate (where this node is the leader) and the local CloudDescriptor leader state.
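The sequence described above can be replayed in a toy model. This is only a sketch, not Solr code: the class `DuplicateElectionSketch`, its fields, and the `/leader` path are hypothetical stand-ins for the ephemeral leader node, the CloudDescriptor 'isLeader' flag, and the update path. It assumes the winning election sets the local flag to true, which the description implies but does not state.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy model (not Solr code) of two elections running on the same node. */
final class DuplicateElectionSketch {

    /** Stand-in for ZooKeeper: ephemeral node path -> name of the node that owns it. */
    final ConcurrentMap<String, String> ephemeralNodes = new ConcurrentHashMap<>();

    /** Stand-in for the node-local 'isLeader' flag. */
    volatile boolean localIsLeader;

    final String nodeName;

    DuplicateElectionSketch(String nodeName) {
        this.nodeName = nodeName;
    }

    /** One election attempt; returns true if this attempt created the leader node. */
    boolean runElection(String leaderPath) {
        boolean created = ephemeralNodes.putIfAbsent(leaderPath, nodeName) == null;
        // The losing attempt clears the flag even though this same node owns the leader node.
        localIsLeader = created;
        return created;
    }

    /** Accepts an update only when the shared view and the local flag agree. */
    void handleUpdate(String leaderPath) {
        boolean sharedSaysLeader = nodeName.equals(ephemeralNodes.get(leaderPath));
        if (sharedSaysLeader != localIsLeader) {
            throw new IllegalStateException("shared state says leader=" + sharedSaysLeader
                    + " but local flag says leader=" + localIsLeader);
        }
    }

    public static void main(String[] args) {
        DuplicateElectionSketch node = new DuplicateElectionSketch("node1");
        node.runElection("/leader"); // first election: creates the leader node
        node.runElection("/leader"); // second election: creation fails, flag is cleared
        try {
            node.handleUpdate("/leader");
        } catch (IllegalStateException e) {
            System.out.println("update rejected: " + e.getMessage());
        }
    }
}
```

In this model the node still owns the leader entry after both attempts, so every later update fails until something resets the local flag.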

I think the fix is straightforward: the call to zkClient.getChildren() in waitForReplicasToComeUp should be made with 'retryOnReconnect=false', rather than 'true' as it is currently, because once the connection has dropped we're going to launch a new election process anyway.
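The effect of the flag can be sketched with a self-contained toy, assuming a client call that either retries through a connection loss or lets the error propagate. Nothing here is Solr or ZooKeeper API: `ChildLister`, `ConnectionLostException`, the three-argument `getChildren` helper, and the attempt limit are all hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (not Solr code) of how a retry flag decides whether the waiting election survives. */
final class ElectionWaitSketch {

    /** Hypothetical stand-in for a connection-loss error. */
    static final class ConnectionLostException extends Exception {
        ConnectionLostException() {
            super("connection lost");
        }
    }

    /** Hypothetical stand-in for a client call that lists child nodes. */
    interface ChildLister {
        List<String> list() throws ConnectionLostException;
    }

    /** With retry=true a connection loss is swallowed and the call repeated; with false it propagates. */
    static List<String> getChildren(ChildLister lister, boolean retry, int maxAttempts)
            throws ConnectionLostException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectionLostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lister.list();
            } catch (ConnectionLostException e) {
                if (!retry) {
                    throw e;
                }
                last = e;
            }
        }
        throw last;
    }

    /** Counts the elections still running after one simulated connection drop. */
    static int liveElectionsAfterReconnect(boolean retryOnReconnect) {
        AtomicInteger calls = new AtomicInteger();
        // Fails on the first call (the simulated drop) and succeeds afterwards.
        ChildLister flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                throw new ConnectionLostException();
            }
            return Arrays.asList("replica1", "replica2");
        };
        int live = 1; // the election launched by the reconnect handler
        try {
            getChildren(flaky, retryOnReconnect, 3);
            live++; // the original waiting election never noticed the drop and carries on
        } catch (ConnectionLostException e) {
            // the original election aborts, leaving only the new one
        }
        return live;
    }

    public static void main(String[] args) {
        System.out.println("retry=true  -> " + liveElectionsAfterReconnect(true));
        System.out.println("retry=false -> " + liveElectionsAfterReconnect(false));
    }
}
```

With the flag set to true the model ends with two live elections, and with false it ends with one, which is the outcome the proposed change aims for.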

Attachments

SOLR-6763.patch, 21/Nov/14 10:21, 3 kB, Alan Woodward
SOLR-6763.patch, 20/Nov/14 11:52, 2 kB, Alan Woodward

People

Assignee: Alan Woodward

Reporter: Alan Woodward

Votes: 0

Watchers: 8

Dates

Created: 19/Nov/14 16:50

Updated: 05/Mar/15 15:36

Resolved: 26/Feb/15 12:20