Description
We're seeing shard recovery errors in our cluster when a ZooKeeper "error event" happens. In this particular case, we had two replicas. The events from the logs look roughly like this:
18:40:36 follower (host2) disconnected from zk
18:40:38 original leader started recovery (there was no log about why it needed recovery though) and failed because cluster state still says it's the leader
18:40:39 follower successfully connected to zk after some trouble
19:03:35 follower register core/replica
19:16:36 follower registration fails due to no leader (NoNode for /collections/test-1/leaders/shard2)
Essentially, I think the problem is that the isLeader property in the cluster state is never cleaned up, so neither replica is able to recover/register in order to participate in leader election again.
From the code, it looks like the only place the isLeader property is cleared from the cluster state is ElectionContext.runLeaderProcess, which assumes that the replica with the next election seqId will notice the leader's node disappearing and run the leader process. That assumption fails in this scenario because the follower experienced the same ZooKeeper "error event" and never handled the event of the leader going away. (Mark, this is what I meant in SOLR-3582 when I said that maybe the watcher in LeaderElector.checkIfIamLeader does need to handle "Expired" by somehow realizing that the leader is gone and at least clearing the isLeader state; currently it ignores all EventType.None events.)
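To make the gap concrete, here is a simplified, self-contained sketch of the watcher behavior being discussed. These are stand-in types and hypothetical method names, not Solr's actual classes (real code would use org.apache.zookeeper.Watcher and WatchedEvent): session expiration arrives as an EventType.None event with KeeperState.Expired, so a watcher that returns early on all EventType.None events never gets a chance to clear the stale isLeader flag.

```java
// Simplified stand-ins for the ZooKeeper event model; illustrative only.
enum EventType { None, NodeDeleted }
enum KeeperState { SyncConnected, Disconnected, Expired }

class ClusterState {
    boolean isLeader = true; // stale leader flag left in cluster state
}

class ElectionWatcher {
    final ClusterState state;
    ElectionWatcher(ClusterState state) { this.state = state; }

    // Current behavior (as described above): every EventType.None event,
    // including session expiration, is ignored, so isLeader stays set.
    void processCurrent(EventType type, KeeperState ks) {
        if (type == EventType.None) return;
        if (type == EventType.NodeDeleted) state.isLeader = false;
    }

    // Suggested behavior: on Expired, assume the leader znode is gone
    // and clear the stale isLeader flag so a new election can proceed.
    void processSuggested(EventType type, KeeperState ks) {
        if (type == EventType.None) {
            if (ks == KeeperState.Expired) state.isLeader = false;
            return;
        }
        if (type == EventType.NodeDeleted) state.isLeader = false;
    }
}

public class WatcherSketch {
    public static void main(String[] args) {
        ClusterState s1 = new ClusterState();
        new ElectionWatcher(s1).processCurrent(EventType.None, KeeperState.Expired);
        System.out.println("current clears isLeader: " + !s1.isLeader);

        ClusterState s2 = new ClusterState();
        new ElectionWatcher(s2).processSuggested(EventType.None, KeeperState.Expired);
        System.out.println("suggested clears isLeader: " + !s2.isLeader);
    }
}
```

The point of the sketch is only that the Expired case has to be special-cased inside the EventType.None branch; how the real watcher would then trigger cleanup of the cluster-state flag is a separate question.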