[SOLR-5373] Can't become leader due infinite recovery loop - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Not A Problem
Affects Version/s: 4.2
Fix Version/s: 4.7
Component/s: None
Labels:
- Recovery
- SolrCloud
Environment:

SolrCloud, 2 nodes, Fedora

Description

We found an issue while performing stability tests on SolrCloud. Under certain circumstances, a node gets into an endless loop trying to recover. I've seen this happen in a two-node setup by following these steps:

1) Node A started
2) Node B started
3) Node B stopped
4) Node B started, and immediately Node A stopped (normal graceful shutdown).

At this point, node B throws connection-refused errors while trying to sync with node A. Sometimes (not always) this leads to a corrupt state in which node B loops forever trying to recover from node A (it still thinks the cluster has two nodes). I think the leader election process started just fine, but because recovery runs asynchronously, at some point node B published its state as recovery_failed, which caused leader election to fail.
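The suspected cycle can be illustrated with a toy model. This is only a sketch of the reasoning above, not Solr's actual code: the class `RecoveryLoopSketch`, its `State` enum, and the methods `shouldBeLeader`, `attemptRecovery`, and `simulate` are all hypothetical names, and the model assumes that recovery can only succeed when a leader is registered.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the suspected cycle; every name here is hypothetical, not Solr API. */
class RecoveryLoopSketch {
    enum State { ACTIVE, RECOVERING, RECOVERY_FAILED }

    /** Assumed election rule: a replica whose last published state is
     *  RECOVERY_FAILED declines to become leader. */
    static boolean shouldBeLeader(State lastPublished) {
        return lastPublished != State.RECOVERY_FAILED;
    }

    /** Assumed recovery rule: an attempt succeeds only if a leader is registered. */
    static State attemptRecovery(boolean leaderRegistered) {
        return leaderRegistered ? State.ACTIVE : State.RECOVERY_FAILED;
    }

    /** Runs up to maxRounds of "election, then recovery" and returns the event trace. */
    static List<String> simulate(State initial, int maxRounds) {
        List<String> trace = new ArrayList<>();
        State state = initial;
        final boolean leaderRegistered = false; // the old leader (node A) is gone
        for (int round = 0; round < maxRounds; round++) {
            if (shouldBeLeader(state)) {
                trace.add("became leader");
                return trace;
            }
            trace.add("declined leadership");
            state = attemptRecovery(leaderRegistered);
            trace.add("recovery -> " + state);
        }
        trace.add("still no leader after " + maxRounds + " rounds");
        return trace;
    }

    public static void main(String[] args) {
        // Starting from RECOVERY_FAILED, the only remaining node never takes over.
        simulate(State.RECOVERY_FAILED, 3).forEach(System.out::println);
    }
}
```

In this model, a node that starts from any state other than RECOVERY_FAILED becomes leader in the first round, which matches the observation that the problem does not always occur.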

ZooKeeper's /live_nodes has only one entry.

This shows up in the logs:
10:57:18,960 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Running the leader process.
10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Checking if I should try and be the leader.
10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) My last published State was recovery_failed, I won't be the leader.
10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) There may be a better leader candidate than us - going back into recovery
10:57:19,118 INFO INFO [DefaultSolrCoreState] (main-EventThread) Running recovery - first canceling any ongoing recovery
10:57:19,118 WARN WARN [RecoveryStrategy] (main-EventThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while trying to recover. core=myCollection:org.apache.solr.common.SolrException: No registered leader was found, collection:myCollection slice:shard1
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)

10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - trying again... (0) core=myCollection
10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - interrupted. core=myCollection
10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - I give up. core=myCollection
10:57:19,869 INFO INFO [ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed
10:57:19,869 INFO INFO [ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
10:57:19,902 WARN WARN [RecoveryStrategy] (RecoveryThread) Stopping recovery for zkNodeName=10.50.100.30:8998_solr_myCollectioncore=myCollection
10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection
10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Starting recovery process. core=myCollection recoveringAfterStartup=false
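
To check whether a node is stuck in this pattern, one option is to count how often a recovery_failed publish is followed by a fresh recovery start. The sketch below is a hypothetical helper, not part of Solr: the class and method names are made up, and the two marker strings are simply copied from the excerpt above, so they may differ in other Solr versions or log formats.

```java
import java.util.List;

/** Hypothetical log check; the marker strings are taken from the excerpt above. */
class RecoveryLoopLogCheck {
    static final String FAILED_MARKER = "state=recovery_failed";
    static final String RESTART_MARKER = "Starting recovery process";

    /** Counts how many times a recovery_failed publish is later followed
     *  by a new "Starting recovery process" line. */
    static int countFailedThenRestarted(List<String> lines) {
        int cycles = 0;
        boolean sawFailure = false;
        for (String line : lines) {
            if (line.contains(FAILED_MARKER)) {
                sawFailure = true;
            } else if (sawFailure && line.contains(RESTART_MARKER)) {
                cycles++;
                sawFailure = false; // wait for the next failure before counting again
            }
        }
        return cycles;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed",
            "[RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection",
            "[RecoveryStrategy] (RecoveryThread) Starting recovery process. core=myCollection recoveringAfterStartup=false");
        System.out.println(countFailedThenRestarted(sample));
    }
}
```

A count that keeps growing while /live_nodes lists a single node would be consistent with the loop described in this report.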

Solr Version: 4.2.1.2013.03.26.08.26.55

Other references to the same issue:

Attachments

- SOLR-5373.patch, 24/Oct/13 00:02, 3 kB, Mark Miller
- stack7, 21/Oct/13 17:08, 255 kB, Javier Mendez
- stack6, 21/Oct/13 17:08, 253 kB, Javier Mendez
- stack5, 21/Oct/13 17:08, 249 kB, Javier Mendez
- stack4, 21/Oct/13 17:08, 245 kB, Javier Mendez
- stack3, 21/Oct/13 17:08, 241 kB, Javier Mendez
- stack2, 21/Oct/13 17:08, 236 kB, Javier Mendez
- stack1, 21/Oct/13 17:08, 176 kB, Javier Mendez

People

Assignee: Mark Miller

Reporter: Javier Mendez

Votes: 2

Watchers: 7

Dates

Created: 21/Oct/13 17:04

Updated: 16/Mar/14 13:04

Resolved: 06/Feb/14 07:44