  1. Solr
  2. SOLR-5373

Can't become leader due to infinite recovery loop



    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 4.2
    • Fix Version/s: 4.7
    • None
    • Environment: SolrCloud, 2 nodes, Fedora


      We found an issue while performing stability tests on SolrCloud. Under certain circumstances, a node gets stuck in an endless loop trying to recover. I've seen this happen in a two-node setup by following these steps:

      1) Node A started
      2) Node B started
      3) Node B stopped
      4) Node B started, and immediately Node A stopped (normal graceful shutdown).

      At this point node B throws connection refused messages while trying to sync with node A. Sometimes (not always) this leads to a corrupt state in which node B enters an infinite loop trying to recover from node A (it still thinks the cluster has two nodes). I think the leader election process started just fine, but since recovery runs asynchronously, at some point node B published its state as recovery_failed, causing leader election to fail.
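To make the suspected cycle concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above and visible in the logs, not Solr's actual code: the `Replica` class and the functions `should_try_to_lead`, `recover`, and `simulate` are hypothetical names, and the only rules modelled are "a replica whose last published state is recovery_failed declines leadership" and "recovery fails when no leader is registered".

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for a core on node B."""
    name: str
    last_published_state: str = "active"


def should_try_to_lead(replica):
    # Models the log line: "My last published State was recovery_failed,
    # I won't be the leader."
    return replica.last_published_state != "recovery_failed"


def recover(replica, leader):
    # Models "No registered leader was found": without a leader there is
    # nothing to recover from, so the replica publishes recovery_failed again.
    if leader is None:
        replica.last_published_state = "recovery_failed"
        return False
    replica.last_published_state = "active"
    return True


def simulate(replica, max_rounds=5):
    """Alternate election and recovery until one succeeds or rounds run out."""
    leader = None
    events = []
    for _ in range(max_rounds):
        if should_try_to_lead(replica):
            leader = replica.name
            events.append("became leader")
            break
        events.append("declined leadership")
        if recover(replica, leader):
            events.append("recovered")
            break
        events.append("recovery failed")
    return leader, events
```

In this model a replica that enters the election with `recovery_failed` already published never gets out: each round it declines leadership, fails recovery because nobody leads, and republishes the same state. A replica with any other last published state becomes leader on the first round.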

      ZooKeeper's /live_nodes has only one entry.
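One way to spot the mismatch between the single live node and the two-node view of the cluster is to compare the children of /live_nodes with the replicas listed in the cluster state. The sketch below assumes the cluster state has already been fetched and parsed into a dict shaped like collection -> "shards" -> shard -> "replicas" -> replica -> {"node_name", "state"}; that layout, the function name `stale_replicas`, and the node names in the usage note are assumptions for illustration, not values taken from this cluster.

```python
def stale_replicas(cluster_state, live_nodes):
    """Return replicas whose node is missing from the live nodes list.

    cluster_state: parsed cluster state (assumed layout, see lead-in).
    live_nodes: names of the children of /live_nodes.
    """
    live = set(live_nodes)
    stale = []
    for collection, coll in cluster_state.items():
        for shard, shard_info in coll.get("shards", {}).items():
            for replica, info in shard_info.get("replicas", {}).items():
                if info.get("node_name") not in live:
                    stale.append(
                        (collection, shard, replica, info.get("state"))
                    )
    return stale
```

For example, with hypothetical node names `nodeA:8983_solr` and `nodeB:8983_solr` and only node B live, any replica still registered on node A is reported, along with the state it last published.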

      The logs show:
      0:57:18,960 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Running the leader process.
      10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) Checking if I should try and be the leader.
      10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) My last published State was recovery_failed, I won't be the leader.
      10:57:19,068 INFO INFO [ShardLeaderElectionContext] (main-EventThread) There may be a better leader candidate than us - going back into recovery
      10:57:19,118 INFO INFO [DefaultSolrCoreState] (main-EventThread) Running recovery - first canceling any ongoing recovery
      10:57:19,118 WARN WARN [RecoveryStrategy] (main-EventThread) Stopping recovery for zkNodeName=
      10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Error while trying to recover. core=myCollection:org.apache.solr.common.SolrException: No registered leader was found, collection:myCollection slice:shard1
      at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:484)
      at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:467)
      at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:321)
      at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:223)

      10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - trying again... (0) core=myCollection
      10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - interrupted. core=myCollection
      10:57:19,869 ERROR ERROR [RecoveryStrategy] (RecoveryThread) Recovery failed - I give up. core=myCollection
      10:57:19,869 INFO INFO [ZkController] (RecoveryThread) publishing core=myCollection state=recovery_failed
      10:57:19,869 INFO INFO [ZkController] (RecoveryThread) numShards not found on descriptor - reading it from system property
      10:57:19,902 WARN WARN [RecoveryStrategy] (RecoveryThread) Stopping recovery for zkNodeName=
      10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Finished recovery process. core=myCollection
      10:57:19,902 INFO INFO [RecoveryStrategy] (RecoveryThread) Starting recovery process. core=myCollection recoveringAfterStartup=false
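To check how often a node goes around this loop, the log can be scanned for the two messages that mark one cycle. This is a minimal sketch that relies only on the message texts quoted above; the function name `summarize` is hypothetical, and the log format on other setups may differ.

```python
import re
from collections import Counter

# Message emitted when a core publishes recovery_failed (text from the log above).
PUBLISH_FAILED = re.compile(r"publishing core=(\S+) state=recovery_failed")
# Message emitted when the election is declined because of that state.
DECLINED = "My last published State was recovery_failed"


def summarize(lines):
    """Count recovery_failed publishes per core and declined elections."""
    failed_by_core = Counter()
    declined = 0
    for line in lines:
        match = PUBLISH_FAILED.search(line)
        if match:
            failed_by_core[match.group(1)] += 1
        if DECLINED in line:
            declined += 1
    return failed_by_core, declined
```

Passing an open log file (for example `summarize(open(path))`, with `path` pointing at the server log) returns the counts. If both numbers keep growing while /live_nodes holds a single entry, the node is most likely stuck in the cycle described in this report.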

      Solr Version:

      Other references to the same issue:


        1. SOLR-5373.patch
          3 kB
          Mark Miller
        2. stack1
          176 kB
          Javier Mendez
        3. stack2
          236 kB
          Javier Mendez
        4. stack3
          241 kB
          Javier Mendez
        5. stack4
          245 kB
          Javier Mendez
        6. stack5
          249 kB
          Javier Mendez
        7. stack6
          253 kB
          Javier Mendez
        8. stack7
          255 kB
          Javier Mendez



            Assignee: Mark Miller (markrmiller@gmail.com)
            Reporter: Javier Mendez (jmendez)