Solr / SOLR-7021

Leader will not publish core as active without recovering first, but never recovers

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 4.10
    • Fix Version/s: None
    • Component/s: SolrCloud

      Description

      A little background: we run a single-core SolrCloud cluster across 3 nodes. Each node hosts its own shard, and each shard has a single replica, so every replica is also the leader of its shard.

      For reasons we won't get into, a shard in our cluster went down. We restarted the cluster, but the core and its shards still did not come back up. When we inspected the logs, we found this:

      2015-01-21 15:51:56,494 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - We are http://xxx.xxx.xxx.35:8081/solr/xyzcore/ and leader is http://xxx.xxx.xxx.35:8081/solr/xyzcore/
      2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - No LogReplay needed for core=xyzcore baseURL=http://xxx.xxx.xxx.35:8081/solr
      2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - I am the leader, no recovery necessary
      2015-01-21 15:51:56,496 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - publishing core=xyzcore state=active collection=xyzcore
      2015-01-21 15:51:56,497 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - numShards not found on descriptor - reading it from system property
      2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - publishing core=xyzcore state=down collection=xyzcore
      2015-01-21 15:51:56,498 [coreZkRegister-1-thread-2] INFO  cloud.ZkController  - numShards not found on descriptor - reading it from system property
      2015-01-21 15:51:56,501 [coreZkRegister-1-thread-2] ERROR core.ZkContainer  - :org.apache.solr.common.SolrException: Cannot publish state of core 'xyzcore' as active without recovering first!
      	at org.apache.solr.cloud.ZkController.publish(ZkController.java:1075)
      

      From this point on, the affected shards never recover, so the core never returns to a functional state.

            People

            • Assignee:
              Unassigned
              Reporter:
              hardwickj James Hardwick
            • Votes:
              2
              Watchers:
              10