SOLR-5579

Leader stops processing collection-work-queue after failed collection reload

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.5.1
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels:
    • Environment:

      Debian Linux 6.0 running on VMware
      Using the embedded Solr Jetty.

      Description

      I have run into the same problem several times now. The leader registered in /overseer_elect/leader stops processing the collection queue at /overseer/collection-queue-work. The queue then builds up until it triggers an alert in my monitoring tool.
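
The "queue builds up" symptom can be turned into a concrete check. The sketch below is a hypothetical monitor-side helper, not part of Solr. It assumes that something else has already sampled the child names of /overseer/collection-queue-work at regular intervals, and that those names are zero-padded sequential znode names (the `qn-` prefix in the comments is an assumption), so that the lexicographically smallest name is the oldest entry. It reports a stall when the oldest entry never changes across the samples.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper for a monitoring script; the class and method names are
// illustrative and do not exist in Solr or ZooKeeper.
final class QueueStallCheck {

    /** Oldest entry of one snapshot, or null if the queue was empty. */
    static String head(List<String> children) {
        // Assumes zero-padded sequential names (e.g. "qn-0000000001"),
        // so string order matches creation order.
        return children.isEmpty() ? null : Collections.min(children);
    }

    /**
     * Returns true when at least two snapshots were taken, none is empty,
     * and every snapshot has the same oldest entry, i.e. nothing was
     * consumed from the front of the queue between samples.
     */
    static boolean isStalled(List<List<String>> snapshots) {
        if (snapshots.size() < 2) {
            return false;
        }
        String first = head(snapshots.get(0));
        if (first == null) {
            return false;
        }
        for (List<String> snapshot : snapshots) {
            if (!first.equals(head(snapshot))) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing the head entry rather than only the queue length avoids a false alarm when the queue is long but still being drained.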

      I haven't been able to pinpoint why the leader stops. My usual workaround is to kill the leader node to trigger a leader election, after which the newly elected node picks up the queue. This is where the problems start.

      When the new leader works through the queue and picks up a reload for a shard that has no active leader, queue processing stops. The log keeps repeating that no registered leader was found for the shard, but a new shard leader is never elected:

      ERROR - 2013-12-24 14:43:40.390; org.apache.solr.common.SolrException; Error while trying to recover. core=magento_349_shard1_replica1:org.apache.solr.common.SolrException: No registered leader was found, collection:magento_349 slice:shard1
          at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:482)
          at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:465)
          at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:317)
          at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)

      ERROR - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (7) core=magento_349_shard1_replica1
      INFO - 2013-12-24 14:43:40.391; org.apache.solr.cloud.RecoveryStrategy; Wait 256.0 seconds before trying to recover again (8)
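
The retry counter and wait time in the last two log lines fit a delay that doubles on every attempt, since 2^8 = 256. The following is a minimal sketch of that pattern, not Solr's actual RecoveryStrategy code. The class name, the method name, and the cap parameter are hypothetical; the log does not show whether a cap exists or what it is.

```java
// Illustrative only: models a wait time that doubles per retry attempt.
final class RecoveryBackoff {

    /**
     * Delay in seconds before the given retry attempt: 2^attempt,
     * limited to capSeconds (a hypothetical upper bound).
     */
    static double delaySeconds(int attempt, double capSeconds) {
        return Math.min(Math.pow(2, attempt), capSeconds);
    }
}
```

Under such a schedule each failed attempt pushes the next one further out, so a core that cannot find a shard leader is retried less and less often while the underlying condition stays unresolved.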

      Is shard leader election connected in some way to the collection queue? If so, could this be a deadlock, in which no leader is elected until the reload completes while the reload itself waits for a leader?

            People

            • Assignee: Mark Miller (markrmiller@gmail.com)
            • Reporter: Eric Bus (eric.bus)
            • Votes: 0
            • Watchers: 6
