Solr / SOLR-8868

SolrCloud: if ZooKeeper loses and then regains a quorum, Solr nodes and the SolrJ client do not recover and need to be restarted

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.3.1
    • Fix Version/s: None
    • Component/s: SolrCloud, SolrJ
    • Labels:
      None

      Description

      I tried the mailing list on 3/15 and 3/16 to no avail. Hopefully I have given enough details here.


      I am wondering whether the SolrCloud behavior I observe after ZooKeeper loses a quorum is normal or to be expected.

      Version of Solr: 5.3.1
      Version of ZooKeeper: 3.4.7
      Using SolrCloud with external ZooKeeper
      Deployed on AWS

      Our Solr cluster has 3 nodes (m3.large)

      Our ZooKeeper ensemble consists of three nodes (t2.small), all with the same config, which refers to the servers by DNS name, e.g.:

      $ more ../conf/zoo.cfg
      tickTime=2000
      dataDir=/var/zookeeper
      dataLogDir=/var/log/zookeeper
      clientPort=2181
      initLimit=10
      syncLimit=5
      standaloneEnabled=false
      server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
      server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
      server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888
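For reference, clients address an ensemble like this one through a comma-separated `host:port` connect string that uses `clientPort` (2181 here); ports 2888 and 3888 are only for traffic between the ZooKeeper servers. A minimal sketch, assuming the SolrJ client and the Solr nodes are pointed at the same three DNS names (the class name `ZkConnectString` is purely illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ZkConnectString {
    /** Joins host names into a ZooKeeper connect string: host:port,host:port,... */
    static String build(List<String> hosts, int clientPort) {
        return hosts.stream()
                .map(host -> host + ":" + clientPort)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Host names and clientPort are taken from the zoo.cfg shown above.
        List<String> hosts = Arrays.asList(
                "zookeeper1.qa.eu-west-1.mysearch.com",
                "zookeeper2.qa.eu-west-1.mysearch.com",
                "zookeeper3.qa.eu-west-1.mysearch.com");
        System.out.println(build(hosts, 2181));
    }
}
```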
      

      If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a quorum is maintained.
      Operation continues OK; we detect the terminated instance and launch a new ZK node, which comes up fine.
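The quorum arithmetic behind this can be sketched as follows, assuming ZooKeeper's default majority quorum (the class and method names are illustrative only). With three servers, two must be alive, so one termination is tolerated and two are not:

```java
final class QuorumCheck {
    /** Smallest number of live servers that forms a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the live servers can still form a majority quorum. */
    static boolean hasQuorum(int ensembleSize, int alive) {
        return alive >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int ensemble = 3; // three server.N entries in zoo.cfg
        for (int terminated = 0; terminated <= ensemble; terminated++) {
            int alive = ensemble - terminated;
            System.out.println(terminated + " terminated, " + alive + " alive: quorum "
                    + (hasQuorum(ensemble, alive) ? "held" : "lost"));
        }
    }
}
```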

      If we terminate two of the ZK nodes we lose the quorum, and then we observe the following:

      1.1) Admin UI shows an error that it is unable to contact ZooKeeper: “Could not connect to ZooKeeper”

      1.2) SolrJ returns the following:

      org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
      at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
      at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
      at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
      at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
      at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
      at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
      at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
      at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
      at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
      ... 24 more
      

      This makes sense based on our understanding.
      When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix the DNS, etc., we regain a quorum, but at this point:

      2.1) Admin UI shows the shards as “GONE” (all greyed out)

      2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now bound to new IP addresses
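One thing that may be worth ruling out here (an assumption, not something confirmed in this report) is that the client JVM is still holding the old IP addresses. The sketch below only shows what the JVM itself resolves for each ZooKeeper name and caps the JVM-level DNS cache via the standard `networkaddress.cache.ttl` security property; it says nothing about addresses the ZooKeeper client library may have resolved once at connect time. The class name `ZkDnsCheck` is illustrative.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;
import java.util.ArrayList;
import java.util.List;

final class ZkDnsCheck {
    /** Returns the IP addresses the JVM currently resolves for a host name. */
    static List<String> resolve(String host) throws UnknownHostException {
        List<String> addresses = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            addresses.add(address.getHostAddress());
        }
        return addresses;
    }

    public static void main(String[] args) {
        // Assumption: cap the JVM's positive DNS cache at 30 seconds.
        // This has to run before the first lookup in the JVM to take effect.
        Security.setProperty("networkaddress.cache.ttl", "30");

        // Defaults are the host names from the zoo.cfg above; pass others as arguments.
        String[] hosts = args.length > 0 ? args : new String[] {
                "zookeeper1.qa.eu-west-1.mysearch.com",
                "zookeeper2.qa.eu-west-1.mysearch.com",
                "zookeeper3.qa.eu-west-1.mysearch.com"};
        for (String host : hosts) {
            try {
                System.out.println(host + " -> " + resolve(host));
            } catch (UnknownHostException e) {
                System.out.println(host + " -> unresolved (" + e.getMessage() + ")");
            }
        }
    }
}
```

Comparing this output with the new instances' addresses would show whether the stale state is at the JVM level or further up in the client.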

      So I restart the Solr nodes. After the restart:

      3.1) Admin UI shows the collections as OK (all shards are green) – yeah the nodes are back!

      3.2) SolrJ Client still shows the same error – namely

      org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
      at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
      at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
      at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
      at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
      at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
      at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
      at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
      at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
      .
      .
      Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
      at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
      at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
      at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
      at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
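As a possible stop-gap on the client side, the application could discard and rebuild its client object after a failure instead of restarting the whole process. The sketch below is deliberately generic and uses only the JDK: `RebuildingClient` and `Op` are hypothetical names, and the factory is where an application would construct its real client (for SolrJ 5.x that might be `new CloudSolrClient(zkHost)`, an assumption left in a comment so the sketch stays dependency-free). Whether a rebuilt client actually reconnects in this scenario has not been verified.

```java
import java.util.function.Supplier;

final class RebuildingClient<C extends AutoCloseable> {
    /** An operation against the client that may fail, e.g. an add or deleteById. */
    interface Op<C, R> {
        R apply(C client) throws Exception;
    }

    private final Supplier<C> factory; // e.g. () -> new CloudSolrClient(zkHost) (assumption)
    private C client;

    RebuildingClient(Supplier<C> factory) {
        this.factory = factory;
        this.client = factory.get();
    }

    /** Runs the operation; on any failure, rebuilds the client once and retries once. */
    synchronized <R> R call(Op<C, R> op) throws Exception {
        try {
            return op.apply(client);
        } catch (Exception firstFailure) {
            try {
                client.close(); // best effort; the old client may already be unusable
            } catch (Exception ignored) {
                // nothing useful to do if closing the broken client fails
            }
            client = factory.get();
            return op.apply(client); // a second failure propagates to the caller
        }
    }
}
```

A real version would probably rebuild only for connection-related exceptions and add a back-off, rather than retrying on every failure as this sketch does.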
      

      Is this lack of self-healing known and expected behavior?

      If it is expected, should this be recast as an Improvement request?

      Is this the same or similar behavior as documented in SOLR-5129 (https://issues.apache.org/jira/browse/SOLR-5129)?

      P.S. I can add Solr log files if they will help.

              People

              • Assignee:
                Unassigned
                Reporter:
                kellyfj@gmail.com Frank J Kelly
              • Votes:
                9
                Watchers:
                17