Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-6095

SolrCloud cluster can end up without an overseer with overseer roles

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.8
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      We have a large cluster running on ec2 which occasionally ends up without an overseer after a rolling restart. We always restart our overseer nodes at the very last otherwise we end up with a large number of shards that can't recover properly.

      This cluster is running a custom branch forked from 4.8 and has SOLR-5473, SOLR-5495 and SOLR-5468 applied. We have a large number of small collections (120 collections each with approx 5M docs) on 16 Solr nodes. We are also using the overseer roles feature to designate two specified nodes as overseers. However, I think the problem that we're seeing is not specific to the overseer roles feature.

      As soon as the overseer was shutdown, we saw the following on the node which was next in line to become the overseer:

      2014-05-20 09:55:39,261 [main-EventThread] INFO  solr.cloud.ElectionContext  - I am going to be the leader ec2-xxxxxxxxxx.compute-1.amazonaws.com:8987_solr
      2014-05-20 09:55:39,265 [main-EventThread] WARN  solr.cloud.LeaderElector  - 
      org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /overseer_elect/leader
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:119)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
      	at org.apache.solr.common.cloud.SolrZkClient$10.execute(SolrZkClient.java:432)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
      	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:429)
      	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:386)
      	at org.apache.solr.common.cloud.SolrZkClient.makePath(SolrZkClient.java:373)
      	at org.apache.solr.cloud.OverseerElectionContext.runLeaderProcess(ElectionContext.java:551)
      	at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:142)
      	at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:110)
      	at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
      	at org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:303)
      	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
      	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
      

      When the overseer leader node is gracefully shutdown, we get the following in the logs:

      2014-05-20 09:55:39,254 [Thread-63] ERROR solr.cloud.Overseer  - Exception in Overseer main queue loop
      org.apache.solr.common.SolrException: Could not load collection from ZK:sm12
      	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
      	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
      	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Object.wait(Object.java:503)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
      	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226)
      	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
      	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223)
      	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:767)
      	... 4 more
      2014-05-20 09:55:39,254 [Thread-63] INFO  solr.cloud.Overseer  - Overseer Loop exiting : ec2-xxxxxxxxxx.compute-1.amazonaws.com:8986_solr
      2014-05-20 09:55:39,256 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper watch triggered, but Solr cannot talk to ZK
      2014-05-20 09:55:39,259 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped o.e.j.w.WebAppContext{/solr,file:/vol0/cloud86/solr-webapp/webapp/},/vol0/cloud86/webapps/solr.war
      

      Notice how the overseer kept on running almost till the last point i.e. until the jetty context stopped. On some runs, we got the following on the overseer leader node on graceful shutdown:

      2014-05-19 21:33:43,657 [Thread-75] ERROR solr.cloud.Overseer  - Exception in Overseer main queue loop
      org.apache.solr.common.SolrException: Could not load collection from ZK:sm71
      	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:778)
      	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:553)
      	at org.apache.solr.common.cloud.ZkStateReader.updateClusterState(ZkStateReader.java:246)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:237)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Object.wait(Object.java:503)
      	at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1342)
      	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1153)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
      	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
      	at org.apache.solr.common.cloud.ZkStateReader.getExternCollectionFresh(ZkStateReader.java:769)
      	... 4 more
      2014-05-19 21:33:43,662 [main-EventThread] WARN  common.cloud.ZkStateReader  - ZooKeeper watch triggered, but Solr cannot talk to ZK
      2014-05-19 21:33:43,663 [Thread-75] INFO  solr.cloud.Overseer  - Overseer Loop exiting : ec2-xxxxxxxxxxxx.compute-1.amazonaws.com:8987_solr
      2014-05-19 21:33:43,664 [OverseerExitThread] ERROR solr.cloud.Overseer  - could not read the data
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:277)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:274)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
      	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:274)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:329)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:85)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater$1.run(Overseer.java:293)
      2014-05-19 21:33:43,665 [ShutdownMonitor] INFO  server.handler.ContextHandler  - stopped o.e.j.w.WebAppContext{/solr,file:/vol0/cloud87/solr-webapp/webapp/},/vol0/cloud87/webapps/solr.war
      

      Again the overseer was clinging on till the last moment and by the time it exited the ZK session expired and it couldn't delete the /overseer_elect/leader node. The exception on the next-in-line node was the same i.e. NodeExists for /overseer_elect/leader.

      In both cases, we are left with no overseers after restart. I can easily reproduce this problem by just restarting overseer leader nodes repeatedly.

        Attachments

        1. SOLR-6095.patch
          36 kB
          Noble Paul
        2. SOLR-6095.patch
          36 kB
          Noble Paul
        3. SOLR-6095.patch
          35 kB
          Noble Paul
        4. SOLR-6095.patch
          7 kB
          Shalin Shekhar Mangar

          Issue Links

            Activity

              People

              • Assignee:
                noble.paul Noble Paul
                Reporter:
                shalinmangar Shalin Shekhar Mangar
              • Votes:
                0 Vote for this issue
                Watchers:
                9 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: