Description
I had a solr collection that basically was out of memory (no exception, just continuous 80-90 second full GCs). This of course is not a good state, but when in this state ever time you come out of the GC your zookeeper session has expired causing all kinds of havoc. Anyway I found bug in the condition where during LeaderElector.setup() if you get a Zookeeper error the LeaderElector.context get set to a context that is not fully initialized (ie hasn't called joinElection..
Anyway once this happens the node can no longer attempt to join elections because every time the LeaderElector attempts to call cancelElection() on the previous ElectionContext..
Some logs below and I've attached a patch that does 2 things:
- Move the setting of LeaderElector.context in the setup call to then of the call so it is only set if the setup completes.
- Added a check to see if leaderSeqPath is null in ElectionContext.cancelElection
- Made leaderSeqPath volatile as it is being directly accessed by multiple threads.
- set LeaderElector.context = null when joinElection fails
There may be other issues.. the patch is focused on breaking the failure loop that occurs when initialization of the ElectionContext fails.
2014-06-08 23:14:57.805 INFO ClientCnxn [main-SendThread(host1:1234)] - Opening socket connection to server host1/10.122.142.31:1234. Will not attempt to authenticate using SASL (unknown error) 2014-06-08 23:14:57.806 INFO ClientCnxn [main-SendThread(host1:1234)] - Socket connection established to host1/10.122.142.31:1234, initiating session 2014-06-08 23:14:57.810 INFO ClientCnxn [main-SendThread(host1:1234)] - Unable to reconnect to ZooKeeper service, session 0x2467d956c8d0446 has expired, closing socket connection 2014-06-08 23:14:57.816 INFO ConnectionManager [main-EventThread] - Watcher org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 name:ZooKeeperConnection Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX got event WatchedEvent state:Expired type:None path:null path:null type:None 2014-06-08 23:14:57.817 INFO ConnectionManager [main-EventThread] - Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper... 2014-06-08 23:14:57.817 INFO DefaultConnectionStrategy [main-EventThread] - Connection expired - starting a new one... 2014-06-08 23:14:57.817 INFO ZooKeeper [main-EventThread] - Initiating client connection, connectString=host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX sessionTimeout=15000 watcher=org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 2014-06-08 23:14:57.857 INFO ConnectionManager [main-EventThread] - Waiting for client to connect to ZooKeeper 2014-06-08 23:14:57.859 INFO ClientCnxn [main-SendThread(host4:1234)] - Opening socket connection to server host4/172.17.14.107:1234. Will not attempt to authenticate using SASL (unknown error) 2014-06-08 23:14:57.891 INFO ClientCnxn [main-SendThread(host4:1234)] - Socket connection established to host4/172.17.14.107:1234, initiating session 2014-06-08 23:14:57.906 INFO ClientCnxn [main-SendThread(host4:1234)] - Session establishment complete on server host4/172.17.14.107:1234, sessionid = 0x4467d8d79260486, negotiated timeout = 15000 2014-06-08 23:14:57.907 INFO ConnectionManager [main-EventThread] - Watcher org.apache.solr.common.cloud.ConnectionManager@7fe8e0f1 name:ZooKeeperConnection Watcher:host4:1234,host1:1234,host3:1234,host2:1234/engines/solr/collections/XXXXXX got event WatchedEvent state:SyncConnected type:None path:null path:null type:None 2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] - Client is connected to ZooKeeper 2014-06-08 23:14:57.909 INFO ConnectionManager [main-EventThread] - Connection with ZooKeeper reestablished. 2014-06-08 23:14:57.911 ERROR ZkController [Thread-203] - :org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/election at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223) at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100) at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94) at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2014-06-08 23:14:57.911 WARN ConnectionManager [Thread-203] - Exception running onReconnect command org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/election at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:226) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:223) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:223) at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:100) at org.apache.solr.common.cloud.ZkCmdExecutor.ensureExists(ZkCmdExecutor.java:94) at org.apache.solr.cloud.LeaderElector.setup(LeaderElector.java:324) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:235) ... 1 more 2014-06-08 23:14:57.917 INFO DefaultConnectionStrategy [main-EventThread] - Reconnected to ZooKeeper 2014-06-08 23:14:57.917 INFO ConnectionManager [main-EventThread] - Connected:true 2014-06-08 23:14:57.917 INFO ClientCnxn [main-EventThread] - EventThread shut down 2014-06-08 23:14:57.917 INFO ZkController [Thread-205] - publishing core=XXXXXX state=down collection=XXXXXX 2014-06-08 23:14:57.920 INFO ZkController [Thread-205] - numShards not found on descriptor - reading it from system property 2014-06-08 23:14:58.033 INFO ElectionContext [Thread-205] - canceling election null 2014-06-08 23:14:58.083 ERROR ZkController [Thread-205] - :java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851) at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177) at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174) at org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72) at org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) 2014-06-08 23:14:58.083 WARN ConnectionManager [Thread-205] - Exception running onReconnect command org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.cloud.ZkController$1.command(ZkController.java:267) at org.apache.solr.common.cloud.ConnectionManager$1$1.run(ConnectionManager.java:166) Caused by: java.lang.IllegalArgumentException: Path cannot be null at org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:45) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:851) at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:177) at org.apache.solr.common.cloud.SolrZkClient$2.execute(SolrZkClient.java:174) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73) at org.apache.solr.common.cloud.SolrZkClient.delete(SolrZkClient.java:174) at org.apache.solr.cloud.ElectionContext.cancelElection(ElectionContext.java:72) at org.apache.solr.cloud.OverseerElectionContext.cancelElection(ElectionContext.java:447) at org.apache.solr.cloud.ZkController$1.command(ZkController.java:232) ... 1 more
Attachments
Attachments
Issue Links
- is related to
-
SOLR-6181 SliceStateUpdateTest.testSliceStateUpdate FAILS
- Resolved