Description
If a collection gets reloaded, then a core's isReloaded flag is always true. If a core experiences a ZK session expiration after a reload, then it won't ever be able to set itself to active because of the check in ZkController#register:
UpdateLog ulog = core.getUpdateHandler().getUpdateLog(); if (!core.isReloaded() && ulog != null) { // disable recovery in case shard is in construction state (for shard splits) Slice slice = getClusterState().getSlice(collection, shardId); if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) { Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog(); if (recoveryFuture != null) { log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while."); recoveryFuture.get(); // NOTE: this could potentially block for // minutes or more! // TODO: public as recovering in the mean time? // TODO: in the future we could do peersync in parallel with recoverFromLog } else { log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl); } } boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc, collection, coreZkNodeName, shardId, leaderProps, core, cc); if (!didRecovery) { publish(desc, ZkStateReader.ACTIVE); } }
I can easily simulate this on trunk by doing:
bin/solr -c -z localhost:2181
bin/solr create -c foo
bin/post -c foo example/exampledocs/*.xml
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
Where <PID> is the process ID of the Solr node. Here are the logs after the CONT command. As you can see below, the core never gets to setting itself as active again. I think the bug is that the isReloaded flag needs to get set back to false once the reload is successful, but I don't understand what this flag is needed for anyway???
INFO - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None INFO - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None INFO - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper... INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 192.168.1.2:8983_solr INFO - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a leader. INFO - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo INFO - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one... INFO - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1 INFO - 2015-04-01 17:28:51.110; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772) at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150) at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.InterruptedException: sleep interrupted at java.lang.Thread.sleep(Native Method) at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76) at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308) at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765) ... 3 more WARN - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; listener throws error org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/foo/params.json at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160) at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907) at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503) at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351) Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/foo/params.json at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294) at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291) at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150) ... 3 more ERROR - 2015-04-01 17:28:51.110; org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61) at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342) at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298) at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83) at org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259) INFO - 2015-04-01 17:28:51.114; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None INFO - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper INFO - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper reestablished. INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; ZooKeeper session re-connected ... refreshing core states after session expiration. INFO - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; publishing core=foo_shard1_replica1 state=down collection=foo INFO - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; canceling election /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 WARN - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000 INFO - 2015-04-01 17:28:51.121; org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader 192.168.1.2:8983_solr INFO - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; makePath: /overseer_elect/leader INFO - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting INFO - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 autoReplicaFailoverWaitAfterExpiration=30000 autoReplicaFailoverBadNodeExpiration=60000 INFO - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of collection creations INFO - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; Updating cluster state from ZooKeeper... INFO - 2015-04-01 17:28:51.129; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the main queue INFO - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; addZkWatch foo INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating collection state at /collections/foo/state.json from ZooKeeper... INFO - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 4 INFO - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr INFO - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; makePath: /live_nodes/192.168.1.2:8983_solr INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr collection:foo shard:shard1 INFO - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 4 INFO - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; canceling election /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 WARN - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000 INFO - 2015-04-01 17:28:51.134; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = { "core":"foo_shard1_replica1", "core_node_name":"core_node1", "roles":null, "base_url":"http://192.168.1.2:8983/solr", "node_name":"192.168.1.2:8983_solr", "numShards":"1", "state":"down", "shard":"shard1", "collection":"foo", "operation":"state"} current state version: 1 INFO - 2015-04-01 17:28:51.135; org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1 message={ "core":"foo_shard1_replica1", "core_node_name":"core_node1", "roles":null, "base_url":"http://192.168.1.2:8983/solr", "node_name":"192.168.1.2:8983_solr", "numShards":"1", "state":"down", "shard":"shard1", "collection":"foo", "operation":"state"} INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard1 INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 4 INFO - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The last leader attempt started 198060ms ago. INFO - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1) INFO - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 5 INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue. INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/ INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me INFO - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas INFO - 2015-04-01 17:28:51.139; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1 INFO - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/foo/leaders/shard1 INFO - 2015-04-01 17:28:51.139; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = { "operation":"leader", "shard":"shard1", "collection":"foo"} current state version: 1 INFO - 2015-04-01 17:28:51.140; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 5 INFO - 2015-04-01 17:28:51.141; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1) INFO - 2015-04-01 17:28:51.142; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected INFO - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 6 INFO - 2015-04-01 17:28:51.144; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = { "operation":"leader", "shard":"shard1", "collection":"foo", "base_url":"http://192.168.1.2:8983/solr", "core":"foo_shard1_replica1", "state":"active"} current state version: 1 INFO - 2015-04-01 17:28:51.145; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 6 INFO - 2015-04-01 17:28:51.145; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1) INFO - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 7 INFO - 2015-04-01 17:28:51.147; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is http://192.168.1.2:8983/solr/foo_shard1_replica1/ INFO - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; In register, core.isReloaded? true INFO - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 7 INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch zkdir /configs/foo INFO - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema INFO - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1 INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Current schema version 5 is already the latest INFO - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema INFO - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current version of requestparams : 0 INFO - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; Current schema version 5 is already the latest INFO - 2015-04-01 17:28:51.201; org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to ZooKeeper INFO - 2015-04-01 17:28:51.202; org.apache.solr.common.cloud.ConnectionManager; Connected:true INFO - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo INFO - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo INFO - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1 INFO - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current version of requestparams : 0 INFO - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 INFO - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} status=0 QTime=11
Attachments
Attachments
Issue Links
- relates to
-
SOLR-6583 Resuming connection with ZooKeeper causes log replay
- Closed