Solr / SOLR-7338

A reloaded core will never register itself as active after a ZK session expiration

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      If a collection gets reloaded, then a core's isReloaded flag is always true. If a core experiences a ZK session expiration after a reload, then it won't ever be able to set itself to active because of the check in ZkController#register:

              UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
              if (!core.isReloaded() && ulog != null) {
                // disable recovery in case shard is in construction state (for shard splits)
                Slice slice = getClusterState().getSlice(collection, shardId);
                if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
                  Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
                  if (recoveryFuture != null) {
                    log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
                    recoveryFuture.get(); // NOTE: this could potentially block for
                    // minutes or more!
                    // TODO: public as recovering in the mean time?
                    // TODO: in the future we could do peersync in parallel with recoverFromLog
                  } else {
                    log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
                  }
                }
                boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
                    collection, coreZkNodeName, shardId, leaderProps, core, cc);
                if (!didRecovery) {
                  publish(desc, ZkStateReader.ACTIVE);
                }
              }
      

      I can easily simulate this on trunk by doing:

      bin/solr -c -z localhost:2181
      bin/solr create -c foo
      bin/post -c foo example/exampledocs/*.xml
      curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
      kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
      

      Where <PID> is the process ID of the Solr node and <PAUSE_SECONDS> is long enough to exceed the ZooKeeper session timeout. Here are the logs after the CONT command. As you can see below, the core never gets around to setting itself as active again. I think the bug is that the isReloaded flag needs to get set back to false once the reload is successful, but I don't understand what this flag is needed for anyway???

      INFO  - 2015-04-01 17:28:50.962; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Disconnected type:None path:null path:null type:None
      INFO  - 2015-04-01 17:28:50.963; org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
      INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:Expired type:None path:null path:null type:None
      INFO  - 2015-04-01 17:28:51.107; org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
      INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing
      INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo
      INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 192.168.1.2:8983_solr
      INFO  - 2015-04-01 17:28:51.109; org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I (id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a leader.
      INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo
      INFO  - 2015-04-01 17:28:51.109; org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - starting a new one...
      INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1
      INFO  - 2015-04-01 17:28:51.110; org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect to ZooKeeper
      ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties
      	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
      	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
      	at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.InterruptedException: sleep interrupted
      	at java.lang.Thread.sleep(Native Method)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
      	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
      	at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
      	... 3 more
      
      WARN  - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; listener throws error
      org.apache.solr.common.SolrException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/foo/params.json
      	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
      	at org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
      	at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
      	at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
      Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /configs/foo/params.json
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
      	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
      	at org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
      	at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
      	at org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
      	... 3 more
      ERROR - 2015-04-01 17:28:51.110; org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data
      org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer_elect/leader
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
      	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
      	at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
      	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
      	at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
      	at org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
      INFO  - 2015-04-01 17:28:51.114; org.apache.solr.common.cloud.ConnectionManager; Watcher org.apache.solr.common.cloud.ConnectionManager@5519dba0 name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent state:SyncConnected type:None path:null path:null type:None
      INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper
      INFO  - 2015-04-01 17:28:51.115; org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper reestablished.
      INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; ZooKeeper session re-connected ... refreshing core states after session expiration.
      INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; publishing core=foo_shard1_replica1 state=down collection=foo
      INFO  - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; canceling election /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
      WARN  - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove /overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
      INFO  - 2015-04-01 17:28:51.121; org.apache.solr.cloud.OverseerElectionContext; I am going to be the leader 192.168.1.2:8983_solr
      INFO  - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; makePath: /overseer_elect/leader
      INFO  - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer (id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting
      INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 autoReplicaFailoverWaitAfterExpiration=30000 autoReplicaFailoverBadNodeExpiration=60000
      INFO  - 2015-04-01 17:28:51.128; org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of collection creations
      INFO  - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; Updating cluster state from ZooKeeper... 
      INFO  - 2015-04-01 17:28:51.129; org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the main queue
      INFO  - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; addZkWatch foo
      INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating collection state at /collections/foo/state.json from ZooKeeper... 
      INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 4 
      INFO  - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
      INFO  - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; makePath: /live_nodes/192.168.1.2:8983_solr
      INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr collection:foo shard:shard1
      INFO  - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 4 
      INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; canceling election /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
      WARN  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; cancelElection did not find election node to remove /collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
      INFO  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
        "core":"foo_shard1_replica1",
        "core_node_name":"core_node1",
        "roles":null,
        "base_url":"http://192.168.1.2:8983/solr",
        "node_name":"192.168.1.2:8983_solr",
        "numShards":"1",
        "state":"down",
        "shard":"shard1",
        "collection":"foo",
        "operation":"state"} current state version: 1
      INFO  - 2015-04-01 17:28:51.135; org.apache.solr.cloud.overseer.ReplicaMutator; Update state numShards=1 message={
        "core":"foo_shard1_replica1",
        "core_node_name":"core_node1",
        "roles":null,
        "base_url":"http://192.168.1.2:8983/solr",
        "node_name":"192.168.1.2:8983_solr",
        "numShards":"1",
        "state":"down",
        "shard":"shard1",
        "collection":"foo",
        "operation":"state"}
      INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard1
      INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 4
      INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The last leader attempt started 198060ms ago.
      INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1)
      INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 5 
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to continue.
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try and sync
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync Success - now sync replicas to me
      INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas
      INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
      INFO  - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; makePath: /collections/foo/leaders/shard1
      INFO  - 2015-04-01 17:28:51.139; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
        "operation":"leader",
        "shard":"shard1",
        "collection":"foo"} current state version: 1
      INFO  - 2015-04-01 17:28:51.140; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 5
      INFO  - 2015-04-01 17:28:51.141; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1)
      INFO  - 2015-04-01 17:28:51.142; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected
      INFO  - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 6 
      INFO  - 2015-04-01 17:28:51.144; org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 1, message = {
        "operation":"leader",
        "shard":"shard1",
        "collection":"foo",
        "base_url":"http://192.168.1.2:8983/solr",
        "core":"foo_shard1_replica1",
        "state":"active"} current state version: 1
      INFO  - 2015-04-01 17:28:51.145; org.apache.solr.cloud.overseer.ZkStateWriter; going to update_collection /collections/foo/state.json version: 6
      INFO  - 2015-04-01 17:28:51.145; org.apache.solr.common.cloud.ZkStateReader$7; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/foo/state.json for collection foo has occurred - updating... (live nodes size: 1)
      INFO  - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 7 
      INFO  - 2015-04-01 17:28:51.147; org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired on path /overseer/queue state SyncConnected
      INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is http://192.168.1.2:8983/solr/foo_shard1_replica1/
      INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; 
      
       In register, core.isReloaded? true 
      
      
      INFO  - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; Updating data for foo to ver 7 
      INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch zkdir /configs/foo
      INFO  - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
      INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo
      INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1
      INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Current schema version 5 is already the latest
      INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
      INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current version of requestparams : 0
      INFO  - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; Current schema version 5 is already the latest
      INFO  - 2015-04-01 17:28:51.201; org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to ZooKeeper
      INFO  - 2015-04-01 17:28:51.202; org.apache.solr.common.cloud.ConnectionManager; Connected:true
      INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for /configs/foo
      INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; Running listeners for /configs/foo
      INFO  - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config update listener called for core foo_shard1_replica1
      INFO  - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current version of requestparams : 0
      INFO  - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 
      INFO  - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} status=0 QTime=11 
      
      Attachments

      1. SOLR-7338_test.patch (8 kB) - Timothy Potter
      2. SOLR-7338.patch (3 kB) - Mark Miller


          Activity

          Timothy Potter added a comment -

          but I don't understand what this flag is needed for anyway???

          Sorry, I wasn't clear ... I get the idea of having a flag to tell that a core is in the process of reloading, so that we can make various decisions about what to do with the tlogs, etc. However, I think that flag should be set back to false after a core is fully reloaded and active again, no? I can't see a reason why, after a successful reload is complete, that flag should stay == true.

          Mark Miller added a comment -

          looks like some broken refactoring or something.

          Look at what it used to be:

                UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
                if (!core.isReloaded() && ulog != null) {
                  Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler()
                      .getUpdateLog().recoverFromLog();
                  if (recoveryFuture != null) {
                    recoveryFuture.get(); // NOTE: this could potentially block for
                    // minutes or more!
                    // TODO: public as recovering in the mean time?
                    // TODO: in the future we could do peersync in parallel with recoverFromLog
                  } else {
                    log.info("No LogReplay needed for core="+core.getName() + " baseURL=" + baseUrl);
                  }
                }      
                boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
                    collection, coreZkNodeName, shardId, leaderProps, core, cc);
                if (!didRecovery) {
                  publish(desc, ZkStateReader.ACTIVE);
                }
          
          Mark Miller added a comment -

          I can't see a reason why after a successful reload is complete, that flag should stay == true.

          Why not? A reloaded core should never do various things like replay its transaction log in register. I don't see how it makes any sense if it doesn't stay true. The core has either been reloaded before or it has not.

          Mark Miller added a comment -

          looks like some broken refactoring or something.

          Or an incomplete attempt at a bug fix. We also do not want to recover on a reload, so this is probably an incomplete attempt at fixing that. It just didn't take into account that we still need that ACTIVE publish.

          Mark Miller added a comment -

          Okay, so, total story looks like:

          The isReloaded call was originally just used to prevent tlog replay other than on first startup.

          Recovery was always done, even on a reload.

          Later, it was noticed we should not be recovering on reload and the recovery check was also brought under isReloaded.

          This was not okay - the recovery check, unlike the tlog replay check, needs to know if this register attempt is the result of a reload - not if the core has ever been reloaded.

          We need a test that can catch this.
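
          As a rough sketch of that distinction (illustrative only, not the committed SOLR-7338 patch): the isReloaded guard would keep gating only the tlog replay, while the recovery check and the ACTIVE publish run on every register attempt. The calls below are the ones already shown in the snippets above.

                UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
                if (!core.isReloaded() && ulog != null) {
                  // replay the tlog only on the first startup of this core
                  Future<UpdateLog.RecoveryInfo> recoveryFuture = ulog.recoverFromLog();
                  if (recoveryFuture != null) {
                    recoveryFuture.get(); // NOTE: this can block for minutes
                  }
                }
                // never skipped: checkRecovery is where a reloaded core decides to
                // skip recovery (recoverReloadedCores), and the ACTIVE publish must
                // happen for reloaded cores too
                boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
                    collection, coreZkNodeName, shardId, leaderProps, core, cc);
                if (!didRecovery) {
                  publish(desc, ZkStateReader.ACTIVE);
                }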

          Timothy Potter added a comment -

          Thanks for digging that up. I still don't understand why we have to keep track of whether a core has ever been reloaded as part of the long term state of a core. Using a flag to decide that we don't need to replay the tlog after a reload, while the reloaded core is initializing is one thing, but what's the point of remembering core A has been reloaded at some point in the past after core A has fully initialized and become active? I guess my point is it seems like a temporary state and not part of the long term state of a core. But that's orthogonal to this issue anyway, so I'll fix the register code and add a test for this.

          Mark Miller added a comment - edited

          The flag is "has this core been reloaded before" - how is that a temporary state?

          That whole change I show above was indeed a bug. You can see that in checkRecovery; that is where we deal with skipping the recovery on a reloaded core. The calling code just got mangled.

          Mark Miller added a comment -

          A related bug is SOLR-6583.

          We should be skipping tlog replay on 'afterExpiration==true' as well.
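
          A minimal sketch of the combined guard that implies (assuming the afterExpiration flag that register already receives; illustrative only, not the committed patch):

                // skip tlog replay both for a reloaded core and when register runs
                // again after a ZK session expiration (afterExpiration == true)
                UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
                if (!core.isReloaded() && !afterExpiration && ulog != null) {
                  Future<UpdateLog.RecoveryInfo> recoveryFuture = ulog.recoverFromLog();
                  if (recoveryFuture != null) {
                    recoveryFuture.get(); // NOTE: this can block for minutes
                  }
                }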

          Timothy Potter added a comment -

          The flag is "has this core been reloaded before" - how is that a temporary state?

          I'm not saying it's temporary, I'm saying I can't see what possible value knowing that provides long term? I can see how that would be useful for reporting if it were a timestamp of some sort, such as last reloaded on, but a simple boolean makes no sense to me, nor do I see it being used anywhere in the code other than determining if things like tlog replay should be skipped while the core is initializing. Once it is fully active, the flag seems useless from either a reporting perspective or a state management perspective. But I think we've wasted enough time on this one ...

          Mark Miller added a comment -

          I think this is a lot closer to how it's supposed to work, though I'm sure it could all be cleaned up to be a little more clear.

          Mark Miller added a comment - edited

          I'm saying I can't see what possible value knowing that provides long term?

          For example, the value it provides in the specific code we are talking about? If the core has been reloaded and it's not afterExpiration, then it is a core reload register call. I don't understand how it doesn't provide value...

          I'm not saying it's temporary,

          But you said that a couple times...

          Timothy Potter added a comment -

          whatever ... I also said several times I'm talking about after register runs, not during ... of course I see value up until register happens, but you win

          Mark Miller added a comment -

          Also keep in mind, a reloaded SolrCore has differences from a non-reloaded core as well - for example, a reloaded core did not create its index writer or SolrCoreState like a new core does - it inherits them.

          Mark Miller added a comment -

          I also said several times I'm talking about after register runs, not during

          I guess I just don't understand that at all. Being able to tell if a core is a reloaded core has absolutely nothing to do with register. Register just happens to use that call because it allows register to tell if the register is coming from a new core or a reloaded one.

          I am unable to spot the issue.

          Mark Miller added a comment -

          If you do a call hierarchy on the SolrCore#isReloaded call, you can find a couple other uses for it as well. It's just kind of info we have to know. I guess we could use a different method of figuring out what we want in register, but this call works, is available, and is generally necessary.

          Yonik Seeley added a comment -

          Without looking at the code/patches I understand what Tim is saying, and agree. "isReloaded" feels more like "isReloading" (i.e. it's state that is only temporarily used to make other decisions during initialization.) I don't know how hard it would be to factor out though... prob not worth it.

          Timothy Potter added a comment - edited

          Not sure if it's useful to you, but here's the unit test I started working on (basically implements the scenario I described in the issue description) ... it currently fails as expected until the register code is fixed up ... it passes with your patch applied though.

          Mark Miller added a comment -

          "isReloaded" feels more like "isReloading"

          What's the gain, what's the point, what's the alternative?

          I don't get that at all. It tells you if the core has been reloaded. This is often useful in things that happen on creating a new SolrCore.

          Who cares about isReloading? I'm lost.

          Is it just too difficult to understand what isReloaded means?

          I'd be more confused with this temporary isReloading call - seems so easy for that to be tricky. isReloaded is so permanent and easy to understand. The core has been reloaded or it hasn't. How the heck is trying to track exactly whether the core is currently in the process of a reload any more useful?

          Anyone else?

          Mark Miller added a comment -

          Anyway, we can spin that off into a new issue if someone wants to try and refactor it. I just don't yet get the impetus for it. Onlookers feel free to chime in here until/unless a new issue is made.

          but here's the unit test I started working on

          Cool.

          Timothy Potter added a comment -

          Hi Mark Miller, do you think anything else needs to be done on this one? I'd actually like to get this into the 5.1 release - patch looks good to me. If you're comfortable with the unit test I posted, I can combine them and commit. Thanks.

          David Smiley added a comment -

          Here's a question for you Mark Miller: If every core were to be reloaded, would that change anything? What if I go and do that to all my cores? Can we just assume that all cores may have been reloaded at some point in the past? If we do assume that, do we lose anything, other than complexity?

          Mark Miller added a comment -

          I can combine them and commit.

          Go ahead. I think that's the right current fix and it also addresses SOLR-6583.

          ASF subversion and git services added a comment -

          Commit 1671554 from Timothy Potter in branch 'dev/trunk'
          [ https://svn.apache.org/r1671554 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          ASF subversion and git services added a comment -

          Commit 1671562 from Timothy Potter in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1671562 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          ASF subversion and git services added a comment -

          Commit 1671570 from Timothy Potter in branch 'dev/branches/lucene_solr_5_1'
          [ https://svn.apache.org/r1671570 ]

          SOLR-7338: A reloaded core will never register itself as active after a ZK session expiration, also fixes SOLR-6583

          Mark Miller added a comment -

          Given the RecoveryZkTest and ChaosMonkey test failures I've seen since this went in, I think it either broke something or exposed something.

          Timothy Potter added a comment -

          good catch ... going to try to get the failure to reproduce locally now

          Timothy Potter added a comment -

          running beast with 20 iters ... haven't been able to reproduce with 10 ... the logs from the failed test make it seem like the recovery process succeeded OK:

          [junit4]   2> added docs:1022 with 0 fails deletes:510
             [junit4]   2> 956835 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut Wait for recoveries to finish - wait 120 for each attempt
             [junit4]   2> 956836 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Wait for recoveries to finish - collection: collection1 failOnTimeout:true timeout (sec):120
             [junit4]   2> 959842 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - recoveringAfterStartup=true
             [junit4]   2> 959843 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
             [junit4]   2> 959845 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync WARN no frame of reference to tell if we've missed updates
             [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery PeerSync Recovery was not successful - trying replication. core=collection1
             [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Recovery was cancelled
             [junit4]   2> 959846 T6734 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Finished recovery process. core=collection1
             [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions The last recovery attempt started 7015ms ago.
             [junit4]   2> 959846 T6737 oasc.ActionThrottle.minimumWaitBetweenActions Throttling recovery attempts - waiting for 2984ms
             [junit4]   2> 959847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 QTime=1 
             [junit4]   2> 962833 T6739 C1144 P10059 oasc.RecoveryStrategy.run Starting recovery process.  core=collection1 recoveringAfterStartup=false
             [junit4]   2> 962834 T6739 C1144 P10059 oasc.RecoveryStrategy.doRecovery Publishing state of core collection1 as recovering, leader is http://127.0.0.1:10050/p_/z/collection1/ and I am http://127.0.0.1:10059/p_/z/collection1/
             [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish publishing core=collection1 state=recovering collection=collection1
             [junit4]   2> 962834 T6739 C:collection1 c:collection1 C1144 P10059 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
             [junit4]   2> 962836 T6649 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
             [junit4]   2> 962837 T6650 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
             [junit4]   2> 	  "collection":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "state":"recovering",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "base_url":"http://127.0.0.1:10059/p_/z",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "node_name":"127.0.0.1:10059_p_%2Fz",
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "operation":"state",
             [junit4]   2> 	  "numShards":"1"} current state version: 5
             [junit4]   2> 962837 T6650 oasco.ReplicaMutator.updateState Update state numShards=1 message={
             [junit4]   2> 	  "collection":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "state":"recovering",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "base_url":"http://127.0.0.1:10059/p_/z",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "node_name":"127.0.0.1:10059_p_%2Fz",
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "operation":"state",
             [junit4]   2> 	  "numShards":"1"}
             [junit4]   2> 962837 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.sendPrepRecoveryCmd Sending prep recovery command to http://127.0.0.1:10050/p_/z; WaitForState: action=PREPRECOVERY&core=collection1&nodeName=127.0.0.1%3A10059_p_%252Fz&coreNodeName=core_node2&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
             [junit4]   2> 962838 T6650 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/collection1/state.json version: 10
             [junit4]   2> 962839 T6671 oasha.CoreAdminHandler.handleWaitForStateAction Going to wait for coreNodeName: core_node2, state: recovering, checkLive: true, onlyIfLeader: true, onlyIfLeaderActive: true
             [junit4]   2> 962839 T6678 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json for collection collection1 has occurred - updating... (live nodes size: 3)
             [junit4]   2> 962839 T6728 C:collection1 S:shard1 c:collection1 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json for collection collection1 has occurred - updating... (live nodes size: 3)
             [junit4]   2> 962840 T6728 C:collection1 S:shard1 c:collection1 oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 11 
             [junit4]   2> 962841 T6671 oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 11 
             [junit4]   2> 962841 T6671 oasha.CoreAdminHandler.handleWaitForStateAction Will wait a max of 183 seconds to see collection1 (shard1 of collection1) have state: recovering
             [junit4]   2> 962842 T6678 oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 11 
             [junit4]   2> 962842 T6671 oasha.CoreAdminHandler.handleWaitForStateAction In WaitForState(recovering): collection=collection1, shard=shard1, thisCore=collection1, leaderDoesNotNeedRecovery=false, isLeader? true, live=true, checkLive=true, currentState=recovering, localState=active, nodeName=127.0.0.1:10059_p_%2Fz, coreNodeName=core_node2, onlyIfActiveCheckResult=false, nodeProps: core_node2:{"state":"recovering","base_url":"http://127.0.0.1:10059/p_/z","node_name":"127.0.0.1:10059_p_%2Fz","core":"collection1"}
             [junit4]   2> 962842 T6671 oasha.CoreAdminHandler.handleWaitForStateAction Waited coreNodeName: core_node2, state: recovering, checkLive: true, onlyIfLeader: true for: 0 seconds.
             [junit4]   2> 962843 T6671 oass.SolrDispatchFilter.handleAdminRequest [admin] webapp=null path=/admin/cores params={version=2&onlyIfLeader=true&checkLive=true&coreNodeName=core_node2&state=recovering&nodeName=127.0.0.1:10059_p_%252Fz&onlyIfLeaderActive=true&core=collection1&action=PREPRECOVERY&wt=javabin} status=0 QTime=5 
             [junit4]   2> 969843 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery Attempting to PeerSync from http://127.0.0.1:10050/p_/z/collection1/ core=collection1 - recoveringAfterStartup=false
             [junit4]   2> 969844 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z START replicas=[http://127.0.0.1:10050/p_/z/collection1/] nUpdates=100
             [junit4]   2> 969847 T6672 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/get params={version=2&getVersions=100&distrib=false&qt=/get&wt=javabin} status=0 QTime=0 
             [junit4]   2> 969848 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.handleVersions PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z  Received 100 versions from http://127.0.0.1:10050/p_/z/collection1/
             [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.handleVersions PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z  Our versions are newer. ourLowThreshold=1497723899508424704 otherHigh=1497723899760082944
             [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.PeerSync.sync PeerSync: core=collection1 url=http://127.0.0.1:10059/p_/z DONE. sync succeeded
             [junit4]   2> 969849 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.DirectUpdateHandler2.commit start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
             [junit4]   2> 969865 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: commits: num=2
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@e80270f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7f845138),segFN=segments_1,generation=1}
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@e80270f lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@7f845138),segFN=segments_2,generation=2}
             [junit4]   2> 969866 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
             [junit4]   2> 969869 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oass.SolrIndexSearcher.<init> Opening Searcher@46806cf3[collection1] main
             [junit4]   2> 969869 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasu.DirectUpdateHandler2.commit end_commit_flush
             [junit4]   2> 969870 T6730 C:collection1 S:shard1 c:collection1 oasc.SolrCore.registerSearcher [collection1] Registered new searcher Searcher@46806cf3[collection1] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C115/43:delGen=1) Uninverting(_1(5.2.0):C360/137:delGen=1)))}
             [junit4]   2> 969870 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.RecoveryStrategy.doRecovery PeerSync Recovery was successful - registering as Active. core=collection1
             [junit4]   2> 969871 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.ZkController.publish publishing core=collection1 state=active collection=collection1
             [junit4]   2> 969871 T6739 C:collection1 S:shard1 c:collection1 C1144 P10059 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
             [junit4]   2> 969872 T6649 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
             [junit4]   2> 969874 T6650 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
             [junit4]   2> 	  "collection":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "state":"active",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "base_url":"http://127.0.0.1:10059/p_/z",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "node_name":"127.0.0.1:10059_p_%2Fz",
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "operation":"state",
             [junit4]   2> 	  "numShards":"1"} current state version: 5
             [junit4]   2> 969875 T6650 oasco.ReplicaMutator.updateState Update state numShards=1 message={
             [junit4]   2> 	  "collection":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "state":"active",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "base_url":"http://127.0.0.1:10059/p_/z",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "node_name":"127.0.0.1:10059_p_%2Fz",
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "operation":"state",
             [junit4]   2> 	  "numShards":"1"}
             [junit4]   2> 969876 T6650 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/collection1/state.json version: 11
             [junit4]   2> 969876 T6728 C:collection1 S:shard1 c:collection1 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json for collection collection1 has occurred - updating... (live nodes size: 3)
             [junit4]   2> 969876 T6678 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/collection1/state.json for collection collection1 has occurred - updating... (live nodes size: 3)
             [junit4]   2> 969877 T6728 C:collection1 S:shard1 c:collection1 oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 12 
             [junit4]   2> 969877 T6678 oascc.ZkStateReader.updateWatchedCollection Updating data for collection1 to ver 12 
             [junit4]   2> 970870 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished - collection: collection1
             [junit4]   2> 970873 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oasu.DirectUpdateHandler2.commit start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
             [junit4]   2> 970920 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: commits: num=2
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@25082424 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@397cca7a),segFN=segments_1,generation=1}
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@25082424 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@397cca7a),segFN=segments_2,generation=2}
             [junit4]   2> 970921 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
             [junit4]   2> 970924 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oass.SolrIndexSearcher.<init> Opening Searcher@1508b423[collection1] main
             [junit4]   2> 970925 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oasu.DirectUpdateHandler2.commit end_commit_flush
             [junit4]   2> 970926 T6654 C:control_collection c:collection1 oasc.SolrCore.registerSearcher [collection1] Registered new searcher Searcher@1508b423[collection1] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C1303/622:delGen=1) Uninverting(_1(5.2.0):C740/398:delGen=1)))}
             [junit4]   2> 970926 T6638 C:control_collection S:shard1 R:core_node1 c:collection1 C1146 P10043 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&waitSearcher=true&softCommit=false&wt=javabin&commit=true} {commit=} 0 53
             [junit4]   2> 970931 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasu.DirectUpdateHandler2.commit start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
             [junit4]   2>  C1144_STATE=coll:collection1 core:collection1 props:{state=active, base_url=http://127.0.0.1:10059/p_/z, node_name=127.0.0.1:10059_p_%2Fz, core=collection1}
             [junit4]   2> 970932 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasu.DirectUpdateHandler2.commit start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
             [junit4]   2> 970933 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasu.DirectUpdateHandler2.commit No uncommitted changes. Skipping IW.commit.
             [junit4]   2> 970933 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasc.SolrCore.openNewSearcher SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
             [junit4]   2> 970934 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasu.DirectUpdateHandler2.commit end_commit_flush
             [junit4]   2> 970934 T6717 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&commit_end_point=true&waitSearcher=true&update.distrib=FROMLEADER&openSearcher=true&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&expungeDeletes=false&softCommit=false&wt=javabin&commit=true} {commit=} 0 2
             [junit4]   2> 970978 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrDeletionPolicy.onCommit SolrDeletionPolicy.onCommit: commits: num=2
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@5b80c596 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632c29c8),segFN=segments_1,generation=1}
             [junit4]   2> 		commit{dir=MockDirectoryWrapper(RAMDirectory@5b80c596 lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@632c29c8),segFN=segments_2,generation=2}
             [junit4]   2> 970979 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrDeletionPolicy.updateCommits newest commit generation = 2
             [junit4]   2> 970982 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oass.SolrIndexSearcher.<init> Opening Searcher@4273e191[collection1] main
             [junit4]   2> 970983 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasu.DirectUpdateHandler2.commit end_commit_flush
             [junit4]   2> 970983 T6680 C:collection1 c:collection1 oasc.SolrCore.registerSearcher [collection1] Registered new searcher Searcher@4273e191[collection1] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_0(5.2.0):C861/415:delGen=1) Uninverting(_1(5.2.0):C1182/605:delGen=1)))}
             [junit4]   2> 970984 T6668 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&commit_end_point=true&waitSearcher=true&update.distrib=FROMLEADER&openSearcher=true&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&expungeDeletes=false&softCommit=false&wt=javabin&commit=true} {commit=} 0 53
             [junit4]   2> 970986 T6670 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&waitSearcher=true&_stateVer_=collection1:6&softCommit=false&wt=javabin&commit=true} {commit=} 0 57
             [junit4]   2> 970991 T6666 C:collection1 S:shard1 R:core_node1 c:collection1 C1145 P10050 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/select params={version=2&tests=checkShardConsistency&distrib=false&rows=0&q=*:*&wt=javabin} hits=1023 status=0 QTime=1 
             [junit4]   2> 970993 T6721 C:collection1 S:shard1 R:core_node2 c:collection1 C1144 P10059 oasc.SolrCore.execute [collection1] webapp=/p_/z path=/select params={version=2&tests=checkShardConsistency&distrib=false&rows=0&q=*:*&wt=javabin} hits=295 status=0 QTime=0 
             [junit4]   2> 970994 T6623 C:collection1 S:shard1 c:collection1 oasc.AbstractFullDistribZkTestBase.waitForThingsToLevelOut shard inconsistency - waiting ...
          
          Timothy Potter added a comment -

           Hmmm ... no failures with -Dbeast.iters=20 either, and nothing much is jumping out at me in the logs. The replica with the missing data definitely received the update that the end of the test reports as missing; for instance:

          ###### Only in http://127.0.0.1:10050/p_/z/collection1: [{_version_=1497723893329166337, id=2-472}
          

          but earlier in the logs, we see:

          [junit4]   2> 949616 T6637 C:control_collection S:shard1 R:core_node1 c:collection1 C1136 P10043 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&CONTROL=TRUE&wt=javabin} {add=[2-472 (1497723893326020609)]} 0 0
             [junit4]   2> 949621 T6686 C:collection1 S:shard1 R:core_node2 c:collection1 C1134 P10059 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&update.distrib=FROMLEADER&distrib.from=http://127.0.0.1:10050/p_/z/collection1/&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 0
             [junit4]   2> 949622 T6670 C:collection1 S:shard1 R:core_node1 c:collection1 C1135 P10050 oasup.LogUpdateProcessor.finish [collection1] webapp=/p_/z path=/update params={version=2&wt=javabin} {add=[2-472 (1497723893329166337)]} 0 3
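
           For what it's worth, a quick way to verify by hand whether a particular core has a given doc is to query each replica directly with distrib=false, which is essentially what the test's checkShardConsistency queries do (see the /select requests with distrib=false earlier in the logs). Below is a minimal SolrJ 5.x sketch along those lines; the replica URLs and the doc id are copied from this test run, and the class name is just for illustration, not code from the test suite:

           import org.apache.solr.client.solrj.SolrQuery;
           import org.apache.solr.client.solrj.impl.HttpSolrClient;

           public class ReplicaDocCheck {
             public static void main(String[] args) throws Exception {
               String[] replicaUrls = {
                   "http://127.0.0.1:10050/p_/z/collection1",  // core_node1 (leader)
                   "http://127.0.0.1:10059/p_/z/collection1"   // core_node2 (replica)
               };
               for (String url : replicaUrls) {
                 try (HttpSolrClient client = new HttpSolrClient(url)) {
                   SolrQuery q = new SolrQuery("id:\"2-472\""); // the doc reported as missing
                   q.setRows(1);
                   q.set("distrib", "false"); // ask only this core, not the whole collection
                   long numFound = client.query(q).getResults().getNumFound();
                   System.out.println(url + " -> numFound=" + numFound);
                 }
               }
             }
           }

           If both cores report numFound=1, the doc is present everywhere and the inconsistency lies elsewhere; if only the leader reports it, the replica really did drop it.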
          

          I used the same seed as the failed build as well - FBFBECE5D5AABA29

           Do you see anything on your side, Mark Miller?

          Mark Miller added a comment -

           A lot of my ChaosMonkey test runs on my local Jenkins machine started failing after that commit. I have not had a chance to dig into the logs yet, though.

          Mark Miller added a comment -

           These may have been coincidental fails (still, bad new(ish) 'replicas out of sync' fails) - I'll spend some time tomorrow looking closer and either post what I find or close this issue again.

          Timothy Potter added a comment -

           OK, cool - I have the RC built for 5.1 with this fix in, but will redo it if needed ... I can't get it to reproduce with 50 beast iterations!

          ant beast -Dbeast.iters=50  -Dtestcase=RecoveryZkTest -Dtests.method=test -Dtests.seed=FBFBECE5D5AABA29 -Dtests.multiplier=2 -Dtests.slow=true -Dtests.locale=ar_SD -Dtests.timezone=America/Marigot -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
          
          Timothy Potter added a comment -

           Mark Miller, I don't think the RecoveryZkTest failure is due to this ticket, as it failed in a similar fashion prior to that commit:
          https://builds.apache.org/job/Lucene-Solr-Tests-5.x-Java7/2888/

           I've actually beast'd that test 100 times with 5.1 without a failure, so I'd like to move forward with the RC I've already built and staged unless you advise otherwise.

          Mark Miller added a comment -

          Same status as yesterday - I'll look into this today.

          Mark Miller added a comment -

           Okay - I've looked through my Jenkins logs. There are some concerning new fails, but it just looks like coincidence that they and the regular Jenkins 'replica out of sync' failures popped up together after this commit. Looking deeper, nothing indicates an ongoing problem started by this commit.

          Timothy Potter added a comment -

          Bulk close after 5.1 release


            People

            • Assignee:
              Mark Miller
              Reporter:
              Timothy Potter
            • Votes:
              0
              Watchers:
              5
