SOLR-6956: OverseerCollectionProcessor and replicas on the overseer node can operate on stale cluster state

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 5.1, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      I still see failures of this test sometimes:

      java.lang.AssertionError: Should have had a good message here
      	at __randomizedtesting.SeedInfo.seed([D765D9019AAF2D1E:56835719EDF04D22]:0)
      	at org.junit.Assert.fail(Assert.java:93)
      	at org.junit.Assert.assertTrue(Assert.java:43)
      	at org.apache.solr.cloud.DeleteReplicaTest.deleteLiveReplicaTest(DeleteReplicaTest.java:138)
      	at org.apache.solr.cloud.DeleteReplicaTest.doTest(DeleteReplicaTest.java:89)
      
      Attachments

      1. DeleteReplicaTest-failure.log (854 kB) - Shalin Shekhar Mangar
      2. SOLR-6956.patch (14 kB) - Shalin Shekhar Mangar
      3. SOLR-6956.patch (13 kB) - Shalin Shekhar Mangar
      4. SOLR-6956.patch (12 kB) - Shalin Shekhar Mangar


          Activity

          Shalin Shekhar Mangar added a comment (edited)

          Looking through the logs of the latest failure of this test at http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/11802/

          This test waits for all replicas of a new collection to be 'active', then calls the delete replica API with onlyIfDown=true and expects it to fail. It verifies the failure by matching the error message returned from the API call against the following string:

          'with onlyIfDown='true', but state is 'active'

          However, in this particular run the API call does fail, but the OverseerCollectionProcessor finds the replica in the 'recovering' state instead of 'active', so the returned error message doesn't match. The fix to the test is therefore easy. What's more interesting is why the OverseerCollectionProcessor couldn't see the latest cluster state when the client could. The logs show watchers firing on the new cluster state before the delete replica call is made. I'll dig a bit more before I commit a fix to the test.
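
          A minimal sketch of the check described above, assuming JUnit 4 (this is not the actual DeleteReplicaTest code, and sendDeleteReplicaRequest is a hypothetical stand-in for the SolrJ call that issues DELETEREPLICA with onlyIfDown=true):

             import static org.junit.Assert.assertTrue;
             import static org.junit.Assert.fail;

             class OnlyIfDownCheckSketch {
                 void assertOnlyIfDownRejected() {
                     try {
                         // Hypothetical helper; the real test goes through the Collections API.
                         sendDeleteReplicaRequest("delLiveColl", "shard1", "core_node1", true);
                         fail("Expected the request to be rejected because the replica is not down");
                     } catch (Exception e) {
                         // Brittle: if the replica is seen as 'recovering' rather than 'active',
                         // the server-side message changes and this assertion fails.
                         assertTrue("Should have had a good message here",
                             e.getMessage().contains("with onlyIfDown='true', but state is 'active'"));
                     }
                 }

                 void sendDeleteReplicaRequest(String collection, String shard, String replica,
                                               boolean onlyIfDown) throws Exception {
                     // Placeholder body standing in for the real request.
                 }
             }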

          Shalin Shekhar Mangar added a comment

          The full test log from Jenkins is attached.

          The following is the relevant excerpt from the log:

          [junit4]   2> 357522 T2350 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/delLiveColl/state.json version: 10
             [junit4]   2> 357523 T2349 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357523 T2450 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357523 T2425 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357523 T2381 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357523 T2450 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11 
             [junit4]   2> 357524 T2425 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11 
             [junit4]   2> 357524 T2381 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 11 
             [junit4]   2> 357559 T2321 oasc.AbstractDistribZkTestBase.waitForRecoveriesToFinish Recoveries finished - collection: delLiveColl
             [junit4]   2> 357562 T2372 oass.SolrDispatchFilter.handleAdminRequest [admin] webapp=null path=/admin/cores params={core=delLiveColl_shard1_replica1&action=STATUS&indexInfo=true&wt=javabin&version=2} status=0 QTime=1 
             [junit4]   2> 357565 T2339 oasha.CollectionsHandler.handleRemoveReplica Remove replica: replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2
             [junit4]   2> 357566 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
             [junit4]   2> 357567 T2351 oasc.OverseerCollectionProcessor.run Overseer Collection Processor: Get the message id:/overseer/collection-queue-work/qn-0000000002 message:{
             [junit4]   2> 	  "operation":"deletereplica",
             [junit4]   2> 	  "collection":"delLiveColl",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "replica":"core_node1",
             [junit4]   2> 	  "onlyIfDown":"true"}
             [junit4]   2> 357567 T2489 oasc.OverseerCollectionProcessor.processMessage WARN OverseerCollectionProcessor.processMessage : deletereplica , {
             [junit4]   2> 	  "operation":"deletereplica",
             [junit4]   2> 	  "collection":"delLiveColl",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "replica":"core_node1",
             [junit4]   2> 	  "onlyIfDown":"true"}
             [junit4]   2> 357567 T2489 oasc.SolrException.log ERROR Collection: delLiveColl operation: deletereplica failed:org.apache.solr.common.SolrException: Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering'
             [junit4]   2> 		at org.apache.solr.cloud.OverseerCollectionProcessor.deleteReplica(OverseerCollectionProcessor.java:1079)
             [junit4]   2> 		at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:631)
             [junit4]   2> 		at org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2862)
             [junit4]   2> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
             [junit4]   2> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
             [junit4]   2> 		at java.lang.Thread.run(Thread.java:745)
             [junit4]   2> 	
             [junit4]   2> 357569 T2489 oasc.OverseerCollectionProcessor$Runner.run Overseer Collection Processor: Message id:/overseer/collection-queue-work/qn-0000000002 complete, response:{Operation deletereplica caused exception:=org.apache.solr.common.SolrException: Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering',exception={msg=Attempted to remove replica : delLiveColl/shard1/core_node1 with onlyIfDown='true', but state is 'recovering',rspCode=400}}
             [junit4]   2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process NodeDataChanged fired on path /overseer/collection-queue-work/qnr-0000000002 state SyncConnected
             [junit4]   2> 357570 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
             [junit4]   2> 357571 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/collection-queue-work state SyncConnected
             [junit4]   2> 357572 T2339 oass.SolrDispatchFilter.handleAdminRequest [admin] webapp=null path=/admin/collections params={replica=core_node1&action=deletereplica&collection=delLiveColl&shard=shard1&onlyIfDown=true&wt=javabin&version=2} status=400 QTime=7 
             [junit4]   2> 357574 T2321 oas.SolrTestCaseJ4.tearDown ###Ending deleteLiveReplicaTest
             [junit4]   2> 357577 T2321 oasc.ChaosMonkey.monkeyLog monkey: stop shard! 55369
             [junit4]   2> 357577 T2321 oasc.CoreContainer.shutdown Shutting down CoreContainer instance=802937226
             [junit4]   2> 357577 T2321 oasc.ZkController.publish publishing core=collection1 state=down collection=control_collection
             [junit4]   2> 357577 T2321 oasc.ZkController.publish numShards not found on descriptor - reading it from system property
             [junit4]   2> 357578 T2364 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
             [junit4]   2> 357578 T2321 oasc.ZkController.publish publishing core=delLiveColl_shard1_replica2 state=down collection=delLiveColl
             [junit4]   2> 357579 T2350 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node1",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "base_url":"http://127.0.0.1:55369",
             [junit4]   2> 	  "node_name":"127.0.0.1:55369_",
             [junit4]   2> 	  "numShards":"2",
             [junit4]   2> 	  "state":"down",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "collection":"control_collection",
             [junit4]   2> 	  "operation":"state"} current state version: 6
             [junit4]   2> 357579 T2321 oasc.SolrCore.close [collection1]  CLOSING SolrCore org.apache.solr.core.SolrCore@6eb4277d
             [junit4]   2> 357579 T2350 oasco.ReplicaMutator.updateState Update state numShards=2 message={
             [junit4]   2> 	  "core":"collection1",
             [junit4]   2> 	  "core_node_name":"core_node1",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "base_url":"http://127.0.0.1:55369",
             [junit4]   2> 	  "node_name":"127.0.0.1:55369_",
             [junit4]   2> 	  "numShards":"2",
             [junit4]   2> 	  "state":"down",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "collection":"control_collection",
             [junit4]   2> 	  "operation":"state"}
             [junit4]   2> 357579 T2321 oasc.ZkController.unRegisterConfListener  a listener was removed because of core close
             [junit4]   2> 357579 T2321 oasu.DirectUpdateHandler2.close closing DirectUpdateHandler2{commits=0,autocommits=0,soft autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
             [junit4]   2> 357579 T2321 oasu.SolrCoreState.decrefSolrCoreState Closing SolrCoreState
             [junit4]   2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter SolrCoreState ref count has reached 0 - closing IndexWriter
             [junit4]   2> 357580 T2321 oasu.DefaultSolrCoreState.closeIndexWriter closing IndexWriter with IndexWriterCloser
             [junit4]   2> 357580 T2364 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
             [junit4]   2> 357580 T2381 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
             [junit4]   2> 357580 T2425 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
             [junit4]   2> 357580 T2450 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
             [junit4]   2> 357580 T2403 oascc.ZkStateReader$2.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 5)
             [junit4]   2> 357580 T2321 oasc.SolrCore.closeSearcher [collection1] Closing main searcher on request.
             [junit4]   2> 357586 T2350 oasc.Overseer$ClusterStateUpdater.run processMessage: queueSize: 1, message = {
             [junit4]   2> 	  "core":"delLiveColl_shard1_replica2",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "base_url":"http://127.0.0.1:55369",
             [junit4]   2> 	  "node_name":"127.0.0.1:55369_",
             [junit4]   2> 	  "numShards":"2",
             [junit4]   2> 	  "state":"down",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "collection":"delLiveColl",
             [junit4]   2> 	  "operation":"state"} current state version: 7
             [junit4]   2> 357587 T2350 oasco.ReplicaMutator.updateState Update state numShards=2 message={
             [junit4]   2> 	  "core":"delLiveColl_shard1_replica2",
             [junit4]   2> 	  "core_node_name":"core_node2",
             [junit4]   2> 	  "roles":null,
             [junit4]   2> 	  "base_url":"http://127.0.0.1:55369",
             [junit4]   2> 	  "node_name":"127.0.0.1:55369_",
             [junit4]   2> 	  "numShards":"2",
             [junit4]   2> 	  "state":"down",
             [junit4]   2> 	  "shard":"shard1",
             [junit4]   2> 	  "collection":"delLiveColl",
             [junit4]   2> 	  "operation":"state"}
             [junit4]   2> 357589 T2490 oasc.DistributedQueue$LatchWatcher.process NodeChildrenChanged fired on path /overseer/queue state SyncConnected
             [junit4]   2> 357589 T2350 oasco.ZkStateWriter.writePendingUpdates going to update_collection /collections/delLiveColl/state.json version: 11
             [junit4]   2> 357589 T2425 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357589 T2450 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357589 T2381 oascc.ZkStateReader$7.process A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/collections/delLiveColl/state.json, has occurred - updating... 
             [junit4]   2> 357590 T2425 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12 
             [junit4]   2> 357590 T2381 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12 
             [junit4]   2> 357590 T2450 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12 
             [junit4]   2> 357593 T2349 oascc.ZkStateReader.updateWatchedCollection Updating data for delLiveColl to ver 12 
          

          In this run, the test has 4 nodes and the collection being deleted has 2 shards and 4 replicas distributed across those 4 nodes. The overseer node itself hosts a replica of the collection being deleted.

          Notice that the Overseer updates the collection state, currently at version 10, and four watcher threads fire to update the state of 'delLiveColl' to version 11. But the ZkStateReader.updateWatchedCollection method is executed by only three of those watchers. Thread T2349 doesn't run updateWatchedCollection, presumably because it is waiting for the update lock.

          I think this failure is linked to SOLR-6554, where my effort to make the overseer faster causes the update lock to be held for a long time on the overseer node, so state updates are not visible to anyone except the Overseer on that node. This causes the OverseerCollectionProcessor to operate on stale state, and it may well be the reason for various other failures in our cloud tests.
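
          To make the suspected interaction concrete, here is a minimal, self-contained sketch in plain Java (not Solr code) of how one thread holding an update lock for a long stretch can delay a watcher thread's refresh, leaving readers on that node with a stale cached version:

             class StaleStateSketch {
                 static final Object updateLock = new Object();   // stands in for ZkStateReader.getUpdateLock()
                 static volatile int cachedVersion = 10;          // locally cached state.json version

                 public static void main(String[] args) throws InterruptedException {
                     Thread overseerLoop = new Thread(() -> {
                         synchronized (updateLock) {
                             sleepQuietly(500);                   // a long batch of queued state updates
                         }
                     });
                     Thread watcher = new Thread(() -> {
                         synchronized (updateLock) {              // blocked until the overseer loop releases the lock
                             cachedVersion = 11;                  // refresh the watched collection to the new version
                         }
                     });
                     overseerLoop.start();
                     Thread.sleep(50);                            // let the overseer loop grab the lock first
                     watcher.start();                             // the watcher fires, but cannot apply the update yet
                     Thread.sleep(100);
                     System.out.println("reader on this node sees version " + cachedVersion);  // prints 10: stale
                     overseerLoop.join();
                     watcher.join();
                 }

                 static void sleepQuietly(long ms) {
                     try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                 }
             }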

          ASF subversion and git services added a comment

          Commit 1660104 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1660104 ]

          SOLR-6956: Log collection name as well as live nodes size

          ASF subversion and git services added a comment

          Commit 1660105 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1660105 ]

          SOLR-6956: Log collection name as well as live nodes size

          Shalin Shekhar Mangar added a comment

          There is no need to synchronize on the ZkStateReader's update lock inside the Overseer. It may have been necessary in the past, but I don't see why it is useful today. The lock is acquired by the reader anyway if the cluster state is refreshed forcefully.

          This patch removes synchronization on ZkStateReader.getUpdateLock inside the Overseer. It also combines the two work queue loops in the Overseer (the first at overseer startup and the second when refreshClusterState == true). I had kept them separate during SOLR-6554 because I thought acquiring the update lock was necessary and we didn't want to hold the lock while waiting for a peek.
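
          A rough sketch of the shape of that change, using hypothetical names (ZkStateReaderLike, processQueueItem) rather than the actual patch:

             class OverseerLoopSketch {
                 interface ZkStateReaderLike { Object getUpdateLock(); }

                 private final ZkStateReaderLike reader;
                 OverseerLoopSketch(ZkStateReaderLike reader) { this.reader = reader; }

                 // Before: each queued message was applied while holding the reader's update lock,
                 // which blocks watcher threads on the overseer node for the duration of the loop.
                 Object applyWithLock(Object clusterState, Object message) {
                     synchronized (reader.getUpdateLock()) {
                         return processQueueItem(clusterState, message);
                     }
                 }

                 // After: apply the message without the outer lock; the reader still takes the lock
                 // internally when the cluster state is refreshed forcefully.
                 Object applyWithoutLock(Object clusterState, Object message) {
                     return processQueueItem(clusterState, message);
                 }

                 private Object processQueueItem(Object clusterState, Object message) {
                     return clusterState; // placeholder for the real state-mutation logic
                 }
             }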

          Shalin Shekhar Mangar added a comment

          I added a forced cluster state update before executing a collection API command. These commands are few and far between, and a forced update ensures that the latest state is visible to the OverseerCollectionProcessor.

          Shalin Shekhar Mangar added a comment

          Moved the updateClusterState call inside the try-catch block of processMessage. I'll commit this shortly.
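
          A minimal sketch of where the forced refresh ends up, assuming a simplified reader interface; it mirrors the description above and the updateClusterState(true) call visible in the stack trace later in this issue, but it is not the actual OverseerCollectionProcessor code:

             class ProcessMessageSketch {
                 interface StateReader { void updateClusterState(boolean immediate) throws Exception; }

                 private final StateReader zkStateReader;
                 ProcessMessageSketch(StateReader zkStateReader) { this.zkStateReader = zkStateReader; }

                 String processMessage(String operation) {
                     try {
                         // Force-refresh the cluster state so the processor sees the latest replica
                         // states before acting on a collection API command; these commands are
                         // infrequent, so the extra ZooKeeper read is cheap.
                         zkStateReader.updateClusterState(true);
                         return dispatch(operation);          // stand-in for deletereplica, createcollection, ...
                     } catch (Exception e) {
                         return "error: " + e.getMessage();   // stand-in for the real error response
                     }
                 }

                 private String dispatch(String operation) { return "ok: " + operation; }
             }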

          ASF subversion and git services added a comment

          Commit 1660449 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1660449 ]

          SOLR-6956: OverseerCollectionProcessor and replicas on the overseer node can sometimes operate on stale cluster state

          ASF subversion and git services added a comment

          Commit 1660450 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1660450 ]

          SOLR-6956: OverseerCollectionProcessor and replicas on the overseer node can sometimes operate on stale cluster state

          ASF subversion and git services added a comment

          Commit 1660453 from shalin@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1660453 ]

          SOLR-6956: A better message for DeleteReplicaTest failures

          ASF subversion and git services added a comment

          Commit 1660455 from shalin@apache.org in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1660455 ]

          SOLR-6956: A better message for DeleteReplicaTest failures

          Anshum Gupta added a comment

          It seems this broke the build (Jenkins hasn't gotten there yet).
          The OCP tests are failing:

             [junit4] ERROR   10.1s | OverseerCollectionProcessorTest.testReplicationCollectionNotCreatedDueToMaxShardsPerNodeLimit <<<
             [junit4]    > Throwable #1: java.lang.AssertionError:  Queue not empty within 10000 ms1424216414061
             [junit4]    > 	at org.apache.solr.cloud.OverseerCollectionProcessorTest.waitForEmptyQueue(OverseerCollectionProcessorTest.java:556)
             [junit4]    > 	at org.apache.solr.cloud.OverseerCollectionProcessorTest.testTemplate(OverseerCollectionProcessorTest.java:606)
             [junit4]    > 	at org.apache.solr.cloud.OverseerCollectionProcessorTest.testReplicationCollectionNotCreatedDueToMaxShardsPerNodeLimit(OverseerCollectionProcessorTest.java:737)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=21, name=OverseerThreadFactory-3-thread-1, state=RUNNABLE, group=TGRP-OverseerCollectionProcessorTest]
             [junit4]    > Caused by: java.lang.AssertionError: 
             [junit4]    >   Unexpected method call updateClusterState(true):
             [junit4]    >     getClusterProps(): expected: 1, actual: 0
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([1247AEAB11BEC8AF]:0)
             [junit4]    > 	at org.easymock.internal.MockInvocationHandler.invoke(MockInvocationHandler.java:45)
             [junit4]    > 	at org.easymock.internal.ObjectMethodsFilter.invoke(ObjectMethodsFilter.java:73)
             [junit4]    > 	at org.easymock.internal.ClassProxyFactory$MockMethodInterceptor.intercept(ClassProxyFactory.java:92)
             [junit4]    > 	at org.apache.solr.common.cloud.ZkStateReader$$EnhancerByCGLIB$$bb1422ab.updateClusterState(<generated>)
             [junit4]    > 	at org.apache.solr.cloud.OverseerCollectionProcessor.processMessage(OverseerCollectionProcessor.java:584)
             [junit4]    > 	at org.apache.solr.cloud.OverseerCollectionProcessor$Runner.run(OverseerCollectionProcessor.java:2864)
             [junit4]    > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
             [junit4]    > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
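
          For context, a minimal sketch of the kind of expectation the strictly mocked ZkStateReader would need for the new forced refresh, assuming EasyMock record mode; the actual setup in OverseerCollectionProcessorTest differs:

             import org.apache.solr.common.cloud.ZkStateReader;
             import org.easymock.EasyMock;

             class MockExpectationSketch {
                 static void expectForcedRefresh(ZkStateReader zkStateReaderMock) throws Exception {
                     zkStateReaderMock.updateClusterState(true);  // record the newly added forced refresh...
                     EasyMock.expectLastCall().anyTimes();        // ...and allow it any number of times
                 }
             }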
          
          ASF subversion and git services added a comment

          Commit 1660544 from Anshum Gupta in branch 'dev/trunk'
          [ https://svn.apache.org/r1660544 ]

          SOLR-6956: Fixing the mocked OCP in OverseerCollectionProcessorTest

          ASF subversion and git services added a comment

          Commit 1660546 from Anshum Gupta in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1660546 ]

          SOLR-6956: Fixing the mocked OCP in OverseerCollectionProcessorTest (merge from trunk)

          Shalin Shekhar Mangar added a comment

          Thanks, Anshum, for fixing the OCP test. I haven't seen this failure since the patch was committed, so I'll resolve this now.

          Timothy Potter added a comment

          Bulk close after 5.1 release


            People

            • Assignee: Shalin Shekhar Mangar
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 5
