SOLR-6591: Cluster state updates can be lost on exception in main queue loop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.0
    • Fix Version/s: 5.0, 6.0
    • Component/s: SolrCloud
    • Labels: None

      Description

      I found this bug while going through the following failure on Jenkins:
      https://builds.apache.org/job/Lucene-Solr-NightlyTests-trunk/648/

      2 tests failed.
      REGRESSION:  org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testDistribSearch
      
      Error Message:
      Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
      
      Stack Trace:
      org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Error CREATEing SolrCore 'halfcollection_shard1_replica1': Unable to create core [halfcollection_shard1_replica1] Caused by: Could not get shard id for core: halfcollection_shard1_replica1
              at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:570)
              at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:215)
              at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
              at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.testErrorHandling(CollectionsAPIDistributedZkTest.java:583)
              at org.apache.solr.cloud.CollectionsAPIDistributedZkTest.doTest(CollectionsAPIDistributedZkTest.java:205)
              at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1618)
      
      Attachments

      1. SOLR-6591.patch
        6 kB
        Shalin Shekhar Mangar
      2. SOLR-6591-constructStateFix.patch
        2 kB
        Shalin Shekhar Mangar
      3. SOLR-6591-ignore-no-collection-path.patch
        2 kB
        Shalin Shekhar Mangar
      4. SOLR-6591-no-mixed-batches.patch
        2 kB
        Shalin Shekhar Mangar

        Activity

        Shalin Shekhar Mangar added a comment - edited

        What happens here is:

        1. The Stress Collection Creation thread in that test is trying to create collections (which have stateFormat=2)
        2. The overseer gets a "state" message from a new core created using the core admin API. This should implicitly create a new collection:
             [junit4]   2> 561673 T45931 oasc.Overseer$ClusterStateUpdater.updateState Update state numShards=1 message={
             [junit4]   2>          "collection":"halfcollectionblocker",
             [junit4]   2>          "base_url":"http://127.0.0.1:42021",
             [junit4]   2>          "state":"down",
             [junit4]   2>          "numShards":"1",
             [junit4]   2>          "node_name":"127.0.0.1:42021_",
             [junit4]   2>          "roles":null,
             [junit4]   2>          "shard":null,
             [junit4]   2>          "operation":"state",
             [junit4]   2>          "core":"halfcollection_shard1_replica1"}
             [junit4]   2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection Create collection halfcollectionblocker with shards [shard1]
             [junit4]   2> 561674 T45931 oasc.Overseer$ClusterStateUpdater.createCollection state version halfcollectionblocker 1
             [junit4]   2> 561679 T45931 oasc.Overseer$ClusterStateUpdater.updateState Assigning new node to shard shard=shard1
          
        3. Right after the above message, the overseer gets a message to create 'awholynewstresscollection_collection4_1' (I'm assuming through a "state" message). This fails with the following message:
             [junit4]   2> 561682 T45931 oasc.Overseer$ClusterStateUpdater.run ERROR Exception in Overseer main queue loop org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/awholynewstresscollection_collection4_1/state.json
             [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
             [junit4]   2>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
             [junit4]   2>        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
             [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:382)
             [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:379)
             [junit4]   2>        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
             [junit4]   2>        at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:379)
             [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.updateZkStates(Overseer.java:358)
             [junit4]   2>        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:311)
             [junit4]   2>        at java.lang.Thread.run(Thread.java:745)
             [junit4]   2>
          
        4. This exception causes the "state" message executed for the 'halfcollectionblocker' collection to be lost. The message is still present in the work queue, but because the overseer is healthy it simply continues processing the main queue and never replays the work queue.
             [junit4]   2> 881993 T46259 oasc.ZkController.waitForShardId waiting to find shard id in clusterstate for halfcollection_shard1_replica1
             [junit4]   2> 1202711 T46259 oasc.CoreContainer.create ERROR Error creating core [halfcollection_shard1_replica1]: Could not get shard id for core: halfcollection_shard1_replica1 org.apache.solr.common.SolrException: Could not get shard id for core: halfcollection_shard1_replica1
             [junit4]   2>        at org.apache.solr.cloud.ZkController.waitForShardId(ZkController.java:1425)
             [junit4]   2>        at org.apache.solr.cloud.ZkController.doGetShardIdAndNodeNameProcess(ZkController.java:1371)
             [junit4]   2>        at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1513)
             [junit4]   2>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:504)
             [junit4]   2>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:484)
             [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminHandler.handleCreateAction(CoreAdminHandler.java:575)
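
        To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern (illustrative names and write order only, not the actual Overseer/updateZkStates code): updates are first applied to in-memory state and only flushed to ZK afterwards, so a single NoNode-style failure during the flush drops an update that was already consumed from the queue, while the loop itself keeps running.

        import java.util.ArrayDeque;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.Queue;

        // Sketch only -- not Solr's Overseer; names and write order are illustrative.
        public class MainQueueLossSketch {
            public static void main(String[] args) {
                Queue<String> mainQueue = new ArrayDeque<>(List.of(
                        "halfcollectionblocker",                      // belongs in the shared clusterstate.json
                        "awholynewstresscollection_collection4_1"));  // stateFormat=2, has its own state.json

                Map<String, String> inMemoryState = new LinkedHashMap<>();
                String msg;
                while ((msg = mainQueue.poll()) != null) {
                    inMemoryState.put(msg, "down");   // applied to in-memory cluster state only
                }
                try {
                    flushToZk(inMemoryState);
                } catch (IllegalStateException e) {
                    // Mirrors "Exception in Overseer main queue loop ... NoNodeException".
                    // The overseer stays healthy, so the work queue is never replayed and the
                    // halfcollectionblocker update simply never reaches ZooKeeper.
                    System.err.println("Exception in main queue loop: " + e.getMessage());
                }
            }

            static void flushToZk(Map<String, String> state) {
                for (String collection : state.keySet()) {
                    if (collection.startsWith("awholynewstresscollection")) {
                        // /collections/<name> was deleted concurrently, so creating state.json fails...
                        throw new IllegalStateException(
                                "NoNode for /collections/" + collection + "/state.json");
                    }
                }
                // ...and this write, which carries the halfcollectionblocker shard assignment, never happens.
                System.out.println("wrote clusterstate.json: " + state);
            }
        }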
          
        Shalin Shekhar Mangar added a comment -

        This bug affects trunk and branch_5x only. It was introduced after SOLR-5473 and does not affect any released version of Solr.

        Noble Paul added a comment -

        Shalin Shekhar Mangar, this would not happen when the collection is created through the Collections API?

        Shalin Shekhar Mangar added a comment -

        Noble Paul - I don't think this is affected by how a collection is created. The gist of the issue is that any exception thrown in updateZkStates for state version > 1 can cause updates for the main cluster state to be lost.

        Shalin Shekhar Mangar added a comment -

        So there are two separate bugs at play here:

        1. Due to an exception in persisting a collection with stateFormat > 1, main cluster state events can be lost.
        2. A rapid create+delete loop for collections with stateFormat > 1 causes the above exception to happen. This is because the updateZkState method assumes that the collection exists and tries to write to /collections/collection_name/state.json directly, without verifying that the /collections/collection_name zk node exists.
        Shalin Shekhar Mangar added a comment -

        I found a related (same?) bug while investigating #2 above. The overseer loop can sometimes use stale cluster state for collections with stateFormat > 1. This happens because ZkStateReader.removeZKWatch removes the collection from the 'watchedCollections' set but doesn't remove the cached state from the watchedCollectionStates map. So when the replica of a collection is unloaded, the watch is also removed but the cached state still exists. If the overseer happens to be on the same node which had hosted the replica, it will continue reading the old state, causing replica or leader information to be lost.

        I've added a test which reproduces the problem (it hangs for a long time on getLeaderRetry before failing to create the collection). The patch fixes the problem by removing the collection from watchedCollectionStates in ZkStateReader.removeZKWatch.
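
        A minimal sketch of that fix, using the field names mentioned above (this is not the actual ZkStateReader code):

        import java.util.Map;
        import java.util.Set;
        import java.util.concurrent.ConcurrentHashMap;

        // Sketch: evict the cached state together with the watch, otherwise readers on
        // this node keep serving the stale state after the replica is unloaded.
        class WatchedCollectionsSketch {
            private final Set<String> watchedCollections = ConcurrentHashMap.newKeySet();
            private final Map<String, Object> watchedCollectionStates = new ConcurrentHashMap<>();

            void addCollectionWatch(String collection, Object state) {
                watchedCollections.add(collection);
                watchedCollectionStates.put(collection, state);
            }

            void removeZKWatch(String collection) {
                watchedCollections.remove(collection);
                // The step that was missing before the patch: drop the cached entry too.
                watchedCollectionStates.remove(collection);
            }
        }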

        ASF subversion and git services added a comment -

        Commit 1634243 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1634243 ]

        SOLR-6591: Overseer can use stale cluster state and lose updates for collections with stateFormat > 1

        ASF subversion and git services added a comment -

        Commit 1634246 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1634246 ]

        SOLR-6591: Overseer can use stale cluster state and lose updates for collections with stateFormat > 1

        Shalin Shekhar Mangar added a comment - edited

        The overseer can still use stale cached cluster state because ZkStateReader.updateClusterState returns cached state for watched collections. Here is a patch which fixes that by reading collection state live during ZkStateReader.updateClusterState and setting it into the watchedCollectionStates map.
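
        As a rough illustration (hypothetical names and signature, not the actual ZkStateReader code), the refresh amounts to re-reading each watched collection's state from ZK and overwriting the cached copy:

        import java.util.Map;
        import java.util.Set;
        import java.util.concurrent.ConcurrentHashMap;
        import java.util.function.Function;

        // Sketch only: on a full cluster state update, read each watched collection's
        // state live (e.g. from /collections/<name>/state.json) instead of trusting
        // the cached copy, then replace the cache with what was just read.
        class ClusterStateRefreshSketch {
            final Set<String> watchedCollections = ConcurrentHashMap.newKeySet();
            final Map<String, Object> watchedCollectionStates = new ConcurrentHashMap<>();

            void updateClusterState(Function<String, Object> readLiveStateFromZk) {
                for (String collection : watchedCollections) {
                    Object liveState = readLiveStateFromZk.apply(collection);  // live ZK read
                    watchedCollectionStates.put(collection, liveState);        // refresh the cache
                }
            }
        }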

        ASF subversion and git services added a comment -

        Commit 1634554 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1634554 ]

        SOLR-6591: ZkStateReader.updateClusterState should refresh cluster state for watched collections

        ASF subversion and git services added a comment -

        Commit 1634555 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1634555 ]

        SOLR-6591: ZkStateReader.updateClusterState should refresh cluster state for watched collections

        Shalin Shekhar Mangar added a comment -

        Right now, main cluster state updates are batched together, while updates to collections with stateFormat > 1 are not batched (I'll create another issue for that). However, updates to both can end up mixed in a single batch: e.g. if the overseer gets 5 messages for the main cluster state and then 1 for a stateFormat > 1 collection, the resulting updates are written to ZK together. This is error prone and we shouldn't batch updates for different stateFormats together.

        This patch tracks the last stateFormat for which a message was processed and breaks out of the batching loop when a different one is encountered.
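
        The rule can be sketched like this (simplified, with a hypothetical message type and method; the real Overseer loop has more to it):

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Queue;

        // Sketch only: stop draining the queue as soon as a message with a different
        // stateFormat shows up, so each batch contains messages of a single stateFormat.
        public class StateFormatBatchingSketch {
            record StateMessage(String collection, int stateFormat) {}

            static List<StateMessage> nextBatch(Queue<StateMessage> queue) {
                List<StateMessage> batch = new ArrayList<>();
                Integer lastStateFormat = null;
                StateMessage msg;
                while ((msg = queue.peek()) != null) {
                    if (lastStateFormat != null && msg.stateFormat() != lastStateFormat) {
                        break;  // different stateFormat: leave it for the next batch
                    }
                    lastStateFormat = msg.stateFormat();
                    batch.add(queue.poll());
                }
                return batch;
            }

            public static void main(String[] args) {
                Queue<StateMessage> queue = new ArrayDeque<>(List.of(
                        new StateMessage("coll_a", 1), new StateMessage("coll_b", 1),
                        new StateMessage("coll_c", 2), new StateMessage("coll_d", 1)));
                System.out.println(nextBatch(queue));  // coll_a, coll_b  (stateFormat 1)
                System.out.println(nextBatch(queue));  // coll_c          (stateFormat 2)
                System.out.println(nextBatch(queue));  // coll_d          (stateFormat 1)
            }
        }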

        ASF subversion and git services added a comment -

        Commit 1634684 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1634684 ]

        SOLR-6591: Do not batch updates for different stateFormats together

        ASF subversion and git services added a comment -

        Commit 1634685 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1634685 ]

        SOLR-6591: Do not batch updates for different stateFormats together

        Shalin Shekhar Mangar added a comment -

        A rapid create+delete loop for collections with stateFormat > 1 causes the above exception to happen. This is because the updateZkState method assumes that the collection exists and tries to write to /collections/collection_name/state.json directly, without verifying that the /collections/collection_name zk node exists.

        This patch ignores state messages which are trying to create new collections when the parent zk path doesn't exist. I've added the following comment in the code to explain the situation:

        // if the /collections/collection_name path doesn't exist then it means that
        // 1) the user invoked a DELETE collection API and the OverseerCollectionProcessor has deleted
        // this zk path.
        // 2) these are most likely old "state" messages which are only being processed now because
        // if they were new "state" messages then in legacy mode, a new collection would have been
        // created with stateFormat = 1 (which is the default state format)
        // 3) these can't be new "state" messages created for a new collection because
        // otherwise the OverseerCollectionProcessor would have already created this path
        // as part of the create collection API call – which is the only way in which a collection
        // with stateFormat > 1 can possibly be created
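
        For illustration, the guard boils down to something like this sketch against the raw ZooKeeper client API (the helper name is hypothetical; the real change lives in Solr's overseer code and also has to cope with a delete racing in between the check and the write):

        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        // Sketch only: skip the state.json write when /collections/<name> is gone,
        // because that means the collection was deleted (or never properly created).
        public class GuardedStateWriteSketch {

            static void writeStateJson(ZooKeeper zk, String collection, byte[] stateJson)
                    throws KeeperException, InterruptedException {
                String parent = "/collections/" + collection;
                if (zk.exists(parent, false) == null) {
                    // Parent node missing: ignore this (stale) state message instead of
                    // failing the whole queue-loop iteration with a NoNodeException.
                    System.out.println("Skipping state update for missing collection: " + collection);
                    return;
                }
                String statePath = parent + "/state.json";
                if (zk.exists(statePath, false) == null) {
                    zk.create(statePath, stateJson, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                } else {
                    zk.setData(statePath, stateJson, -1);  // -1 = write regardless of version
                }
            }
        }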

        Shalin Shekhar Mangar added a comment -

        With the right patch (SOLR-6591-ignore-no-collection-path.patch) this time.

        ASF subversion and git services added a comment -

        Commit 1636400 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1636400 ]

        SOLR-6591: Ignore overseer operations for collections with stateFormat > 1 if the parent ZK path doesn't exist

        ASF subversion and git services added a comment -

        Commit 1636401 from shalin@apache.org in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1636401 ]

        SOLR-6591: Ignore overseer operations for collections with stateFormat > 1 if the parent ZK path doesn't exist

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Shalin Shekhar Mangar
          • Votes: 0
          • Watchers: 7
