Solr / SOLR-5961

Solr gets crazy on /overseer/queue state change

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 4.7.1
    • Fix Version/s: 4.10.4, 5.0
    • Component/s: SolrCloud
    • Labels: None
    • Environment:

      CentOS, 1 shard - 3 replicas, ZK cluster with 3 nodes (separate machines)

      Description

      No idea how to reproduce it, but sometimes Solr starts littering the log with the following messages:

      419158 [localhost-startStop-1-EventThread] INFO org.apache.solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged

      419190 [Thread-3] INFO org.apache.solr.cloud.Overseer - Update state numShards=1 message={
        "operation":"state",
        "state":"recovering",
        "base_url":"http://${IP_ADDRESS}/solr",
        "core":"${CORE_NAME}",
        "roles":null,
        "node_name":"${NODE_NAME}_solr",
        "shard":"shard1",
        "collection":"${COLLECTION_NAME}",
        "numShards":"1",
        "core_node_name":"core_node2"}

      It keeps spamming these messages with no delay, and restarting all the nodes does not help. I have even tried stopping all the nodes in the cluster first, but when I start one again the behavior does not change; it goes crazy with this "/overseer/queue state" spam again.

      PS: The only way to handle this was to stop everything, manually clean up all the Solr-related data in ZooKeeper, and then rebuild everything from scratch. As you can imagine, that is unacceptable in a production environment.
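      A quick way to confirm this condition is to check how many children have piled up under /overseer/queue. Below is a minimal sketch using the plain ZooKeeper Java client; the connect string and the absence of a chroot are assumptions:

      // Sketch: report the /overseer/queue backlog (assumes ZK at localhost:2181, no chroot).
      import org.apache.zookeeper.ZooKeeper;
      import org.apache.zookeeper.data.Stat;

      public class OverseerQueueDepth {
          public static void main(String[] args) throws Exception {
              ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> { });
              Stat stat = zk.exists("/overseer/queue", false);
              if (stat == null) {
                  System.out.println("/overseer/queue does not exist");
              } else {
                  // The same figure zkCli's 'get' prints as numChildren; tens of thousands means the queue is flooded.
                  System.out.println("numChildren = " + stat.getNumChildren());
              }
              zk.close();
          }
      }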


          Activity

          Ugo Matrangolo added a comment -

          Hi,

           we saw the same behaviour, and it caused a major outage. After a rolling restart of our production ZK cluster, /overseer/queue started to be spammed with events and the Overseer kept looping on the following:

          2014-08-28 22:43:06,753 [main-EventThread] INFO org.apache.solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
          2014-08-28 22:43:06,820 [Thread-125] INFO org.apache.solr.cloud.Overseer - Update state numShards=3 message={
          "operation":"state",
          "state":"recovering",
          "base_url":"http://

          {ip}:{port}",
          "core":"warehouse-skus_shard2_replica2",
          "roles":null,
          "node_name":"{ip}

          :

          {port}

          _",
          "shard":"shard2",
          "collection":"warehouse-skus",
          "numShards":"3",
          "core_node_name":"core_node4"}

           The fix was to:
           1. Shut down the cluster completely.
           2. Using zkCli, rmr all the queue znodes under /overseer (sketched below).
           3. Restart.

           Solr started fine after that, as if nothing had happened.

           The outage was caused by clusterstate.json being in an inconsistent state (most of the nodes were incorrectly marked as down), with no chance of it getting fixed because the Overseer was too busy processing the flooded queues (more than 13k messages in there) to bring clusterstate.json up to date.
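           A minimal sketch of step 2 using the plain ZooKeeper Java client instead of zkCli (the connect string is an assumption, any chroot in use must be prefixed to the path, and it should only be run while every Solr node is stopped):

           // Sketch: drain /overseer/queue while the whole Solr cluster is stopped.
           import org.apache.zookeeper.ZooKeeper;

           public class DrainOverseerQueue {
               public static void main(String[] args) throws Exception {
                   // Connect string is an assumption; adjust host/port and chroot for your cluster.
                   ZooKeeper zk = new ZooKeeper("zookeeper1:2181", 15000, event -> { });
                   String queue = "/overseer/queue";
                   // Queue entries are flat children of the queue znode, so a non-recursive delete is enough.
                   for (String child : zk.getChildren(queue, false)) {
                       zk.delete(queue + "/" + child, -1);  // -1 = delete regardless of version
                   }
                   System.out.println("Remaining children: " + zk.getChildren(queue, false).size());
                   zk.close();
               }
           }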

          Frans Lawaetz added a comment -

           We saw this issue as well. Solr filled 2 TB worth of disk, logging hundreds of those messages per millisecond. In our case it may have been related to the ZooKeeper /overseer/queue becoming so overloaded with znodes that it was unable to function normally.

          Ugo Matrangolo added a comment -

          Happened again :/

           After routine network maintenance caused a 30-second connectivity hiccup, the Solr cluster started to spam /overseer/queue with more than 47k events.

          [zk: zookeeper4:2181(CONNECTED) 26] get /gilt/config/solr/overseer/queue
          null
          cZxid = 0x290008df29
          ctime = Fri Aug 29 02:06:47 GMT+00:00 2014
          mZxid = 0x290008df29
          mtime = Fri Aug 29 02:06:47 GMT+00:00 2014
          pZxid = 0x290023cedd
          cversion = 60632
          dataVersion = 0
          aclVersion = 0
          ephemeralOwner = 0x0
          dataLength = 0
          numChildren = 47822
          [zk: zookeeper4:2181(CONNECTED) 27]
          

           This time we tried waiting for it to heal itself, and we watched the numChildren count go down but then climb back up again: there was no way it was going to fix itself.

           As before, we had to shut down the whole cluster, rmr /overseer/queue, and restart.

          Annoying :/

          Gopal Patwa added a comment -

           We also had a similar problem today in our production system, as Ugo mentioned. It was caused by rebooting the ZooKeeper machines (5 nodes) and the 8-node SolrCloud cluster (single shard) to install a Unix security patch.

          JDK 7, Solr 4.10.3, CentOS

           But after the reboot, we saw a huge number of messages in /overseer/queue:

          ./zkCli.sh -server localhost:2181 ls /search/catalog/overseer/queue | sed 's/,/\n/g' | wc -l
          178587

           We have a very small cluster (8 nodes); how can /overseer/queue end up with 178k+ messages? Because of this, the leader node took a few hours to come out of recovery.

          Logs from zookeeper:
          org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/queue

          Shalin Shekhar Mangar added a comment -

          Hi Gopal, is there anything else in the logs? Any other kinds of exceptions? What was the sequence of steps for installing this patch – did you take down ZooKeeper first or Solr nodes?

          Mark Miller added a comment -

          I think part of the problem here is SOLR-7033. The other part is that we should probably make sure there is a minimum time between recovery retries on failure in all cases.

          Mark Miller added a comment -

          The other part is that we should probably make sure there is a minimum time between recovery retries on failure in all cases.

          I've added that to my patch in SOLR-7033.
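          For illustration only, a minimal sketch of that idea (a hypothetical helper class, not the actual SOLR-7033 patch; the 2-second floor is an assumption):

          import java.util.concurrent.TimeUnit;

          // Hypothetical sketch: guarantee a minimum pause between successive recovery
          // attempts, even when a recovery is cancelled and immediately restarted.
          public class RecoveryPacer {
              private static final long MIN_PAUSE_MS = TimeUnit.SECONDS.toMillis(2); // assumed floor

              private long lastAttemptMs = 0L;

              // Call before every recovery attempt; sleeps only if the previous attempt was too recent.
              public synchronized void pauseBeforeAttempt() throws InterruptedException {
                  long elapsed = System.currentTimeMillis() - lastAttemptMs;
                  if (lastAttemptMs != 0L && elapsed < MIN_PAUSE_MS) {
                      Thread.sleep(MIN_PAUSE_MS - elapsed);
                  }
                  lastAttemptMs = System.currentTimeMillis();
              }
          }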

          ASF subversion and git services added a comment -

          Commit 1658236 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1658236 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          ASF subversion and git services added a comment -

          Commit 1658237 from Mark Miller in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1658237 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          ASF subversion and git services added a comment -

          Commit 1658251 from Mark Miller in branch 'dev/branches/lucene_solr_5_0'
          [ https://svn.apache.org/r1658251 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard.

          Gopal Patwa added a comment -

           Thanks Mark. Here are more details on our production issue; I will try to reproduce it.

          Restart sequence:
           Solr - restarted 02/03/2015 (8 nodes, 10 collections)
           ZK - restarted 02/04/2015 (5 nodes)

           Normal index size is approximately 5 GB. Only a few nodes had this issue.

           While a replica was in recovery, the transaction log size was over 100 GB.
           A possible reason is that it writes all the updates sent by the leader during this period to the transaction log.

           Because the overseer queue was so large, the Admin UI Cloud tree view hung; this may be similar to the JIRA issue below:
           https://issues.apache.org/jira/browse/SOLR-6395

          Exceptions During this time:

          Zookeeper Log:
          org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/queue

          Solr Log:

          2015-02-05 23:23:13,174 [] priority=ERROR app_name= thread=RecoveryThread location=RecoveryStrategy line=142 Error while trying to recover. core=city_shard1_replica2:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: I was asked to wait on state recovering for shard1 in city on srwp01usc002.stubprod.com:8080_solr but I still do not see the requested state. I see state: active live:true leader from ZK: http://srwp01usc001.stubprod.com:8080/solr/city_shard1_replica1/
          at java.util.concurrent.FutureTask.report(FutureTask.java:122)
          at java.util.concurrent.FutureTask.get(FutureTask.java:188)
          at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615)

          Mark Miller added a comment -

          Yeah, once the Overseer work queue is flooded beyond keeping up, all bets are off because the cluster state won't change. SOLR-7033 puts in limits to avoid the behavior we know of that can cause this. We should not have any paths that can write to zk so frequently.

          Steve Rowe added a comment -

          Fixed as part of SOLR-7033.

          ASF subversion and git services added a comment -

          Commit 1662784 from Steve Rowe in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1662784 ]

          SOLR-7033, SOLR-5961: RecoveryStrategy should not publish any state when closed / cancelled and there should always be a pause between recoveries even when recoveries are rapidly stopped and started as well as when a node attempts to become the leader for a shard. (merged branch_5x r1658237)

          Michael McCandless added a comment -

          Bulk close for 4.10.4 release

          davidchiu added a comment -

           I have the same problem in Solr 5.3.1. There are too many collections and shards (replicas); when their state changes, a lot of tasks are produced and written into /overseer/queue, but the Overseer is too busy to handle them in time, so /overseer/queue gets very big and eventually brings ZooKeeper down!!

           I think Solr should limit the size of /overseer/queue according to ZooKeeper's jute.maxbuffer setting.
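           One way to picture that suggestion is the standalone sketch below. It is purely hypothetical (not Solr code); the threshold is an assumption, and "qn-" is the prefix Solr's DistributedQueue uses for queue entries:

           import org.apache.zookeeper.CreateMode;
           import org.apache.zookeeper.ZooDefs;
           import org.apache.zookeeper.ZooKeeper;

           // Hypothetical sketch: refuse (or delay) new queue entries once /overseer/queue
           // grows past a limit, instead of letting it grow until ZooKeeper itself falls over.
           public class BoundedOverseerEnqueue {
               private static final int MAX_QUEUE_CHILDREN = 10_000; // assumed limit

               public static boolean tryEnqueue(ZooKeeper zk, byte[] message) throws Exception {
                   int backlog = zk.getChildren("/overseer/queue", false).size();
                   if (backlog >= MAX_QUEUE_CHILDREN) {
                       return false; // caller should back off and retry later
                   }
                   zk.create("/overseer/queue/qn-", message,
                           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
                   return true;
               }
           }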


            People

             • Assignee: Shalin Shekhar Mangar
             • Reporter: Maxim Novikov
             • Votes: 4
             • Watchers: 13
