Solr
  1. Solr
  2. SOLR-6402

OverseerCollectionProcessor should not exit for ZK ConnectionLoss

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 4.8, 6.0
    • Fix Version/s: 4.10, 6.0
    • Component/s: SolrCloud
    • Labels:
      None

      Description

      We saw an occurrence where we had some ZK connection blip and the OverseerCollectionProcessor thread stopped but the ClusterStateUpdater output some error but kept running, and the node didn't lose its leadership. this caused our collection work queue to back up.

      Right now OverseerCollectionProcessor's run method has on trunk:

      344 if (e.code() == KeeperException.Code.SESSIONEXPIRED
      345 || e.code() == KeeperException.Code.CONNECTIONLOSS) {
      346 log.warn("Overseer cannot talk to ZK");
      347 return;
      348 }

      I think this if statement should only be for SESSIONEXPIRED. If it just experiences a connection loss but then reconnect before the session expired, it'll keep being the leader.

        Issue Links

          Activity

          Hide
          Mark Miller added a comment -

          Indeed!

          Show
          Mark Miller added a comment - Indeed!
          Hide
          Mark Miller added a comment -

          Any chance you have a stack trace or something?

          What surprises me is that commands should be retried passed expiration - that takes a bit of guesswork and padding, but I'm surprised a connectionloss leaked out that didn't lead to an expiration.

          Show
          Mark Miller added a comment - Any chance you have a stack trace or something? What surprises me is that commands should be retried passed expiration - that takes a bit of guesswork and padding, but I'm surprised a connectionloss leaked out that didn't lead to an expiration.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619641 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1619641 ]

          SOLR-6402: OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.

          Show
          ASF subversion and git services added a comment - Commit 1619641 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1619641 ] SOLR-6402 : OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619643 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1619643 ]

          SOLR-6402: OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.

          Show
          ASF subversion and git services added a comment - Commit 1619643 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619643 ] SOLR-6402 : OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619645 from Mark Miller in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1619645 ]

          SOLR-6402: OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.

          Show
          ASF subversion and git services added a comment - Commit 1619645 from Mark Miller in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1619645 ] SOLR-6402 : OverseerCollectionProcessor should not exit for ZooKeeper ConnectionLoss.
          Hide
          Jessica Cheng Mallet added a comment -

          Unfortunately since it just logs and return, I just have the log line

          2014-08-21 10:51:39,060 WARN [Overseer-164353762238923913-<scrubbed IP>:8983_solr-n_0000000757] OverseerCollectionProcessor.java (line 350) Overseer cannot talk to ZK

          Unfortunately, even though amILeader() tries to handle connection loss, there are lots of other operations past the if check that don't. E.g. all those workqueue manipulations.

          Show
          Jessica Cheng Mallet added a comment - Unfortunately since it just logs and return, I just have the log line 2014-08-21 10:51:39,060 WARN [Overseer-164353762238923913-<scrubbed IP>:8983_solr-n_0000000757] OverseerCollectionProcessor.java (line 350) Overseer cannot talk to ZK Unfortunately, even though amILeader() tries to handle connection loss, there are lots of other operations past the if check that don't. E.g. all those workqueue manipulations.
          Hide
          Jessica Cheng Mallet added a comment -

          And thanks for fixing!

          Show
          Jessica Cheng Mallet added a comment - And thanks for fixing!
          Hide
          Mark Miller added a comment -

          there are lots of other operations past the if check that don't. E.g. all those workqueue manipulations.

          That should not be the case. All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor to retry on connection loss passed expiration unless explicitly asked not to.

          Show
          Mark Miller added a comment - there are lots of other operations past the if check that don't. E.g. all those workqueue manipulations. That should not be the case. All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor to retry on connection loss passed expiration unless explicitly asked not to.
          Hide
          Anshum Gupta added a comment -

          All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor

          AFAIR, that's the case.

          Show
          Anshum Gupta added a comment - All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor AFAIR, that's the case.
          Hide
          Jessica Cheng Mallet added a comment -

          All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor to retry on connection loss passed expiration unless explicitly asked not to.

          Ah, I missed that.

          So I took a look at ZkCmdExecutor.retryOperation(), we have this effect (for the default of 15s timeout and therefore retryCount=5):
          i sleep
          0 0s
          1 1.5s
          2 3s
          3 4.5s
          4 6s

          which adds up to 15s, the timeout. However, what if on loop i=4, the operation threw connection loss again, but then since the sleep is at the end of the catch block, while it slept the last time for 6s, the client reconnected so the session didn't expire? Maybe the intended thing is to do retryDelay(i+1) so that it would've slept 1.5s when i=0,..., and 6s when i=3, but retry i=4 at the end of 15s?

          Disclaimer that I actually don't know that what I think may have happened happened at all, since, like I said, I only have that one log message and the fact that while OverseerCollectionProcessor died, the ClusterStateUpdater didn't die.

          Show
          Jessica Cheng Mallet added a comment - All ZK manipulation should be through SolrZkClient, which should use ZkCmdExecutor to retry on connection loss passed expiration unless explicitly asked not to. Ah, I missed that. So I took a look at ZkCmdExecutor.retryOperation(), we have this effect (for the default of 15s timeout and therefore retryCount=5): i sleep 0 0s 1 1.5s 2 3s 3 4.5s 4 6s which adds up to 15s, the timeout. However, what if on loop i=4, the operation threw connection loss again, but then since the sleep is at the end of the catch block, while it slept the last time for 6s, the client reconnected so the session didn't expire? Maybe the intended thing is to do retryDelay(i+1) so that it would've slept 1.5s when i=0,..., and 6s when i=3, but retry i=4 at the end of 15s? Disclaimer that I actually don't know that what I think may have happened happened at all, since, like I said, I only have that one log message and the fact that while OverseerCollectionProcessor died, the ClusterStateUpdater didn't die.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619792 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1619792 ]

          SOLR-6402: Only exit the thread loop on a KeeperException if it's expiration.

          Show
          ASF subversion and git services added a comment - Commit 1619792 from Mark Miller in branch 'dev/trunk' [ https://svn.apache.org/r1619792 ] SOLR-6402 : Only exit the thread loop on a KeeperException if it's expiration.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619794 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1619794 ]

          SOLR-6402: Only exit the thread loop on a KeeperException if it's expiration.

          Show
          ASF subversion and git services added a comment - Commit 1619794 from Mark Miller in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1619794 ] SOLR-6402 : Only exit the thread loop on a KeeperException if it's expiration.
          Hide
          ASF subversion and git services added a comment -

          Commit 1619795 from Mark Miller in branch 'dev/branches/lucene_solr_4_10'
          [ https://svn.apache.org/r1619795 ]

          SOLR-6402: Only exit the thread loop on a KeeperException if it's expiration.

          Show
          ASF subversion and git services added a comment - Commit 1619795 from Mark Miller in branch 'dev/branches/lucene_solr_4_10' [ https://svn.apache.org/r1619795 ] SOLR-6402 : Only exit the thread loop on a KeeperException if it's expiration.
          Hide
          Mark Miller added a comment -

          Maybe the intended thing is to do

          Yes! That sounds off. The intention is to try very hard to retry beyond expiration as we are guestimating time on the zk server. This is fairly important. I'll work on a test.

          Show
          Mark Miller added a comment - Maybe the intended thing is to do Yes! That sounds off. The intention is to try very hard to retry beyond expiration as we are guestimating time on the zk server. This is fairly important. I'll work on a test.
          Hide
          Mark Miller added a comment -

          SOLR-6405 ZooKeeper calls can easily not be retried enough on ConnectionLoss.

          Show
          Mark Miller added a comment - SOLR-6405 ZooKeeper calls can easily not be retried enough on ConnectionLoss.
          Hide
          Mark Miller added a comment -

          Thanks Jessica!

          Show
          Mark Miller added a comment - Thanks Jessica!

            People

            • Assignee:
              Mark Miller
              Reporter:
              Jessica Cheng Mallet
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development