Solr / SOLR-5325

zk connection loss causes overseer leader loss

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.3, 4.4, 4.5
    • Fix Version/s: 4.5.1, 4.6, 6.0
    • Component/s: None
    • Labels:
      None

      Description

      The problem we saw was that when the Solr overseer leader experienced temporary ZooKeeper connectivity problems, it stopped processing overseer queue events.

      This first happened when quorum within the external ZooKeeper ensemble was lost because too many ZooKeeper servers had been stopped (similar to SOLR-5199). The second time, enough ZooKeeper servers were running, but they were holding leader elections and therefore refused connections. The elections were taking several seconds; we were using the default zookeeper.cnxTimeout=5s value, and that timeout was hit for one ensemble member.
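      The failure mode described above can be illustrated with a small sketch of a queue-processing loop that treats connection loss as transient and stops only when the session has expired. This is not the code from the attached patches. The exception types, the WorkQueue interface, and the drain method are hypothetical stand-ins rather than Solr or ZooKeeper APIs, and unlike a real overseer loop this one returns when the queue is empty so that it can be run as a demo.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these names come from Solr or ZooKeeper. */
class QueueLoopSketch {
    /** Hypothetical stand-in for a transient connection-loss error. */
    static class ConnectionLost extends Exception {}

    /** Hypothetical stand-in for session expiration (leadership is gone). */
    static class SessionExpired extends Exception {}

    /** Hypothetical queue abstraction; poll() returns null when the queue is empty. */
    interface WorkQueue {
        String poll() throws ConnectionLost, SessionExpired;
    }

    /** Processes items until the queue is empty or the session expires. */
    static List<String> drain(WorkQueue queue, long backoffMs) throws InterruptedException {
        List<String> processed = new ArrayList<>();
        while (true) {
            try {
                String item = queue.poll();
                if (item == null) {
                    return processed; // demo only: a long-running loop would wait for more work
                }
                processed.add(item);
            } catch (ConnectionLost e) {
                // Transient problem: back off and try again instead of leaving the loop.
                Thread.sleep(backoffMs);
            } catch (SessionExpired e) {
                // The session is gone, so this node should stop acting as leader.
                return processed;
            }
        }
    }
}
```

      The point of the sketch is the split between the two catch blocks: a loop that exits on the first connection-loss error would stop processing events even though its session, and so its leadership, might still be valid.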

      1. SOLR-5325.patch
        11 kB
        Mark Miller
      2. SOLR-5325.patch
        7 kB
        Mark Miller
      3. SOLR-5325.patch
        14 kB
        Christine Poerschke

          Activity

          Christine Poerschke added a comment -

          Attaching an Overseer.java patch for Solr 4.4.0; OverseerCollectionProcessor.java could be changed in a similar way.

          Mark Miller added a comment -

          Thanks guys - I'll try and get this in quickly as it would be great to fix it for 4.5.1.

          Mark Miller added a comment -

          Quick first pass patch.

          Mark Miller added a comment -

          New patch: A fix to the OverseerCollectionProcessor fix and some more random testing that attempts to catch this - it doesn't seem to yet though.

          ASF subversion and git services added a comment -

          Commit 1531313 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1531313 ]

          SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.

          ASF subversion and git services added a comment -

          Commit 1531315 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1531315 ]

          SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.

          Mark Miller added a comment - edited

          Added some more testing that I thought would catch it, but it has not yet on my system. Still poking around a bit.

          Anyway, I've committed the fix.

          Mark Miller added a comment -

          I'm still kind of surprised this would happen - we should be retrying on connection loss up to session expiration, at which point we would no longer be the leader. Perhaps the retry window is a little too short. That may also be part of why it is more difficult for me to reproduce in a test.

          ASF subversion and git services added a comment -

          Commit 1531323 from Mark Miller in branch 'dev/trunk'
          [ https://svn.apache.org/r1531323 ]

          SOLR-5325: raise retry padding a bit

          ASF subversion and git services added a comment -

          Commit 1531324 from Mark Miller in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1531324 ]

          SOLR-5325: raise retry padding a bit

          ASF subversion and git services added a comment -

          Commit 1531325 from Mark Miller in branch 'dev/branches/lucene_solr_4_5'
          [ https://svn.apache.org/r1531325 ]

          SOLR-5325: ZooKeeper connection loss can cause the Overseer to stop processing commands.

          ASF subversion and git services added a comment -

          Commit 1531327 from Mark Miller in branch 'dev/branches/lucene_solr_4_5'
          [ https://svn.apache.org/r1531327 ]

          SOLR-5325: raise retry padding a bit

          Mark Miller added a comment -

          I think the reason this is hard to catch in a test is that we retry on connection loss up to the expiration time - there must be some case where we were still getting a connection loss and no expiration, though. This issue should handle that case for this particular bit of code, but as an overall precautionary measure I have also bumped up the retries just a bit to try to ensure they go beyond the session timeout.
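          The idea of retrying past the session timeout can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the committed change: ConnectionLossException and ConnectionLossRetry are hypothetical names, the delay between attempts is fixed, and the padding parameter simply stands for the extra margin added on top of the session timeout.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a connection-loss error; not the real ZooKeeper class. */
class ConnectionLossException extends Exception {
    ConnectionLossException(String message) {
        super(message);
    }
}

/** Illustrative sketch: retries an operation on connection loss for longer than the session timeout. */
class ConnectionLossRetry {
    private final int retryCount;
    private final long retryDelayMs;

    ConnectionLossRetry(long sessionTimeoutMs, long retryDelayMs, long paddingMs) {
        this.retryDelayMs = retryDelayMs;
        // Round up so that retryCount * retryDelayMs >= sessionTimeoutMs + paddingMs.
        this.retryCount = (int) ((sessionTimeoutMs + paddingMs + retryDelayMs - 1) / retryDelayMs);
    }

    int retryCount() {
        return retryCount;
    }

    /** Runs op, retrying only on connection loss; any other exception propagates immediately. */
    <T> T run(Callable<T> op) throws Exception {
        ConnectionLossException last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLossException e) {
                last = e;
                if (attempt < retryCount) {
                    Thread.sleep(retryDelayMs); // wait before the next attempt
                }
            }
        }
        throw last; // still failing after the whole retry budget was used
    }
}
```

          With these assumptions, the total time spent waiting between attempts is at least the session timeout plus the padding, so by the time the retries run out a still-disconnected client should have seen its session expire rather than only a connection loss.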

          Mark Miller added a comment -

          Thanks Christine!


            People

            • Assignee:
              Mark Miller
              Reporter:
              Christine Poerschke
            • Votes:
              0
              Watchers:
              7
