Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2783

follower disconnects and cannot reconnect

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 3.4.10
    • Fix Version/s: 3.4.11, 3.5.4
    • Component/s: leaderElection
    • Labels:
      None
    • Environment:

      centos 7, AWS EC2

      Description

      We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well), and sometimes, a node gets a read timeout, drops all the connections and tries to re-establish itself to the quorum. It can usually do this in a few seconds, but last night it took almost 15 minutes to reconnect.

      These are 5 servers in AWS, and we've tried tuning the timeouts, but the are exceeding any reasonable timeout and still failing.

      In the attached logs, 5 is a follower, 3 is the leader. 5 loses connectivity at 11:21:34. 3 sees the disconnect at the same moment.

      5 tries to re-establish the quorum, but cannot do it until the connections to the other servers expire at 11:37:02. After the connections are re-established, 5 connects immediately.

      At 11:41:08, the operator restarted the server, and it reconnected normally.

      I suspect there is a problem with stale connections to the rest of the quorum - the other services on this box were fine (monitoring, puppet) and able to establish new connections with no problems.

      I posed this problem to the zookeeper-users list and was asked to open a ticket.

        Attachments

        1. fail5.log
          50 kB
          Ben Sherman
        2. fail3.log
          65 kB
          Ben Sherman

          Activity

            People

            • Assignee:
              bensherman Ben Sherman
              Reporter:
              bensherman Ben Sherman
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: