ZooKeeper / ZOOKEEPER-2202

Cluster crashes when reconfig adds an unreachable observer


Details

    • Bug
    • Status: Patch Available
    • Major
    • Resolution: Unresolved
    • 3.5.0, 3.6.0
    • None
    • None
    • None

    Description

      While adding support for reconfig() to Kazoo (https://github.com/python-zk/kazoo/pull/333), I found that the cluster can be crashed by adding an observer whose election port isn't reachable (i.e., packets to that destination are dropped, not rejected). The connection attempt raises a SocketTimeoutException, which brings down the PrepRequestProcessor:

      2015-06-02 14:37:16,473 [myid:3] - WARN  [ProcessThread(sid:3 cport:-1)::QuorumCnxManager@384] - Cannot open channel to 100 at election address /8.8.8.8:38703
      java.net.SocketTimeoutException: connect timed out
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
              at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
              at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
              at java.net.Socket.connect(Socket.java:589)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:369)
              at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1288)
              at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1315)
              at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1056)
              at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
              at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:877)
              at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:143)
      

      A simple repro: use the code in the pull request referenced above, but pass 8.8.8.8:3888 (for example) instead of a free (but closed) port on the loopback interface.
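
The difference between a rejected and a dropped connection can be seen outside ZooKeeper with a plain socket. The following is a minimal sketch, not ZooKeeper or Kazoo code: `ConnectProbe`, `probe`, and `closedLoopbackPort` are hypothetical names introduced only for illustration. A closed loopback port usually fails at once with a `ConnectException`, while a destination that silently drops packets blocks the calling thread until the connect timeout expires and then throws `SocketTimeoutException`, as in the trace above.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration; not part of ZooKeeper or Kazoo.
final class ConnectProbe {

    /** Tries a TCP connect and reports how the attempt ended. */
    static String probe(InetSocketAddress addr, int timeoutMs) {
        try (Socket s = new Socket()) {
            // Blocks the calling thread for up to timeoutMs milliseconds.
            s.connect(addr, timeoutMs);
            return "connected";
        } catch (SocketTimeoutException e) {
            // No answer at all within the timeout (e.g. packets dropped).
            return "timed out";
        } catch (ConnectException e) {
            // The attempt failed actively, typically "Connection refused".
            return "rejected";
        } catch (IOException e) {
            return "error: " + e.getMessage();
        }
    }

    /** Returns a loopback port that was just released, so nothing should be listening on it. */
    static int closedLoopbackPort() throws IOException {
        try (ServerSocket ss = new ServerSocket(0)) {
            return ss.getLocalPort();
        }
    }
}
```

Calling `ConnectProbe.probe(new InetSocketAddress("127.0.0.1", ConnectProbe.closedLoopbackPort()), 1000)` should normally return quickly with the rejected case. Pointing the same call at an address such as 8.8.8.8:3888 would be expected to block for the full timeout if packets on that path are in fact dropped.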

      I think that adding an Observer (or a Participant) that isn't currently reachable is a valid use case (e.g., the machine is still being provisioned and isn't needed yet), so we should handle it. Lower connect timeouts might be one way to do that, but I'm not sure.
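
Besides lowering the timeout, another way to keep a slow connect from stalling request processing would be to run the attempt on a separate thread. The sketch below only illustrates that idea under stated assumptions. It is not the attached patch and not ZooKeeper's implementation, and `AsyncConnector` and `connectAsync` are hypothetical names.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; class and method names are illustrative only.
final class AsyncConnector implements AutoCloseable {

    // Daemon threads so a hung connect never keeps the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "async-connect");
        t.setDaemon(true);
        return t;
    });

    /**
     * Starts a connect attempt on the pool and returns immediately.
     * The future completes with true if the peer accepted the connection,
     * false if the attempt was rejected, timed out, or otherwise failed.
     */
    CompletableFuture<Boolean> connectAsync(InetSocketAddress addr, int timeoutMs) {
        return CompletableFuture.supplyAsync(() -> {
            try (Socket s = new Socket()) {
                s.connect(addr, timeoutMs);
                return true;
            } catch (IOException e) {
                // Covers SocketTimeoutException and ConnectException alike.
                return false;
            }
        }, pool);
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

With this shape the caller decides whether to wait on the returned future at all, so an unreachable peer costs a background thread for the duration of the timeout rather than blocking the thread that issued the request.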

      Attachments

        1. ZOOKEEPER-2202.patch
          2 kB
          Raúl Gutiérrez Segalés


          People

            Assignee: Raúl Gutiérrez Segalés (rgs)
            Reporter: Raúl Gutiérrez Segalés (rgs)
