  1. ZooKeeper
  2. ZOOKEEPER-4316

Leader election fails due to SocketTimeoutException in QuorumCnxManager


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.4.12, 3.5.7
    • Fix Version/s: None
    • Component/s: quorum
    • Labels: None

    Description

      I have a three-node ZooKeeper cluster deployed as a stack using Docker Swarm.
      Deploying this stack causes ZooKeeper to fail with a SocketTimeoutException during leader election, with the following log:

       

      2021-06-11 03:59:34,607 [myid:2] - WARN  [QuorumPeer[myid=2]/0.0.0.0:2181:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo3/10.0.11.5:3888
      java.net.SocketTimeoutException: connect timed out
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
              at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
              at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
              at java.net.Socket.connect(Socket.java:589)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:610)
              at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:838)
              at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:957)

      The Docker overlay network itself appears to be sound. Running netstat on one of the nodes outputs:

      bash-4.4# netstat -tuln
      Active Internet connections (only servers)
      Proto Recv-Q Send-Q Local Address           Foreign Address         State
      tcp        0      0 0.0.0.0:2181            0.0.0.0:*               LISTEN
      tcp        0      0 0.0.0.0:3888            0.0.0.0:*               LISTEN
      tcp        0      0 0.0.0.0:42941           0.0.0.0:*               LISTEN
      tcp        0      0 127.0.0.11:35453        0.0.0.0:*               LISTEN
      udp        0      0 127.0.0.11:55009        0.0.0.0:*

      This shows that port 3888 is open, but a tcpdump shows only sends and retransmissions, with no responses on port 3888.
      Suspecting that the issue may be due to a short timeout or a small number of retries, I tried increasing cnxTimeout to 300000 and setting electionPortBindRetry to 0 (infinite), but even after 13 hours of continuous running and election retries the same error persists.
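
For anyone trying to reproduce the symptom outside ZooKeeper, the following is a minimal sketch of a TCP probe that could be run from inside each container. It is not part of ZooKeeper; the function name `probe` and the host names `zoo1` and `zoo2` are assumptions for illustration (only `zoo3` and port 3888 appear in the log above). A "timeout" result against a port that netstat reports as listening would suggest packets are being dropped on the path, which matches the retransmissions seen in tcpdump, whereas "refused" would point to nothing listening on that address.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connect to host:port and classify the outcome."""
    try:
        # create_connection resolves the name and connects with a deadline.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No SYN-ACK arrived within the deadline (packets likely dropped).
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with RST: host reachable, nothing listening.
        return "refused"
    except OSError as exc:
        # Covers name-resolution failures and unreachable networks.
        return f"error: {exc}"


if __name__ == "__main__":
    # Host names are assumed; replace them with the service names in the stack file.
    for host in ("zoo1", "zoo2", "zoo3"):
        print(host, probe(host, 3888))
```

Running the probe from every node toward every other node gives a small connectivity matrix, which may help tell a single bad overlay route apart from a cluster-wide problem.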

      I have attached the stack.yml, the custom docker-entrypoint.sh that we override on top of the official container to enable running from a root host user, and the zoo.cfg file from inside the container.

      Any help in identifying the underlying issue or misconfiguration, or any configuration parameter that may help solve it, would be deeply appreciated.

       

      Attachments

        1. docker-entrypoint.sh
          2 kB
          Arun Subramanian R
        2. zoo_3.5.7.yml
          3 kB
          Arun Subramanian R
        3. zoo.cfg
          0.3 kB
          Arun Subramanian R


          People

            Assignee: Unassigned
            Reporter: Arun Subramanian R (asubramanianr)