ZooKeeper / ZOOKEEPER-3991

QuorumCnxManager Listener port bind retry does not retry DNS lookup


Details

    • Patch

    Description

      We run ZooKeeper in a container environment where DNS is not stable. As recommended by the documentation, we set electionPortBindRetry to 0, which makes it retry forever.

      On some instances, we get the following exception in an infinite loop, even though the address has since become resolvable:

       

      zk-2_1  | 2020-11-03 10:57:08,407 [myid:3] - ERROR [ListenerHandler-zk-2.test:3888:QuorumCnxManager$Listener$ListenerHandler@1093] - Exception while listening
      zk-2_1  | java.net.SocketException: Unresolved address
      zk-2_1  | 	at java.base/java.net.ServerSocket.bind(Unknown Source)
      zk-2_1  | 	at java.base/java.net.ServerSocket.bind(Unknown Source)
      zk-2_1  | 	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.createNewServerSocket(QuorumCnxManager.java:1140)
      zk-2_1  | 	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.acceptConnections(QuorumCnxManager.java:1064)
      zk-2_1  | 	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener$ListenerHandler.run(QuorumCnxManager.java:1033)
      zk-2_1  | 	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      zk-2_1  | 	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
      zk-2_1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      zk-2_1  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      zk-2_1  | 	at java.base/java.lang.Thread.run(Unknown Source)

      ZooKeeper does not actually retry the DNS resolution; it keeps reusing the original failed result.

       

      This happens because the InetSocketAddress is created only once, and the DNS lookup happens at the moment it is created. Every later bind attempt reuses that same unresolved object.
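
A minimal sketch of this behaviour, using only the JDK and no ZooKeeper code. The class and method names are hypothetical, and `InetSocketAddress.createUnresolved` stands in for a lookup that failed at construction time. Each bind attempt reuses the same object, so no new lookup is ever made:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class; not part of ZooKeeper.
class UnresolvedBindDemo {

    // Tries to bind a fresh ServerSocket to the given address.
    // Returns the exception message on failure, or null if the bind succeeded.
    static String tryBind(InetSocketAddress address) {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(address);
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumption: this simulates an address whose DNS lookup failed when it was created.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("zk-2.test", 3888);

        // Retrying the bind does not retry the lookup: the object stays unresolved.
        for (int attempt = 1; attempt <= 3; attempt++) {
            System.out.println("attempt " + attempt
                    + ": unresolved=" + stale.isUnresolved()
                    + ", bind error=" + tryBind(stale));
        }
    }
}
```

An InetSocketAddress is immutable, so the only way to pick up a DNS record that appeared later is to construct a new instance.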

      This issue has come up previously in https://issues.apache.org/jira/browse/ZOOKEEPER-1506 but it appears to still happen here.

      I have attached a repro.tar.gz to help reproduce this issue. Steps:

      • Untar repro.tar.gz
      • Run docker-compose up
      • Observe that the exception keeps occurring for zk-2 but not for the other nodes
      • Open db.test, uncomment the zk-2 line, increment the serial, and save
      • Wait a few seconds for the DNS to refresh
      • Verify that zk-2.test now resolves (dig @172.16.60.2 zk-2.test) while the error keeps appearing

      I have also attached a patch that resolves this. Each time ZooKeeper tries to create the server socket, the patch retries DNS resolution if the address is still unresolved.
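
The idea behind the patch could be sketched roughly as follows. This is not the attached patch itself: `ReResolveSketch` and `resolveIfNeeded` are hypothetical names, and the sketch assumes the listener can call such a helper immediately before each bind attempt.

```java
import java.net.InetSocketAddress;

// Hypothetical helper class; not ZooKeeper's actual code and not the attached patch.
class ReResolveSketch {

    // Returns the address unchanged if it is already resolved. Otherwise builds a
    // new InetSocketAddress from the same host and port, which triggers a fresh lookup.
    static InetSocketAddress resolveIfNeeded(InetSocketAddress address) {
        if (!address.isUnresolved()) {
            return address;
        }
        // The (String, int) constructor attempts resolution; if the lookup fails
        // again, the result is simply another unresolved address.
        return new InetSocketAddress(address.getHostString(), address.getPort());
    }

    public static void main(String[] args) {
        // An IP literal is used so that the example does not depend on a DNS server.
        InetSocketAddress stale = InetSocketAddress.createUnresolved("127.0.0.1", 3888);
        InetSocketAddress fresh = resolveIfNeeded(stale);
        System.out.println("before: unresolved=" + stale.isUnresolved());
        System.out.println("after:  unresolved=" + fresh.isUnresolved());
    }
}
```

Calling a helper like this before every bind attempt keeps the retry loop cheap when the address is already resolved. Note that the JVM caches failed lookups for a short time (the `networkaddress.cache.negative.ttl` security property), so a newly published record may take a few retries to become visible.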

       

      Attachments

        1. repro.tar.gz
          0.7 kB
          Lander Visterin

          People

            Unassigned
            Lander Visterin (lander.visterin)
            Votes: 0
            Watchers: 4

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 2h 40m
