ZooKeeper / ZOOKEEPER-1840

Server tries to connect to itself during dynamic reconfig

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: quorum
    • Labels:
      None

      Description

      Submitted this bug at the suggestion of Alexander Shraer (see https://issues.apache.org/jira/browse/ZOOKEEPER-1691)

      How to reproduce:

      == Server 1 zoo.cfg:
      standaloneEnabled=false
      dynamicConfigFile=<path to>/confdyn1/zoo.cfg.dynamic

      == Server 1 zoo.cfg.dynamic:
      server.1=localhost:2888:3888:participant;localhost:2181

      == Server 2 zoo.cfg:
      standaloneEnabled=false
      dynamicConfigFile=<path to>/confdyn2/zoo.cfg.dynamic

      == Server 2 zoo.cfg.dynamic (it is "aware" of server 1, as described in the Dynamic Reconfiguration User Manual,
      which I should have read more carefully yesterday):
      server.1=localhost:2888:3888:participant;localhost:2181
      server.2=localhost:2889:3889:participant;localhost:2182

      Start server 1
      Start server 2

      == use client 1 to issue a reconfig command on server 1:
      [zk: localhost:2181(CONNECTED) 1] reconfig -add server.2=localhost:2889:3889:participant;localhost:2182
      Committed new configuration:
      server.1=localhost:2888:3888:participant;localhost:2181
      server.2=localhost:2889:3889:participant;localhost:2182
      version=100000003
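
Each dynamic configuration entry used above has the form server.<id>=<host>:<quorum port>:<election port>[:<role>];[<client host>:]<client port>. The following is a minimal Java sketch of how such a line breaks down. The ServerSpec class is hypothetical (it is not ZooKeeper's own parser), it assumes a missing role means "participant", and it does not handle IPv6 literals.

```java
/** Hypothetical holder for one dynamic-config entry; not a ZooKeeper class. */
final class ServerSpec {
    final long id;
    final String host;
    final int quorumPort;
    final int electionPort;
    final String role;
    final String clientAddress;

    private ServerSpec(long id, String host, int quorumPort, int electionPort,
                       String role, String clientAddress) {
        this.id = id;
        this.host = host;
        this.quorumPort = quorumPort;
        this.electionPort = electionPort;
        this.role = role;
        this.clientAddress = clientAddress;
    }

    /** Parses a line such as server.2=localhost:2889:3889:participant;localhost:2182 */
    static ServerSpec parse(String line) {
        int eq = line.indexOf('=');
        if (!line.startsWith("server.") || eq < 0) {
            throw new IllegalArgumentException("not a server entry: " + line);
        }
        long id = Long.parseLong(line.substring("server.".length(), eq).trim());
        // Left of ';' is the quorum-side spec, right of it is the client address.
        String[] halves = line.substring(eq + 1).trim().split(";", 2);
        String[] parts = halves[0].split(":");
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected host:port:port in " + line);
        }
        // Assumption: when the role is omitted, treat the server as a participant.
        String role = parts.length > 3 ? parts[3] : "participant";
        String client = halves.length > 1 ? halves[1] : "";
        return new ServerSpec(id, parts[0], Integer.parseInt(parts[1]),
                Integer.parseInt(parts[2]), role, client);
    }

    public static void main(String[] args) {
        ServerSpec s = parse("server.2=localhost:2889:3889:participant;localhost:2182");
        System.out.println("server " + s.id + " uses " + s.host + ":" + s.electionPort
                + " for leader election");
    }
}
```

For server.2 the third field (3889) is the election port, which is the address that appears in the "Cannot open channel to 2 at election address" warnings in the logs.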

      There are strange stack traces in both server consoles.

      Server 1:
      2013-12-12 22:31:40,888 [myid:1] - WARN [ProcessThread(sid:1 cport:-1)::QuorumCnxManager@390] - Cannot open channel to 2 at election address localhost/127.0.0.1:3889
      java.net.ConnectException: Connection refused: connect
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
      at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
      at java.net.Socket.connect(Socket.java:529)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:375)
      at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1252)
      at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1272)
      at org.apache.zookeeper.server.quorum.Leader.propose(Leader.java:1071)
      at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.processRequest(ProposalRequestProcessor.java:78)
      at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:864)
      at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:144)
      2013-12-12 22:31:41,919 [myid:1] - WARN [LearnerHandler-/127.0.0.1:52301:QuorumPeer@1259] - Restarting Leader Election
      2013-12-12 22:31:41,920 [myid:1] - INFO [localhost/127.0.0.1:3888:QuorumCnxManager$Listener@571] - Leaving listener
      2013-12-12 22:31:41,920 [myid:1] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@544] - My election bind port: localhost/127.0.0.1:3888
      2013-12-12 22:31:44,438 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@410] - WorkerReceiver is down
      2013-12-12 22:31:44,439 [myid:1] - INFO [WorkerSender[myid=1]:FastLeaderElection$Messenger$WorkerSender@442] - WorkerSender is down

      Server 2:
      2013-12-12 22:31:41,894 [myid:2] - WARN [QuorumPeer[myid=2]/127.0.0.1:2182:QuorumCnxManager@390] - Cannot open channel to 2 at election address localhost/127.0.0.1:3889
      java.net.ConnectException: Connection refused: connect
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
      at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
      at java.net.Socket.connect(Socket.java:529)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:375)
      at org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1252)
      at org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1272)
      at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:131)
      at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:89)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:967)
      2013-12-12 22:31:41,923 [myid:2] - WARN [QuorumPeer[myid=2]/127.0.0.1:2182:QuorumPeer@1259] - Restarting Leader Election
      2013-12-12 22:31:41,924 [myid:2] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@544] - My election bind port: localhost/127.0.0.1:3889

      1. ZOOKEEPER-1840.patch
        0.8 kB
        Alexander Shraer

        Activity

        Alexander Shraer added a comment -

        This issue is not about reconfig, so I'm renaming the JIRA. It's about whether or not to have an explicit connection from a server to itself in QuorumCnxManager.java.

        It looks like in FLE a server skips sending messages to itself by immediately putting the message in the receiving queue, so although a connection to itself exists, it will not be used.

        But if I understand correctly, a leader will explicitly send operations (proposals) to itself over an open connection. When a message is received, it passes through several layers of request processors, so it's not clear to me at which layer we'd want to enqueue a packet from the leader to itself if we want to avoid explicitly sending it. Alternatively, we could just explicitly call Leader.processAck(myself) before sending the packet to others. Flavio Junqueira, Benjamin Reed, others, what do you think? Is this something we want to fix?
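
The FLE behaviour described in this comment (a server delivering a message addressed to itself straight into its receive queue instead of writing it to a socket) can be sketched roughly as follows. This is an illustration only: SelfLoopbackSender and its fields are hypothetical names, and a list stands in for the real network connections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a sender that short-circuits messages to itself. */
final class SelfLoopbackSender {
    private final long myId;
    // Messages that the local election logic will read.
    final BlockingQueue<String> recvQueue = new LinkedBlockingQueue<>();
    // Stand-in for socket writes to remote peers.
    final List<String> networkLog = new ArrayList<>();

    SelfLoopbackSender(long myId) {
        this.myId = myId;
    }

    void send(long targetId, String message) {
        if (targetId == myId) {
            // Deliver locally: no connection to ourselves is needed.
            recvQueue.add(message);
        } else {
            // A real implementation would open or reuse a connection here.
            networkLog.add(targetId + ":" + message);
        }
    }

    public static void main(String[] args) {
        SelfLoopbackSender sender = new SelfLoopbackSender(2);
        sender.send(2, "vote-for-2"); // handled in-process
        sender.send(1, "vote-for-2"); // would go over the network
        System.out.println("local=" + sender.recvQueue + " network=" + sender.networkLog);
    }
}
```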

        Michi Mutsuzaki added a comment -

        I'm a bit confused by the summary. Aren't both server 1 and 2 trying to connect to server 2?

        Alexander Shraer added a comment -

        I just noticed that the specific error in the description is due to something else - when we reconfigure, we restart leader election, which involves shutting down QCM. I think this is why 1 can't connect to 2 momentarily. On the other hand 2 tries to connect to 1 from connectNewPeers, and the attached patch prevents this.

        I'm still not sure whether QCM connects to itself by default when it's created.

        Alexander Shraer added a comment -

        sorry I meant "2 tries to connect to 2 from connectNewPeers"
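
The kind of guard discussed here (a server that learns of a new configuration connects to newly added peers but never to itself) might look like the sketch below. This is not the attached patch, whose contents are not shown in this thread. NewPeerConnector and peersToConnect are hypothetical names, and the sketch assumes that "new peers" means members present in the new configuration but absent from the old one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not the code from ZOOKEEPER-1840.patch. */
final class NewPeerConnector {

    /** Returns the ids of servers that should be contacted after a reconfig. */
    static List<Long> peersToConnect(long myId, Set<Long> oldMembers, Set<Long> newMembers) {
        List<Long> targets = new ArrayList<>();
        for (Long id : newMembers) {
            if (id.longValue() == myId) {
                continue; // never open a connection to ourselves
            }
            if (!oldMembers.contains(id)) {
                targets.add(id); // only peers that were just added
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        Set<Long> oldMembers = new LinkedHashSet<>(Arrays.asList(1L));
        Set<Long> newMembers = new LinkedHashSet<>(Arrays.asList(1L, 2L));
        System.out.println("server 1 connects to " + peersToConnect(1L, oldMembers, newMembers));
        System.out.println("server 2 connects to " + peersToConnect(2L, oldMembers, newMembers));
    }
}
```

Without the self check, server 2 would see itself as a newly added member and attempt the 2-to-2 connection reported in the logs.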

        Michi Mutsuzaki added a comment -

        QuorumCnxManager doesn't connect to itself by default.

        https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L322

        Alexander Shraer added a comment -

        I found this line, but it's used by FLE only, not for leader -> follower. What about connections for sending the operation proposals?

        Michi Mutsuzaki added a comment -

        If I understand it correctly, followers/observers connect to the leader by calling connectToLeader(). The leader accepts connections and initializes LearnerHandler objects. It doesn't use QuorumCnxManager, and the leader doesn't connect to itself.

        https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java#L390

        Alexander Shraer added a comment -

        Thanks, you're right. I guess it's too late here.
        I found how this works now: ProposalRequestProcessor on the leader invokes propose(), which sends messages to all learners, and then ProposalRequestProcessor hands the request over to AckRequestProcessor, which creates an ACK from the leader itself. Anyway, I changed the title back to what it was. I tried the patch, and it looks like it solves the issue: the message about the connection failure from 2 to 2 no longer appears.
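
The flow described in this comment (the proposal goes out to the learners, and the leader's own ACK is then produced locally rather than over a connection to itself) can be modelled with the small sketch below. LeaderAckSketch is a hypothetical class, not ZooKeeper's Leader; its processAck method only borrows the name mentioned earlier in this thread, and it assumes a simple majority quorum over the leader plus its learners.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a leader that acknowledges its own proposals locally. */
final class LeaderAckSketch {
    private final long leaderId;
    private final List<Long> learners;
    private final int ensembleSize;
    private final Set<Long> acks = new HashSet<>();
    // Stand-in for packets written to learner connections.
    final List<String> sent = new ArrayList<>();

    LeaderAckSketch(long leaderId, List<Long> learners) {
        this.leaderId = leaderId;
        this.learners = learners;
        this.ensembleSize = learners.size() + 1; // learners plus the leader
    }

    void propose(String op) {
        // Step 1: send the proposal to every learner (never to the leader itself).
        for (Long learner : learners) {
            sent.add("PROPOSAL(" + op + ")->" + learner);
        }
        // Step 2: record the leader's own ACK in-process.
        processAck(leaderId);
    }

    void processAck(long serverId) {
        acks.add(serverId);
    }

    /** Assumption: a proposal commits once a strict majority has acknowledged it. */
    boolean committed() {
        return acks.size() > ensembleSize / 2;
    }

    public static void main(String[] args) {
        LeaderAckSketch leader = new LeaderAckSketch(1L, Arrays.asList(2L));
        leader.propose("reconfig");
        System.out.println("after leader ack only: committed=" + leader.committed());
        leader.processAck(2L);
        System.out.println("after follower ack: committed=" + leader.committed());
    }
}
```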

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12640396/ZOOKEEPER-1840.patch
        against trunk revision 1587812.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2045//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2045//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2045//console

        This message is automatically generated.

        Michi Mutsuzaki added a comment -

        +1 Thanks Alex!

        Michi Mutsuzaki added a comment -

        trunk: http://svn.apache.org/viewvc?view=revision&revision=1587818

        Hudson added a comment -

        FAILURE: Integrated in ZooKeeper-trunk #2292 (See https://builds.apache.org/job/ZooKeeper-trunk/2292/)
        ZOOKEEPER-1840. Server tries to connect to itself during dynamic reconfig (Alexander Shraer via michim) (michim: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587818)

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java

          People

          • Assignee: Alexander Shraer
          • Reporter: Bruno Freudensprung
          • Votes: 0
          • Watchers: 5
