ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-900

FLE implementation should be improved to use non-blocking sockets

    Details

    • Type: Bug Bug
    • Status: Reopened
    • Priority: Critical Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.5.0
    • Component/s: None
    • Labels:
      None

      Description

      From earlier email exchanges:
      1. Blocking connects and accepts:

      a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect. While testing, I changed the code so that connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne starts a timer. connectOne continues with normal operations if the connection is established before the timer expires, otherwise, when the timer expires it interrupts AsyncConnect() thread and returns. In this way, I can have an upper bound on the amount of time we need to wait for connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use Selector to do non-blocking connects/accepts. I am planning to do that later once we at least have a quick fix for the problem and consensus from others for the real fix (this problem is big blocker for us). Note that it is OK to do blocking IO in SenderWorker and RecvWorker threads since they block IO to the respective peer.

      b) The blocking IO problem is not just restricted to connectOne(), but also in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection does blocking IO to get peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that had sent the connection request. All of this is happening from the Listener. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connetOne(). Also the code has an inherent cycle. initiateConnection() and receiveConnection() will have to be very carefully synchronized otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.

      Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822

      1. ZOOKEEPER-900.patch2
        25 kB
        Vishal Kher
      2. ZOOKEEPER-900.patch1
        34 kB
        Vishal Kher
      3. ZOOKEEPER-900.patch
        26 kB
        Vishal Kher

        Issue Links

          Activity

          Hide
          Germán Blanco added a comment -

          I think that this should take the direction of allowing the option to use Netty also in leader election, together with making connections asynchronous. This is mentioned by Patrick Hunt in ZOOKEEPER-901.
          Should we use that one or a brand new JIRA?

          Show
          Germán Blanco added a comment - I think that this should take the direction of allowing the option to use Netty also in leader election, together with making connections asynchronous. This is mentioned by Patrick Hunt in ZOOKEEPER-901 . Should we use that one or a brand new JIRA?
          Hide
          Mahadev konar added a comment -

          Patrick Hunt I htink we can close this one in favor of another jira.

          Show
          Mahadev konar added a comment - Patrick Hunt I htink we can close this one in favor of another jira.
          Hide
          Patrick Hunt added a comment -

          Vishal Kher & Mahadev konar what's the status on this? Should we close in preference to another jira? (as Mahadev suggested)

          Show
          Patrick Hunt added a comment - Vishal Kher & Mahadev konar what's the status on this? Should we close in preference to another jira? (as Mahadev suggested)
          Hide
          Germán Blanco added a comment -

          Avoiding locking in the connect solves also ZOOKEEPER-1678

          Show
          Germán Blanco added a comment - Avoiding locking in the connect solves also ZOOKEEPER-1678
          Hide
          Mahadev konar added a comment -

          Vishal, please use different jira's for subtasks. Reopening the same jira causes a lot of confusion. I am moving this out to 3.4.1 for now.

          Show
          Mahadev konar added a comment - Vishal, please use different jira's for subtasks. Reopening the same jira causes a lot of confusion. I am moving this out to 3.4.1 for now.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1010 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1010/)

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1010 (See https://hudson.apache.org/hudson/job/ZooKeeper-trunk/1010/ )
          Hide
          Patrick Hunt added a comment -

          I see some great information about how the code/algos operate being detailed in these jiras. I highly encourage you guys to document this stuff in either the code or in a separate document available on the wiki/forrest (now mvn site, whatever). It's critical
          that we provide more details like this to our devs.

          See ZOOKEEPER-918 as a great example of what I'm talking about. (although adding more comments to the code is fine too).

          Basically, if you find yourself describing something in a jira that's not documented already, consider documenting it. Thanks.

          Show
          Patrick Hunt added a comment - I see some great information about how the code/algos operate being detailed in these jiras. I highly encourage you guys to document this stuff in either the code or in a separate document available on the wiki/forrest (now mvn site, whatever). It's critical that we provide more details like this to our devs. See ZOOKEEPER-918 as a great example of what I'm talking about. (although adding more comments to the code is fine too). Basically, if you find yourself describing something in a jira that's not documented already, consider documenting it. Thanks.
          Hide
          Vishal Kher added a comment -

          The committed patch fixed one part of the problem (enforced timeout on network IO).

          Reopening for remaining subtasks.

          Show
          Vishal Kher added a comment - The committed patch fixed one part of the problem (enforced timeout on network IO). Reopening for remaining subtasks.
          Hide
          Flavio Junqueira added a comment -

          Ok, there might have been a confusion. I've seen the patch available flag up and I interpreted it as ready to commit (after review, of course). If you still think there is work to be done on this jira, Vishal, please consider reopening it and creating sub-tasks. From your comments, I can extract at least 3 possible tasks.

          Once you create sub-tasks (or new independent jiras), I will comment on your questions. I'd rather do that so that we don't mix up the discussion. Is that ok?

          Show
          Flavio Junqueira added a comment - Ok, there might have been a confusion. I've seen the patch available flag up and I interpreted it as ready to commit (after review, of course). If you still think there is work to be done on this jira, Vishal, please consider reopening it and creating sub-tasks. From your comments, I can extract at least 3 possible tasks. Once you create sub-tasks (or new independent jiras), I will comment on your questions. I'd rather do that so that we don't mix up the discussion. Is that ok?
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          Thanks. I will take shutting down of worker threads in a separate
          jira.

          I am cleaning up QCM further to incorporate the change mentioned in my
          comment on 02/Nov/10 02:09 PM.

          I have a few more questions:

          1. I have a question about the following piece of code in QCM:

          if (remoteSid == QuorumPeer.OBSERVER_ID)

          { /* * Choose identifier at random. We need a value to identify * the connection. */ remoteSid = observerCounter--; LOG.info("Setting arbitrary identifier to observer: " + remoteSid); }

          Should we allow this? The problem with this code is that if a peer
          connects twice with QuorumPeer.OBSERVER_ID, we will end up creating
          threads for this peer twice. This could result in redundant
          SendWorker/RecvWorker threads.

          I haven't used observers yet. The documentation
          http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperObservers.html
          says that just like followers, observers should have server IDs. In
          which case, why do we want to provide a wild-card?

          2. Should I add a check to reject connections from peers that are not
          listed in the configuration file? Currently, we are not doing any
          sanity check for server IDs. I think this might fix ZOOKEEPER-851.
          The fix is simple. However, I am not sure if anyone in community
          is relying on this ability.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, Thanks. I will take shutting down of worker threads in a separate jira. I am cleaning up QCM further to incorporate the change mentioned in my comment on 02/Nov/10 02:09 PM. I have a few more questions: 1. I have a question about the following piece of code in QCM: if (remoteSid == QuorumPeer.OBSERVER_ID) { /* * Choose identifier at random. We need a value to identify * the connection. */ remoteSid = observerCounter--; LOG.info("Setting arbitrary identifier to observer: " + remoteSid); } Should we allow this? The problem with this code is that if a peer connects twice with QuorumPeer.OBSERVER_ID, we will end up creating threads for this peer twice. This could result in redundant SendWorker/RecvWorker threads. I haven't used observers yet. The documentation http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperObservers.html says that just like followers, observers should have server IDs. In which case, why do we want to provide a wild-card? 2. Should I add a check to reject connections from peers that are not listed in the configuration file? Currently, we are not doing any sanity check for server IDs. I think this might fix ZOOKEEPER-851 . The fix is simple. However, I am not sure if anyone in community is relying on this ability. -Vishal
          Hide
          Flavio Junqueira added a comment -

          Committed revision 1036071.

          Show
          Flavio Junqueira added a comment - Committed revision 1036071.
          Hide
          Flavio Junqueira added a comment -

          +1, Great job, Vishal! On your question, the problem is that it is not easy to decide when a peer can close its connections because it doesn't know in which state others are and it might need to receive and respond to notifications. In any case, if have an idea for how to do it and want to discuss it further, we could create a new jira and work there, since this is a separate issue.

          Show
          Flavio Junqueira added a comment - +1, Great job, Vishal! On your question, the problem is that it is not easy to decide when a peer can close its connections because it doesn't know in which state others are and it might need to receive and respond to notifications. In any case, if have an idea for how to do it and want to discuss it further, we could create a new jira and work there, since this is a separate issue.
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          Currently the QCM keeps the TCP connections to other peers alive even after finishing leader election. How about we shutdown these threads after a node decides to be a leader or a follower?The threads get created on fly. Do you see any problem with that?

          Thanks.
          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, Currently the QCM keeps the TCP connections to other peers alive even after finishing leader election. How about we shutdown these threads after a node decides to be a leader or a follower?The threads get created on fly. Do you see any problem with that? Thanks. -Vishal
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12459714/ZOOKEEPER-900.patch
          against trunk revision 1034003.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12459714/ZOOKEEPER-900.patch against trunk revision 1034003. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/32//console This message is automatically generated.
          Hide
          Vishal Kher added a comment -

          patch submitted.

          Show
          Vishal Kher added a comment - patch submitted.
          Hide
          Vishal Kher added a comment -

          Hi Flavio, Pat,

          I have attached the patch with suggested modifications. I did some more testing and also ran SimpleSysTest.

          Other than the changes suggested above and some cosmetic changes to error messages, I have added a delay on 1 second in Listener.run() in case we see an IOException.

          Let me know if you have more comments.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, Pat, I have attached the patch with suggested modifications. I did some more testing and also ran SimpleSysTest. Other than the changes suggested above and some cosmetic changes to error messages, I have added a delay on 1 second in Listener.run() in case we see an IOException. Let me know if you have more comments. -Vishal
          Hide
          Flavio Junqueira added a comment -

          If we fix the findbugs issue here, then we should just close ZOOKEEPER-902 stating that it was resolved in ZOOKEEPER-900.

          Show
          Flavio Junqueira added a comment - If we fix the findbugs issue here, then we should just close ZOOKEEPER-902 stating that it was resolved in ZOOKEEPER-900 .
          Hide
          Vishal Kher added a comment -

          Hi Flavio, Pat,

          I will submit a new patch with suggested changes. For 902 do I just add a file ./java/test/bin/test-patch.properties with OK_FINDBUGS_WARNINGS=0 in it?

          Show
          Vishal Kher added a comment - Hi Flavio, Pat, I will submit a new patch with suggested changes. For 902 do I just add a file ./java/test/bin/test-patch.properties with OK_FINDBUGS_WARNINGS=0 in it?
          Hide
          Patrick Hunt added a comment -

          I'd appreciate if you could fix the findbugs, that would be great. See also ZOOKEEPER-902 – as part of the fix set the findbugs acceptable back to 0. Thanks!

          Show
          Patrick Hunt added a comment - I'd appreciate if you could fix the findbugs, that would be great. See also ZOOKEEPER-902 – as part of the fix set the findbugs acceptable back to 0. Thanks!
          Hide
          Flavio Junqueira added a comment -

          Hi Vishal, Good job, thanks! The patch is pretty much good for me. Just a few points:

          1. Findbugs complained about the fact that we are not checking if sock is null in line 674. It could be if the previous catch block is executed. I was actually thinking that there should be a single try block followed by two catch blocks, no?
          2. You may also consider fixing the other two issues Findbugs is complaining about. The statement declaring msgLength should be removed. It was probably there for debugging purposes;
          3. From the patch, it sounds like the formatting for some of the log statements got messed up. I would appreciate if you could fix those. I've seen just a couple of them.
          Show
          Flavio Junqueira added a comment - Hi Vishal, Good job, thanks! The patch is pretty much good for me. Just a few points: Findbugs complained about the fact that we are not checking if sock is null in line 674. It could be if the previous catch block is executed. I was actually thinking that there should be a single try block followed by two catch blocks, no? You may also consider fixing the other two issues Findbugs is complaining about. The statement declaring msgLength should be removed. It was probably there for debugging purposes; From the patch, it sounds like the formatting for some of the log statements got messed up. I would appreciate if you could fix those. I've seen just a couple of them.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12459485/ZOOKEEPER-900.patch2
          against trunk revision 1034003.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 2 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12459485/ZOOKEEPER-900.patch2 against trunk revision 1034003. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 javadoc. The javadoc tool appears to have generated 1 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 2 new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/26//console This message is automatically generated.
          Hide
          Vishal Kher added a comment -

          submitted patch ZOOKEEPER-900.patch1

          Show
          Vishal Kher added a comment - submitted patch ZOOKEEPER-900 .patch1
          Hide
          Patrick Hunt added a comment -

          fyi, if a patch is ready for review/commit then click the "submit patch" link – will trigger the workflow.
          Also if you use the same patch name (ZOOKEEPER-###.patch) and re-attach with the same name jira will handle this correctly, more detail here:
          http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
          thanks!

          Show
          Patrick Hunt added a comment - fyi, if a patch is ready for review/commit then click the "submit patch" link – will trigger the workflow. Also if you use the same patch name (ZOOKEEPER-###.patch) and re-attach with the same name jira will handle this correctly, more detail here: http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute thanks!
          Hide
          Vishal Kher added a comment -

          Attaching the patch without formatting changes.

          Show
          Vishal Kher added a comment - Attaching the patch without formatting changes.
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          Regarding your comment:
          I was reasoning about the code. I had not tried to reproduce the problem when I posted the question.

          I tried a simple test and I am not able to reproduce the problem on Suse. I closed the connection after writing the server ID but before the receiving server issued a read. The receiver was able to read the ID and on the following read it got a socket closed exception.

          I am not sure if all TCP implementations would behave this way.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, Regarding your comment: I was reasoning about the code. I had not tried to reproduce the problem when I posted the question. I tried a simple test and I am not able to reproduce the problem on Suse. I closed the connection after writing the server ID but before the receiving server issued a read. The receiver was able to read the ID and on the following read it got a socket closed exception. I am not sure if all TCP implementations would behave this way. -Vishal
          Hide
          Vishal Kher added a comment -

          Diff of log4j file was included by mistake. I will post a patch without formatting changes.

          Show
          Vishal Kher added a comment - Diff of log4j file was included by mistake. I will post a patch without formatting changes.
          Hide
          Patrick Hunt added a comment -

          Looking at the patch. Quite a bit changed, hard to tell which is important and which not. In these situations I've used the -w diff trick to get just the important changes, then applied that patch to virgin code, opened the file in eclipse and fixed the (relatively) smaller set of formatting issues.

          Also, the patch includes log4j.properties change, you don't want to include that I'm thinking.

          Show
          Patrick Hunt added a comment - Looking at the patch. Quite a bit changed, hard to tell which is important and which not. In these situations I've used the -w diff trick to get just the important changes, then applied that patch to virgin code, opened the file in eclipse and fixed the (relatively) smaller set of formatting issues. Also, the patch includes log4j.properties change, you don't want to include that I'm thinking.
          Hide
          Vishal Kher added a comment -

          ok. how about making an exception for formatting for this patch? I would have to spend some time reapplying the changes (which I would like to avoid ).

          Show
          Vishal Kher added a comment - ok. how about making an exception for formatting for this patch? I would have to spend some time reapplying the changes (which I would like to avoid ).
          Hide
          Patrick Hunt added a comment -

          please try to keep the reformatting changes to a minimum unless it's code directly being worked on. otw it makes it harder to review (svn -x -w diff does help, but still) and "blame" detail is lost.

          Show
          Patrick Hunt added a comment - please try to keep the reformatting changes to a minimum unless it's code directly being worked on. otw it makes it harder to review (svn -x -w diff does help, but still) and "blame" detail is lost.
          Hide
          Vishal Kher added a comment -

          There are two enhancements that I am working on for QuorumCnxManager.

          1. QCM uses blocking IO for communicating with other peers. It does
          not set a timeout for network read/write operations. SO_TIMEOUT does
          not work with SocketChannel.

          2. Incoming requests are processed one at a time. As a result, if QCM is
          processing a connection from a peer and that peer fails, then requests
          from other peers won't be processed. Even if we add timeout to
          read/write calls, other peers will be blocked for that amount of
          time. I had proposed a change in my earlier post for this part (see
          above). I am working on a fix.

          The attached patch addresses the first problem. Earlier, QCM used
          SocketChannels. Now it uses DataInputStream/DataOutputStream, which
          will blocki only until SO_TIMEOUT expires.

          There are also some formatting changes done automatically by my editor
          according to Java coding standards. So some of the changes are just
          cosmetic.

          I have tested this change by creating a 3 node cluster and rebooting
          leader/follower several times. The patch also includes a simple test.

          Please let me know your comments.

          Show
          Vishal Kher added a comment - There are two enhancements that I am working on for QuorumCnxManager. 1. QCM uses blocking IO for communicating with other peers. It does not set a timeout for network read/write operations. SO_TIMEOUT does not work with SocketChannel. 2. Incoming requests are processed one at a time. As a result, if QCM is processing a connection from a peer and that peer fails, then requests from other peers won't be processed. Even if we add timeout to read/write calls, other peers will be blocked for that amount of time. I had proposed a change in my earlier post for this part (see above). I am working on a fix. The attached patch addresses the first problem. Earlier, QCM used SocketChannels. Now it uses DataInputStream/DataOutputStream, which will blocki only until SO_TIMEOUT expires. There are also some formatting changes done automatically by my editor according to Java coding standards. So some of the changes are just cosmetic. I have tested this change by creating a 3 node cluster and rebooting leader/follower several times. The patch also includes a simple test. Please let me know your comments.
          Hide
          Flavio Junqueira added a comment -

          Sure, I can investigate a little further, and Vishal let us know if you find anything.

          Show
          Flavio Junqueira added a comment - Sure, I can investigate a little further, and Vishal let us know if you find anything.
          Hide
          Patrick Hunt added a comment -

          I don't know for this specific case, but the corners I've looked at (tearing down a connection) there have been issues. Perhaps they are issues on our side, I'm not certain, but I do know that we fail with this version of nc (default in ubuntu maverick) even after significant work was done to address the original problem:
          OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2)

          Let's assume what you say is correct – we'd want to test this carefully to assure ourselves.

          Show
          Patrick Hunt added a comment - I don't know for this specific case, but the corners I've looked at (tearing down a connection) there have been issues. Perhaps they are issues on our side, I'm not certain, but I do know that we fail with this version of nc (default in ubuntu maverick) even after significant work was done to address the original problem: OpenBSD netcat (Debian patchlevel 1.89-3ubuntu2) Let's assume what you say is correct – we'd want to test this carefully to assure ourselves.
          Hide
          Flavio Junqueira added a comment -

          That's a pretty strong statement. You're essentially suggesting that we shouldn't rely upon TCP to implement even its basic functionality. Also, my understanding is that Vishal is just reasoning about the code and he hasn't been able to reproduce that situation. Please correct me if I'm mistaken, Vishal.

          Show
          Flavio Junqueira added a comment - That's a pretty strong statement. You're essentially suggesting that we shouldn't rely upon TCP to implement even its basic functionality. Also, my understanding is that Vishal is just reasoning about the code and he hasn't been able to reproduce that situation. Please correct me if I'm mistaken, Vishal.
          Hide
          Patrick Hunt added a comment -

          Flavio, I'd be worried that different tcp stacks might (inter)operate differently in practice vs theory.

          In general it's pretty tough to get this right - look at all the problems we've been having with netcat behavior
          https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&query=netcat&summary=true&description=true&body=true&pid=12310801

          Ubuntu recently moved from "traditional" to the newish bsd flavor (supports ipv6 natively) of nc and we are back to having issues after having made significant changes in 3.3 to fix this (incl a number of tests that simulated the nc behavior as closely as we could understand it).

          Show
          Patrick Hunt added a comment - Flavio, I'd be worried that different tcp stacks might (inter)operate differently in practice vs theory. In general it's pretty tough to get this right - look at all the problems we've been having with netcat behavior https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&query=netcat&summary=true&description=true&body=true&pid=12310801 Ubuntu recently moved from "traditional" to the newish bsd flavor (supports ipv6 natively) of nc and we are back to having issues after having made significant changes in 3.3 to fix this (incl a number of tests that simulated the nc behavior as closely as we could understand it).
          Hide
          Flavio Junqueira added a comment -

          Hi Vishal, This is a good question. I'm actually assuming that the behavior of TCP is such that if I send a message and then close the channel properly (calling close()), due to the reliability and order guarantees of the connection, the message will get through before the connection closes. Essentially, I'm relying upon the TCP ACK to do exactly what you're proposing. However, it might be a good idea to make sure that the assumption is correct or if you know the answer already, just let me know. Overall I do agree that having an ACK is important.

          Show
          Flavio Junqueira added a comment - Hi Vishal, This is a good question. I'm actually assuming that the behavior of TCP is such that if I send a message and then close the channel properly (calling close()), due to the reliability and order guarantees of the connection, the message will get through before the connection closes. Essentially, I'm relying upon the TCP ACK to do exactly what you're proposing. However, it might be a good idea to make sure that the assumption is correct or if you know the answer already, just let me know. Overall I do agree that having an ACK is important.
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          I have a question regarding the logic that determines which connection
          to retain if peer 1 and peer 2 decide to communicate with each other.

          Suppose peer 1 connects to peer 2. It first sends its sid as a
          challenge. Peer 2 reads the sid and determines whether to keep the
          connection or initiate a connection back to peer 1. Both determine
          that peer 2 should be the one initiating the connection to peer 1
          since sid of peer2 > sid of peer1. I am concerned that they both
          may not be able to maintain any connection since the handshake is
          one-way.

          In the current implementation, peer1 disconnects immediately after
          writing the challenge to peer 2. It can happen that peer 2 may get a
          ClosedChannelException before it reads the challenge from peer 1. As a
          result, peer 2 will not initiate a connection to peer 1.

          Is this a legitimate problem? If it is, how about we ask peer 2 to
          send back a ACK after it reads the challenge. Peer 1 will do a timed
          read() after writing a challenge to peer 2. This will hopefully give
          peer 2 enough time to read the challenge and take appropriate
          action. If peer 2 is really slow, peer 1 will timeout on the read
          operation.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, I have a question regarding the logic that determines which connection to retain if peer 1 and peer 2 decide to communicate with each other. Suppose peer 1 connects to peer 2. It first sends its sid as a challenge. Peer 2 reads the sid and determines whether to keep the connection or initiate a connection back to peer 1. Both determine that peer 2 should be the one initiating the connection to peer 1 since sid of peer2 > sid of peer1. I am concerned that they both may not be able to maintain any connection since the handshake is one-way. In the current implementation, peer1 disconnects immediately after writing the challenge to peer 2. It can happen that peer 2 may get a ClosedChannelException before it reads the challenge from peer 1. As a result, peer 2 will not initiate a connection to peer 1. Is this a legitimate problem? If it is, how about we ask peer 2 to send back a ACK after it reads the challenge. Peer 1 will do a timed read() after writing a challenge to peer 2. This will hopefully give peer 2 enough time to read the challenge and take appropriate action. If peer 2 is really slow, peer 1 will timeout on the read operation. -Vishal
          Hide
          Vishal Kher added a comment -

          Hi Mahadev,

          Yes, I will provide a patch (for the approach proposed above). I am working on it. I will get in touch with Flavio if I have questions.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Mahadev, Yes, I will provide a patch (for the approach proposed above). I am working on it. I will get in touch with Flavio if I have questions. -Vishal
          Hide
          Mahadev konar added a comment -

          This is definitely a good step in cleaning up QuorumCnxnManager! I have updated the jira to mark it for 3.4 release. Vishal is that ok with you? Would you be able to provide a patch for the 3.4 release?

          Show
          Mahadev konar added a comment - This is definitely a good step in cleaning up QuorumCnxnManager! I have updated the jira to mark it for 3.4 release. Vishal is that ok with you? Would you be able to provide a patch for the 3.4 release?
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          Thanks for your feedback. I will do the code changes.

          For point 2 above, I was referring to the code that deletes the SenderWorker and ReceiveWorker pair after receiving a connect request. I was concerned that a peer might send frequent connect request before to the remote peer before the remote peer can initiate connection back. But I think the Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS); in lookForLeader will prevent this scenario. Also, this won't be a concern if we decide to remove the part that kills the pair for each connect.

          I am also thinking of adding a sanity check that will accept connections only from peers that are not listed in the zoo.cfg file or OBSERVER_ID.
          I have not used observes so far. Can you please explain why a node will use OBSERVER_ID instead of its sid? In particular, I am referring to the following code in QuorumCnxManager:
          // Read server id
          sid = Long.valueOf(msgBuffer.getLong());
          if(sid == QuorumPeer.OBSERVER_ID)

          { /* * Choose identifier at random. We need a value to identify * the connection. */ sid = observerCounter--; LOG.info("Setting arbitrary identifier to observer: " + sid); }
          Show
          Vishal Kher added a comment - Hi Flavio, Thanks for your feedback. I will do the code changes. For point 2 above, I was referring to the code that deletes the SenderWorker and ReceiveWorker pair after receiving a connect request. I was concerned that a peer might send frequent connect request before to the remote peer before the remote peer can initiate connection back. But I think the Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS); in lookForLeader will prevent this scenario. Also, this won't be a concern if we decide to remove the part that kills the pair for each connect. I am also thinking of adding a sanity check that will accept connections only from peers that are not listed in the zoo.cfg file or OBSERVER_ID. I have not used observes so far. Can you please explain why a node will use OBSERVER_ID instead of its sid? In particular, I am referring to the following code in QuorumCnxManager: // Read server id sid = Long.valueOf(msgBuffer.getLong()); if(sid == QuorumPeer.OBSERVER_ID) { /* * Choose identifier at random. We need a value to identify * the connection. */ sid = observerCounter--; LOG.info("Setting arbitrary identifier to observer: " + sid); }
          Hide
          Flavio Junqueira added a comment -

          Hi Vishal, I like your proposal, it seems reasonable and not difficult to implement.

          On your questions:

          1. I don't think it is necessary to kill a pair SenderWorker/RecvWorker every time, and I'd certainly support changing it;
          2. I'm not sure where you're suggesting to introduce a delay. In the FLE code, a server sends a new batch of notifications if it changes its vote or if it times out waiting for a new notification. This timeout value increases over time. I was actually thinking that we should reset the timeout value upon receiving a notification. I think this is a bug....

          Given that it is your proposal, I'd be happy to let you take a stab at it and help you out if you need a hand. Does it make sense for you?

          Show
          Flavio Junqueira added a comment - Hi Vishal, I like your proposal, it seems reasonable and not difficult to implement. On your questions: I don't think it is necessary to kill a pair SenderWorker/RecvWorker every time, and I'd certainly support changing it; I'm not sure where you're suggesting to introduce a delay. In the FLE code, a server sends a new batch of notifications if it changes its vote or if it times out waiting for a new notification. This timeout value increases over time. I was actually thinking that we should reset the timeout value upon receiving a notification. I think this is a bug.... Given that it is your proposal, I'd be happy to let you take a stab at it and help you out if you need a hand. Does it make sense for you?
          Hide
          Vishal Kher added a comment -

          Hi Flavio,

          I have a suggestion for changing the blocking IO code in QuorumCnxManager. It keeps the current code structure and requires a small amount of changes. I am not sure if these comments should go in ZOOKEEPER-901. ZOOKEEPER-901 is probably addressing netty as well. Please feel free to close this JIRA if you intend to make all the changes as a part of ZOOKEEPER-901.

          Basically we jusy need to move parts of initiateConnection and receiveConnection to SenderWorker and ReceiveWorker.

          A. Current flow for receiving connection:
          1. accept connection in Listener.run()
          2. receiveConnection()

          • Read remote server's ID
          • Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID).
          • kill current set of SenderWorker and ReciveWorker threads
          • Start a new pair

          B Current flow for initiating connection:
          1. In connectOne(), connect if not already connected. else return.
          2. send my ID to the remote server
          3. if my ID < remote server disconnect and return
          4. if my ID > remote server

          • kill current set of SenderWorker and ReceiveWorkter threads for the remote server
          • Start a new pair

          Proposed changes:
          Move the code that performs any blocking IO in SenderWorker and ReceiveWorker.

          A. Proposed flow for receiving connection:
          1. accept connection in Listener.run()
          2. receiveConnection()

          • kill current set of SenderWorker and ReciveWorker threads
          • Start a new pair

          Proposed changed to SenderWorker:

          • Read remote server's ID
          • Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID).
          • Proceed to normal operation

          B Proposed flow for initiating connection:
          1. in connectOne(), return if already connected
          2. Start a new SenderWorker and ReceiveWorker pair
          2. In SenderWorker

          • connect to remote server
          • write my ID
          • if my ID < remote server disconnect and return (shutdown the pair).
          • Proceed to normal operation

          Questions:

          • In QuorumCnxManager, is it necessary to kill the current pair and restart a new one every time we receive a connect request?
          • In receiveConnection we may choose to reject an accepted connection if a thread in
            SenderWorker is in the process of connecting. Otherwise a server with ID <
            remote server may keep sending frequent connect request that will result in the
            remote server closing connections for this peer. But I think we add a delay
            before sending notifications, which might be good enough to prevent this
            problem.

          Let me know what you think about this. I can also help with the implementation.

          -Vishal

          Show
          Vishal Kher added a comment - Hi Flavio, I have a suggestion for changing the blocking IO code in QuorumCnxManager. It keeps the current code structure and requires a small amount of changes. I am not sure if these comments should go in ZOOKEEPER-901 . ZOOKEEPER-901 is probably addressing netty as well. Please feel free to close this JIRA if you intend to make all the changes as a part of ZOOKEEPER-901 . Basically we jusy need to move parts of initiateConnection and receiveConnection to SenderWorker and ReceiveWorker. A. Current flow for receiving connection: 1. accept connection in Listener.run() 2. receiveConnection() Read remote server's ID Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID). kill current set of SenderWorker and ReciveWorker threads Start a new pair B Current flow for initiating connection: 1. In connectOne(), connect if not already connected. else return. 2. send my ID to the remote server 3. if my ID < remote server disconnect and return 4. if my ID > remote server kill current set of SenderWorker and ReceiveWorkter threads for the remote server Start a new pair Proposed changes: Move the code that performs any blocking IO in SenderWorker and ReceiveWorker. A. Proposed flow for receiving connection: 1. accept connection in Listener.run() 2. receiveConnection() kill current set of SenderWorker and ReciveWorker threads Start a new pair Proposed changed to SenderWorker: Read remote server's ID Take action based on my ID and remote server's ID (disconnect and reconnect if my ID is > remote server's ID). Proceed to normal operation B Proposed flow for initiating connection: 1. in connectOne(), return if already connected 2. Start a new SenderWorker and ReceiveWorker pair 2. In SenderWorker connect to remote server write my ID if my ID < remote server disconnect and return (shutdown the pair). Proceed to normal operation Questions: In QuorumCnxManager, is it necessary to kill the current pair and restart a new one every time we receive a connect request? In receiveConnection we may choose to reject an accepted connection if a thread in SenderWorker is in the process of connecting. Otherwise a server with ID < remote server may keep sending frequent connect request that will result in the remote server closing connections for this peer. But I think we add a delay before sending notifications, which might be good enough to prevent this problem. Let me know what you think about this. I can also help with the implementation. -Vishal

            People

            • Assignee:
              Vishal Kher
              Reporter:
              Vishal Kher
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:

                Development