ZooKeeper / ZOOKEEPER-22

Automatic request retries on connect failover

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 3.5.0
    • Component/s: c client, java client
    • Labels:
      None

      Description

      Moved from SourceForge to Apache.
      http://sourceforge.net/tracker/index.php?func=detail&aid=1831412&group_id=209147&atid=1008547

      When a connection to a ZooKeeper server fails, all of the pending requests
      will return an error. In reality the requests should be resubmitted when
      the client reestablishes a connection to ZooKeeper.

      For read requests, it's no big deal to just reissue the request. For update
      requests, the ZooKeeper server must be able to detect whether the request has
      already been processed and, if so, return the result of the previous execution;
      otherwise, it should process the request.
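The update-request rule above can be illustrated with a small toy model. This is only a sketch, not the ZooKeeper implementation: the `DedupServer` class, its `completed` table, and the choice to identify a request by a `(session_id, xid)` pair are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DedupServer:
    """Toy server that remembers the outcome of every update request.

    Hypothetical model: a request is identified by (session_id, xid).
    """
    data: Dict[str, bytes] = field(default_factory=dict)
    # (session_id, xid) -> result of the first execution, errors included
    completed: Dict[Tuple[int, int], str] = field(default_factory=dict)
    executions: int = 0

    def create(self, session_id: int, xid: int, path: str, value: bytes) -> str:
        key = (session_id, xid)
        if key in self.completed:
            # Already processed: return the earlier result, do not re-execute.
            return self.completed[key]
        self.executions += 1
        if path in self.data:
            result = "NODEEXISTS"
        else:
            self.data[path] = value
            result = "OK"
        self.completed[key] = result
        return result


server = DedupServer()
first = server.create(1, 7, "/lock", b"a")   # executed
retry = server.create(1, 7, "/lock", b"a")   # same xid: result is replayed
other = server.create(1, 8, "/lock", b"a")   # new xid: executed, node exists
```

Without the `completed` table, the resubmitted create would come back as `NODEEXISTS` even though the client's own first attempt had succeeded, which is why a blind retry is not enough for updates.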

      1. zookeeper-22.pdf
        33 kB
        Mahadev konar
      2. zookeeper-22.docx
        150 kB
        Mahadev konar

        Issue Links

          Activity

          james strachan added a comment -

          BTW this discussion came up recently on the dev lists too...

          http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/200807.mbox/%3cec6e67fd0807180945vd72ac6axcfd0851789fb6e5c@mail.gmail.com%3e

          To be able to retry operations on connection close (or due to session expiration), there is a patch in https://issues.apache.org/jira/browse/ZOOKEEPER-78

          which adds a ZooKeeperFacade for dealing with reconnecting on session expiration, and some helper methods in ProtocolSupport for retrying synchronous operations or blocks of code in light of connection failures.

          james strachan added a comment -

          BTW you can see the code for ProtocolSupport and ZooKeeperFacade as I've checked in the patch for ZOOKEEPER-78 into a temporary sandbox area, details here

          Benjamin Reed added a comment -

          It turns out that all the information needed to do this is split between server and client. The server pushes all updates through the atomic broadcast, even errors. So if the client resends the pending requests to the server when it reconnects, the server should be able to either replay the responses or execute the request. This would eliminate the annoying-to-deal-with CONNECTIONLOSS error.
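A rough simulation of that idea follows. It is a sketch under stated assumptions: `FakeServer` and `RetryingClient` are hypothetical names, a single `FakeServer` object stands in for the replicated state that every server would share through the atomic broadcast, and requests are matched by a client-assigned `xid`.

```python
from collections import OrderedDict


class FakeServer:
    """Stand-in for the replicated server state; replays results for known xids."""

    def __init__(self):
        self.results = {}   # xid -> result of the first execution
        self.log = []       # operations actually applied, in order

    def submit(self, xid, op):
        if xid not in self.results:      # first time seen: execute
            self.log.append(op)
            self.results[xid] = f"done:{op}"
        return self.results[xid]         # otherwise: replay the old response


class RetryingClient:
    """Keeps unanswered requests and resends them after reconnecting."""

    def __init__(self, server):
        self.server = server
        self.connected = True
        self.next_xid = 1
        self.pending = OrderedDict()     # xid -> op, still awaiting a response
        self.responses = {}

    def send(self, op):
        xid = self.next_xid
        self.next_xid += 1
        self.pending[xid] = op
        if self.connected:
            self._flush()
        return xid

    def _flush(self):
        # Resend in submission order so the original ordering is preserved.
        for xid, op in list(self.pending.items()):
            self.responses[xid] = self.server.submit(xid, op)
            del self.pending[xid]

    def disconnect(self):
        self.connected = False           # pending requests are kept, not failed

    def reconnect(self, server):
        self.server = server
        self.connected = True
        self._flush()


server = FakeServer()
client = RetryingClient(server)
client.disconnect()
xid = client.send("set /a")      # queued while the connection is down
server.submit(xid, "set /a")     # simulate: request arrived, response was lost
client.reconnect(server)         # resend; the server replays instead of re-running
```

After the reconnect, `server.log` holds `"set /a"` only once, and the client receives a normal response where it would otherwise have seen a connection-loss error.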

          Ted Dunning added a comment -


          Is there progress on this issue?

          Mahadev konar added a comment -

          Ted, due to some laziness on my side, I haven't made much progress on this. I expect to make good progress next week and hope to post a patch within a week or two.

          Ted Dunning added a comment -

          I wouldn't call it laziness. At most distraction.

          But a lot of ZK users will breathe a sigh of relief when this fix gets deployed!

          Thanks for your efforts on this.

          Patrick Hunt added a comment -

          Mahadev is working on it, but has been sidelined by the 3.1 and 3.2 fix releases. I believe patches should be landing soon; this is still planned for 3.3.0.

          Mahadev konar added a comment -

          Sorry folks, I had been working on this jira for some time and got sidetracked by other issues. In a while I will upload a proposal for this jira describing what users should expect. Please feel free to take a look and comment. I have been making some progress on this and will try to get it in soon.

          thanks

          Mahadev konar added a comment -

          Here is the design document for zookeeper-22. I realised that the scope of zookeeper-22 is much bigger than I had anticipated. It involves extensive changes to the leader, to connect requests from the client, and to the clean-up scripts.

          Version checking needs to be introduced with this patch so that old clients work with the new servers and old servers work with the new client, keeping it all backwards compatible.

          Given the scope of this jira, I will be creating sub-jiras and marking them as part of this jira, since a humongous patch would be hard to get in given that it would touch all critical parts of zookeeper.

          Henry Robinson added a comment -

          Mahadev -

          Exciting! Any chance you could post a pdf or html version? - .docx are hard to read on various systems.

          Henry

          Mahadev konar added a comment -

          I am attaching a PDF version of the proposal. Comments are welcome.

          qing yan added a comment -

          One suggestion: can we make auto retry an option rather than mandatory?
          My concern is: what if the client wants to abort the operation after receiving a CONNECTION LOSS event?
          It would need to either kill the thread or issue an explicit undo operation afterwards, which is kind of awkward...
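One way such an opt-out could look is sketched below. Everything here is hypothetical: the `ConnectionLoss` exception, the `call` helper, and its `auto_retry` and `max_attempts` parameters are invented for illustration and are not part of any ZooKeeper client API.

```python
class ConnectionLoss(Exception):
    """Hypothetical stand-in for the client's connection-loss error."""


def call(op, auto_retry=True, max_attempts=3):
    """Run op(); retry on ConnectionLoss only if the caller opted in."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return op()
        except ConnectionLoss:
            # With auto_retry off, the error surfaces immediately so the
            # application can decide to abort instead of undoing later.
            if not auto_retry or attempts >= max_attempts:
                raise


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"left": failures, "calls": 0}

    def op():
        state["calls"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionLoss()
        return "ok"

    return op, state


op, state = make_flaky(2)
result = call(op)                      # retried transparently

op2, state2 = make_flaky(1)
try:
    call(op2, auto_retry=False)        # caller sees the error right away
except ConnectionLoss:
    aborted = True
```

With `auto_retry=False` the operation is attempted exactly once, so an application that prefers to abort never has to cancel a retry that is already in flight.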

          qing yan added a comment -

          Here the CONNECTION LOSS event refers to the situation where the connection to the ZK cluster (a quorum of ZK nodes) is lost and the application needs to enter
          safe mode, whereas a broken connection to a particular node, followed by failover to another node, can be handled transparently.


            People

            • Assignee:
              Mahadev konar
              Reporter:
              Patrick Hunt
            • Votes:
              0
              Watchers:
              12
