ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-84

provide a mechanism to reconnect a ZooKeeper if a client receives a SessionExpiredException

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: java client
    • Labels:
      None

      Description

      am about to attach a patch which adds a reconnect() method to easily re-establish a connection if a session expires - along with a toString() implementation for easier debugging

        Activity

        Hide
        Mahadev konar added a comment -

        james,
        I do not understand where you want ot add a reconnect? Can you explain what this jira is about?

        Show
        Mahadev konar added a comment - james, I do not understand where you want ot add a reconnect? Can you explain what this jira is about?
        Hide
        james strachan added a comment -

        about to submit a match - whoops forgot to add it

        Show
        james strachan added a comment - about to submit a match - whoops forgot to add it
        Hide
        james strachan added a comment -

        sorry I forgot to add the patch before here it is now, hopefully this will make more sense now

        Show
        james strachan added a comment - sorry I forgot to add the patch before here it is now, hopefully this will make more sense now
        Hide
        Patrick Hunt added a comment -

        Assigning to James.

        Show
        Patrick Hunt added a comment - Assigning to James.
        Hide
        Mahadev konar added a comment -

        james,
        here is what the current sematics is —

        • there is one on one mapping between a client and a session
        • if a session expires the client becomes invalid
        • the client can reconnect to the servers if they die only to renew the session. If the session is invalid this client in invalid.

        originally we allowed a client to reconnect and create a new session and foudn that our users were getting more and more confused with such kind of sematincs and found that the above sematics made things easier for them. So I think we should not reconnect and try creating a new session with the same client. does that make sense?

        Show
        Mahadev konar added a comment - james, here is what the current sematics is — there is one on one mapping between a client and a session if a session expires the client becomes invalid the client can reconnect to the servers if they die only to renew the session. If the session is invalid this client in invalid. originally we allowed a client to reconnect and create a new session and foudn that our users were getting more and more confused with such kind of sematincs and found that the above sematics made things easier for them. So I think we should not reconnect and try creating a new session with the same client. does that make sense?
        Hide
        james strachan added a comment -

        I'm kinda confused by that

        So what am I meant to do if I get a SessionExpiredException? I want to reconnect with a new session; as the current connection is totally useless and invalid.

        Right now the only option I can see is to recreate from scratch the entire ZooKeeper object?

        If so that means we've gotta introduce a ZooKeeperProxy thingy which wraps the ZooKeeper object and just lets it be recreated if a SessionExpiredException occurs. Seems kinda unnecessary when really a ZooKeeper instance is capable of easily recreating the connection to the ZK server when the session is expired.

        Maybe the issue is the phrase reconnect() - maybe a method called recreateNewSession() is better? We could also document it that this is only to be called if your client becomes invalid because the ZK server has expired the session?

        Show
        james strachan added a comment - I'm kinda confused by that So what am I meant to do if I get a SessionExpiredException? I want to reconnect with a new session; as the current connection is totally useless and invalid. Right now the only option I can see is to recreate from scratch the entire ZooKeeper object? If so that means we've gotta introduce a ZooKeeperProxy thingy which wraps the ZooKeeper object and just lets it be recreated if a SessionExpiredException occurs. Seems kinda unnecessary when really a ZooKeeper instance is capable of easily recreating the connection to the ZK server when the session is expired. Maybe the issue is the phrase reconnect() - maybe a method called recreateNewSession() is better? We could also document it that this is only to be called if your client becomes invalid because the ZK server has expired the session?
        Hide
        james strachan added a comment -

        the ZooKeeper will reconnect by default if the socket fails right? So I'm only really talking about reconectingWithNewSession() in those rare cases that the ZK server times out the session

        Show
        james strachan added a comment - the ZooKeeper will reconnect by default if the socket fails right? So I'm only really talking about reconectingWithNewSession() in those rare cases that the ZK server times out the session
        Hide
        Benjamin Reed added a comment -

        The problem with the reconnect method is that it is easy to use but very hard to use correctly. In general, we try to avoid giving programmers methods that make it easy to hang themselves. We provide plenty of rope, and they are free to make a noose; we just don't provide a tieTheKnot method.

        The basic problem has to do with clean up that happens when a session dies. Obviously ephemeral nodes will go away. Our watch reestablishment patches could be updated to make watches reestablish on reconnect, but that would be very confusing to have ephemerals disappear but watches get reestablished.

        To ZooKeeper an expired session is a dead client, so by making the ZooKeeper service handle die as well we have clear semantics.

        As an historical note, session expiration wasn't always a fatal thing. We used to delivered the session expiration event and then get a new session. The earliest problem our users ran into was with multi-threaded applications. If there are multiple threads going, the watcher object will get the session expired event; the other threads may still be trying to use the ZooKeeper object. If you use the same ZooKeeper object to reconnect with, you can easily end up with a thread that uses the object thinking that it is running in the context of an earlier session and really mess things up. The explicit reconnect alleviates this slightly, but you would still need to do synchronization around the ZooKeeper object and it would be very difficult to get working when you are using different libraries using the same ZooKeeper object.

        You are working on a lock class; you can imagine someone else does a barrier class. Who is in charge of doing the reconnect() and deciding when it is safe?

        Testing applications that use this would also be a pain given the number of race conditions involved.

        So the safest thing to do is have the lock object become invalid when it's ZooKeeper object becomes invalid. An expired session is a really bad thing, we can't hide it nicely.

        Show
        Benjamin Reed added a comment - The problem with the reconnect method is that it is easy to use but very hard to use correctly. In general, we try to avoid giving programmers methods that make it easy to hang themselves. We provide plenty of rope, and they are free to make a noose; we just don't provide a tieTheKnot method. The basic problem has to do with clean up that happens when a session dies. Obviously ephemeral nodes will go away. Our watch reestablishment patches could be updated to make watches reestablish on reconnect, but that would be very confusing to have ephemerals disappear but watches get reestablished. To ZooKeeper an expired session is a dead client, so by making the ZooKeeper service handle die as well we have clear semantics. As an historical note, session expiration wasn't always a fatal thing. We used to delivered the session expiration event and then get a new session. The earliest problem our users ran into was with multi-threaded applications. If there are multiple threads going, the watcher object will get the session expired event; the other threads may still be trying to use the ZooKeeper object. If you use the same ZooKeeper object to reconnect with, you can easily end up with a thread that uses the object thinking that it is running in the context of an earlier session and really mess things up. The explicit reconnect alleviates this slightly, but you would still need to do synchronization around the ZooKeeper object and it would be very difficult to get working when you are using different libraries using the same ZooKeeper object. You are working on a lock class; you can imagine someone else does a barrier class. Who is in charge of doing the reconnect() and deciding when it is safe? Testing applications that use this would also be a pain given the number of race conditions involved. So the safest thing to do is have the lock object become invalid when it's ZooKeeper object becomes invalid. An expired session is a really bad thing, we can't hide it nicely.
        Hide
        james strachan added a comment -

        I hear you

        So an Elect Leader or Write Lock protocol has to deal with expired sessions and create new sessions; at some point someone has to recreate something. You can pass the buck and say we're not gonna allow the ZooKeeper to reconnect. Then say we're not allowed to have the WriteLock reconnect, then the next and next layer of the onion. But eventually there's gonna be something somewhere that recreates a session

        For now I'll work on the assumption we're gonna have to have an object which is a wrapper around a ZooKeeper so that it can handle reconnections by just discarding one ZooKeeper instance and creating another. This object could be shared across Protocols (we might wanna reuse one connection with ZK to make multiple locks for example).

        Show
        james strachan added a comment - I hear you So an Elect Leader or Write Lock protocol has to deal with expired sessions and create new sessions; at some point someone has to recreate something. You can pass the buck and say we're not gonna allow the ZooKeeper to reconnect. Then say we're not allowed to have the WriteLock reconnect, then the next and next layer of the onion. But eventually there's gonna be something somewhere that recreates a session For now I'll work on the assumption we're gonna have to have an object which is a wrapper around a ZooKeeper so that it can handle reconnections by just discarding one ZooKeeper instance and creating another. This object could be shared across Protocols (we might wanna reuse one connection with ZK to make multiple locks for example).
        Hide
        james strachan added a comment -

        You can mark this issue as RESOLVED/WILL_NOT_FIX if you like now - I've implemented a ZooKeeperFacade to wrap up the reconnectWithNewSession() logic for ZOOKEEPER-78

        Show
        james strachan added a comment - You can mark this issue as RESOLVED/WILL_NOT_FIX if you like now - I've implemented a ZooKeeperFacade to wrap up the reconnectWithNewSession() logic for ZOOKEEPER-78
        Hide
        Patrick Hunt added a comment -

        This has been a really interesting discussion, appreciate everyone digging in. Before closing this it would be great if we could capture this somewhere. James isn't going to be the only one encountering these issues.

        Ben I'll assign to you, can you document this on the ASF wiki, then close? Thanks.

        Show
        Patrick Hunt added a comment - This has been a really interesting discussion, appreciate everyone digging in. Before closing this it would be great if we could capture this somewhere. James isn't going to be the only one encountering these issues. Ben I'll assign to you, can you document this on the ASF wiki, then close? Thanks.
        Hide
        Patrick Hunt added a comment -

        Assigning to Ben to document on the wiki

        Show
        Patrick Hunt added a comment - Assigning to Ben to document on the wiki
        Hide
        james strachan added a comment -

        If ZOOKEEPER-78 ever gets committed (hint, hint we can just refer folks to the ZooKeeperFacade if ever folks hit the SessionExpiredException

        Show
        james strachan added a comment - If ZOOKEEPER-78 ever gets committed (hint, hint we can just refer folks to the ZooKeeperFacade if ever folks hit the SessionExpiredException
        Hide
        james strachan added a comment -

        BTW here is the code for ZooKeeperFacade as I've checked in the patch for ZOOKEEPER-78 into a temporary sandbox area, details here

        Show
        james strachan added a comment - BTW here is the code for ZooKeeperFacade as I've checked in the patch for ZOOKEEPER-78 into a temporary sandbox area, details here
        Hide
        Patrick Hunt added a comment -

        If I read the comments correctly the attached patch seems to be moot. Also the last comment references a SVN repository, please attach the patch as a "svn diff" and resubmit for review.

        Show
        Patrick Hunt added a comment - If I read the comments correctly the attached patch seems to be moot. Also the last comment references a SVN repository, please attach the patch as a "svn diff" and resubmit for review.
        Hide
        Mahadev konar added a comment -

        marking this as wont fix – since we agreed that the one to one correspondence between a zookeeper client and session keeps things simple for the users.

        Show
        Mahadev konar added a comment - marking this as wont fix – since we agreed that the one to one correspondence between a zookeeper client and session keeps things simple for the users.
        Hide
        Benjamin Reed added a comment -

        I've documented why Facade is a bad idea on the wiki: http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

        Show
        Benjamin Reed added a comment - I've documented why Facade is a bad idea on the wiki: http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

          People

          • Assignee:
            james strachan
            Reporter:
            james strachan
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development