ZooKeeper / ZOOKEEPER-1310

C API should use state CONNECTION_LOSS

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: c client
    • Labels:
      None
    • Environment:

      Linux

      Description

      I would like ZooKeeper to notify my watcher (the one I pass to zookeeper_init) about CONNECTION_LOSS. Right now the watcher doesn't know that the connection is lost, so I can't react to it.

      What do you think? If this makes sense, I could try to create a patch.

        Activity

        Jakub Lekstan created issue -
        Marshall McMullen added a comment -

        Your watcher should get notified that you've transitioned from CONNECTED to CONNECTING or ASSOCIATING. You could use that to infer that you're in the equivalent of CONNECTION_LOSS.

        Jakub Lekstan added a comment -

        I'm not getting the CONNECTING state either.

        Marshall McMullen added a comment -

        Maybe I've got the semantics wrong, but I had thought it's not the watch that gets called, but the session event handler. For example, when you call zookeeper_init you give it a function and a context to be called for session state changes. I believe that function should get called with the CONNECTING state when the client disconnects in order to reconnect to a new server.

        The other part of this is the C API does some implicit session expiration stuff for you under the hood. When half the session timeout goes by without getting a response from the server it's connected to, it prematurely disconnects from that server and tries to connect to a new one.

        On the other hand, I do recall reading somewhere that it's a known limitation of ZK that you don't get notified of a session loss until the client gets reconnected to the ensemble.... Maybe others can chime in with clarification and give us more specifics.
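
A minimal sketch of the inference described in this comment: treat a transition from CONNECTED to CONNECTING or ASSOCIATING as the equivalent of CONNECTION_LOSS. The enum and the `conn_tracker` helpers are hypothetical names used only for illustration. In a real session watcher (the function passed to zookeeper_init) the `state` argument would instead be compared against `ZOO_CONNECTING_STATE`, `ZOO_ASSOCIATING_STATE` and `ZOO_CONNECTED_STATE` from zookeeper.h when `type == ZOO_SESSION_EVENT`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the session states named in this thread.
 * A real watcher would use the ZOO_*_STATE constants from zookeeper.h. */
typedef enum {
    ST_CONNECTING,
    ST_ASSOCIATING,
    ST_CONNECTED,
    ST_EXPIRED
} session_state;

typedef struct {
    bool connected;  /* true while the last reported state was CONNECTED */
    int losses;      /* number of CONNECTED -> CONNECTING/ASSOCIATING transitions */
} conn_tracker;

void tracker_init(conn_tracker *t) {
    assert(t != NULL);
    t->connected = false;
    t->losses = 0;
}

/* Feed one session-state notification into the tracker.
 * Returns true only when the notification is a transition from CONNECTED
 * to CONNECTING or ASSOCIATING, i.e. an inferred connection loss. */
bool tracker_on_state(conn_tracker *t, session_state s) {
    assert(t != NULL);
    bool lost = t->connected && (s == ST_CONNECTING || s == ST_ASSOCIATING);
    if (lost) {
        t->losses++;
    }
    t->connected = (s == ST_CONNECTED);
    return lost;
}
```

Keeping the tracker separate from the watcher keeps the callback short: the watcher would only translate the library's state value and call `tracker_on_state`, and the application decides what to do when it returns true.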

        Ted Dunning added a comment -

        Marshall McMullen wrote: "On the other hand, I do recall reading somewhere that it's a known limitation of ZK that you don't get notified of a session loss until the client gets reconnected to the ensemble.... Maybe others can chime in with clarification and give us more specifics."

        It isn't so much a limitation of ZK as a limitation of physics. ZK can tell you when the connection is lost, but until the connection is re-established, it can't tell you that your session was expired by the server, because only the server really knows that. Even running a local timer is not a reliable indicator, because the quorum may be down and not registering the passage of time. In fact, you can't even know very reliably how much time has passed, due to effects like clock stretching in VMs or the ever-present risk of somebody setting the clock.

        Mark Gius added a comment -

        I don't think the bug reporter is asking about ZooKeeper session-level information, but rather for more information about the underlying socket connection to ZooKeeper.

        I'm running into similar problems where the C API gets disconnected from an endpoint that isn't coming back. I get a single CONNECTING event, and then the C bindings spin more or less forever trying to reconnect to the now-defunct endpoint. To make informed decisions about what to do, the calling code needs the client to report (probably via the session event mechanism) that a connection has failed X number of times, or to report periodically that the connection is still down.

        Once this information is exposed to the caller, they can act on it. For example, if I can detect that my endpoint is no longer present before ZooKeeper decides to expire my session, I may be able to salvage my ephemeral nodes if I can re-establish a connection to the cluster fast enough.
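
Until the client exposes this itself, a caller can approximate the "connection is still down" signal on its own side. The sketch below is a hypothetical helper, not part of the ZooKeeper C API: the application marks the outage start when its watcher sees CONNECTING, marks it over on CONNECTED, and polls from its own loop. It assumes the caller supplies timestamps from a monotonic clock (for example `clock_gettime(CLOCK_MONOTONIC, ...)`) and a session timeout in milliseconds (for example the value returned by `zoo_recv_timeout`). As noted earlier in the thread, this is only a local estimate; the server alone decides when a session expires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caller-side record of the current outage, if any. */
typedef struct {
    bool down;           /* true between a disconnect and the next reconnect */
    int64_t down_since;  /* caller-supplied monotonic timestamp in ms */
} outage_clock;

void outage_init(outage_clock *o) {
    assert(o != NULL);
    o->down = false;
    o->down_since = 0;
}

/* Call when the watcher reports CONNECTING. Repeated calls during the same
 * outage keep the original start time. */
void outage_mark_down(outage_clock *o, int64_t now_ms) {
    assert(o != NULL);
    if (!o->down) {
        o->down = true;
        o->down_since = now_ms;
    }
}

/* Call when the watcher reports CONNECTED. */
void outage_mark_up(outage_clock *o) {
    assert(o != NULL);
    o->down = false;
}

/* Milliseconds spent disconnected so far, or 0 while connected. */
int64_t outage_elapsed_ms(const outage_clock *o, int64_t now_ms) {
    assert(o != NULL);
    return o->down ? now_ms - o->down_since : 0;
}

/* True once the outage has consumed at least `percent` of the session
 * timeout, which is the caller's cue to take corrective action. */
bool outage_exceeds(const outage_clock *o, int64_t now_ms,
                    int64_t session_timeout_ms, int percent) {
    assert(o != NULL);
    return o->down &&
           outage_elapsed_ms(o, now_ms) * 100 >= session_timeout_ms * percent;
}
```

A caller might poll `outage_exceeds(&o, now, timeout, 50)` from its main loop and, when it returns true, close the handle and reinitialize with a fresh server list while there may still be time to keep the session.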

        Michi Mutsuzaki added a comment -

        Hi Mark,

        The ZooKeeper C client is supposed to handle reconnects. When the client goes into ZOO_CONNECTING_STATE, it's supposed to contact the next server in the server list provided to zookeeper_init(). It's a bug if it keeps hitting a server that's down. Please open a separate JIRA for that.

        I'm not sure if this bug is valid. As Marshall pointed out, you can already tell when you lose a connection by checking for the CONNECTING state. We should definitely document the timeout/disconnect/reconnect behavior in more detail in zookeeper.h.

        --Michi

        Mark Gius added a comment -

        I would be happy to open up another ticket for the issues I've been having if this bug is invalid or unrelated to them.

        The ZooKeeper C client does in fact try another server in the list when the actively connected socket closes. However, the client gives NO indication to the caller that ALL of the endpoints provided to zookeeper_init are "not responding."

        If you have a set of ZooKeeper endpoints that changes fairly frequently, it is entirely possible that a very old instance of a client is sitting out there with a server list that includes only one valid endpoint. If that endpoint goes down, the client correctly generates a CONNECTING event. Having generated this event, the client tries each server in the list, waits a little while, and repeats. It does this without informing the caller that this loop is occurring. As a caller, I cannot make an informed decision about what is actually going on right now.

        Michi Mutsuzaki added a comment -

        Hi Mark,

        It sounds like what you are describing is related to dynamic reconfiguration (ZOOKEEPER-107). There is a sub-task to add support for addServer/removeServer in the client API. Do you think you can use this in your scenario?

        https://issues.apache.org/jira/browse/ZOOKEEPER-107
        https://issues.apache.org/jira/browse/ZOOKEEPER-762

        Thanks!
        --Michi

        Michi Mutsuzaki added a comment -

        Hi Jakub,

        It's hard to tell why you are not getting a callback for the CONNECTING state without looking at the code. Could you post a code snippet here? A debug log would be useful as well.

        Thanks!
        --Michi


          People

          • Assignee: Unassigned
          • Reporter: Jakub Lekstan