Apache S4
  1. Apache S4
  2. S4-85

Improve handling of ZK connection changes

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.5.0
    • Fix Version/s: 0.5.0
    • Labels:
      None

      Description

      Currently we are handling ZK state changes a bit differently in two places.

      org.apache.s4.comm.topology.AssignmentFromZK:

          @Override
          public void handleStateChanged(KeeperState state) throws Exception {
              this.state = state;
              if (!state.equals(KeeperState.SyncConnected)) {
                  logger.warn("Session not connected for cluster [{}]: [{}]. Trying to reconnect", clusterName, state.name());
                  zkClient.close();
                  zkClient.connect(connectionTimeout, null);
                  handleNewSession();
              }
          }
      

      org.apache.s4.comm.topology.ClusterFromZK:

          @Override
          public void handleStateChanged(KeeperState state) throws Exception {
              this.state = state;
              if (!state.equals(KeeperState.SyncConnected)) {
                  logger.warn("Session not connected for cluster [{}]: [{}]. Trying to reconnect", clusterName, state.name());
                  zkClient.connect(connectionTimeout, null);
                  handleNewSession();
              }
          }
      

      In the first case we explicitly close the connection before trying to reconnect. Furthermore, I think we should only try to reconnect when the state is equals to KeeperState.Expired, since now we are closing the connection on a Disconnected event too.

        Activity

        Hide
        Matthieu Morel added a comment -

        Indeed we should actually never try to reconnect. According to Zookeeper's documentation, when a disconnected state is reached, the Zookeeper client automatically tries to reconnect, possibly to a different server from the Zookeeper ensemble. So we shouldn't do anything there.

        If the connection is expired, it means that Zookeeper server considered the node is lost, it removes related ephemeral nodes and notifies related watchers. In that case, reconnection logic would actually be quite complex, and it is safer to fail fast and actually kill the current node. The current partition will be picked automatically by a standby S4 node.

        Show
        Matthieu Morel added a comment - Indeed we should actually never try to reconnect. According to Zookeeper's documentation, when a disconnected state is reached, the Zookeeper client automatically tries to reconnect, possibly to a different server from the Zookeeper ensemble. So we shouldn't do anything there. If the connection is expired, it means that Zookeeper server considered the node is lost, it removes related ephemeral nodes and notifies related watchers. In that case, reconnection logic would actually be quite complex, and it is safer to fail fast and actually kill the current node. The current partition will be picked automatically by a standby S4 node.
        Hide
        Matthieu Morel added a comment -

        Patch available in branch S4-85

        Upon Zookeeper connection expiration notification, the node logs an error message and the VM shuts down.

        Show
        Matthieu Morel added a comment - Patch available in branch S4-85 Upon Zookeeper connection expiration notification, the node logs an error message and the VM shuts down.
        Hide
        Daniel Gómez Ferro added a comment -

        Patch is good and works as expected, +1

        Show
        Daniel Gómez Ferro added a comment - Patch is good and works as expected, +1
        Hide
        Matthieu Morel added a comment -

        Merged in piper commit 1f1ad90c8e94de60f780c2d7ff080e1e2346d505

        Note that I had to refactor tests slightly so that fail fast strategy would not lead to side effects during the tests (like: exit while running another test in the same VM)

        Show
        Matthieu Morel added a comment - Merged in piper commit 1f1ad90c8e94de60f780c2d7ff080e1e2346d505 Note that I had to refactor tests slightly so that fail fast strategy would not lead to side effects during the tests (like: exit while running another test in the same VM)

          People

          • Assignee:
            Matthieu Morel
            Reporter:
            Daniel Gómez Ferro
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development