ZooKeeper / ZOOKEEPER-4541

Ephemeral znode owned by closed session visible in 1 of 3 servers


Details

    Description

      We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:

      stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
      cZxid = 0xa240000381f
      ctime = Tue May 10 17:17:27 UTC 2022
      mZxid = 0xa240000381f
      mtime = Tue May 10 17:17:27 UTC 2022
      pZxid = 0xa240000381f
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x20006d5d6a10000
      dataLength = 14
      numChildren = 0
      

      This znode was absent on servers 2 and 3. Deleting it failed on every server:

      delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
      Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
      

      This makes sense because a mutating operation goes to the leader, which was server 3, where the znode is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the ZooKeeper database and version-2 directory, and starting the server again fixed it.
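      The reasoning above can be illustrated with a toy model. This is not ZooKeeper code: the class, its methods, and the short path are hypothetical, and the only behavior it assumes is the one described in this report, namely that reads are answered from the connected server's own copy of the tree while writes are validated against the leader's state.

```python
# Toy model only: not ZooKeeper code. Each server holds its own copy of the
# tree; reads are answered locally, writes are validated by the leader.
class ToyEnsemble:
    def __init__(self, trees, leader):
        self.trees = trees      # server id -> set of znode paths
        self.leader = leader    # server id of the current leader

    def exists(self, server, path):
        # Reads (stat, get, ls) are served from the connected server's copy.
        return path in self.trees[server]

    def delete(self, server, path):
        # The connected server forwards the write to the leader, which checks
        # the request against its own state before applying it anywhere.
        if path not in self.trees[self.leader]:
            return "Node does not exist: " + path
        for tree in self.trees.values():
            tree.discard(path)
        return "deleted"


# Hypothetical short path standing in for the real lock znode.
STALE = "/lock2/stale-lock"
ensemble = ToyEnsemble({1: {STALE}, 2: set(), 3: set()}, leader=3)

seen_on_1 = ensemble.exists(1, STALE)      # server 1 still has the znode
delete_via_1 = ensemble.delete(1, STALE)   # rejected: leader 3 has no such znode
still_on_1 = ensemble.exists(1, STALE)     # nothing was committed, so it remains
```

      In this model the stale znode stays visible on server 1 no matter which server receives the delete, which matches the behavior we observed.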

      Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially connected to server 2. The client later closed the session at around the same time as the then-leader server 2 was stopped:

      2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
      2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
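
      The IDs in the stat output can be unpacked as a sanity check. The sketch below assumes two layout conventions: a zxid carries the leader epoch in its high 32 bits and a counter in its low 32 bits, and a session ID carries the ID of the server that created it in its top byte. The second is an implementation detail of the server's session tracker and should be treated as an assumption here; the helper names are made up for illustration.

```python
def split_zxid(zxid):
    """Split a zxid into (epoch, counter): high 32 bits and low 32 bits."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def session_server_id(session_id):
    """Return the top byte of a 64-bit session ID.

    Assumption: the server that created the session stores its own ID there.
    """
    return (session_id >> 56) & 0xFF


# Values copied from the stat output in this report.
epoch, counter = split_zxid(0xa240000381f)
owner_server = session_server_id(0x20006d5d6a10000)

print(hex(epoch), hex(counter), owner_server)
```

      With the values from this report, the epoch is 0xa24, the counter is 0x381f, and the top byte of the session ID is 2, which would be consistent with the session having been created on server 2.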
      

      Closing the session and shutting down the server at the same time is something we do on our side. Perhaps there is a race between the handling of session close and the shutdown of the leader?

      We took a dump of servers 1-3. Server 1 had two sessions that did not exist on servers 2 and 3, and it reported one ephemeral node that servers 2 and 3 did not. The ephemeral owner matched one of the two extraneous sessions. The dump from server 1 with the extra entries:

      Global Sessions(8):
      ...
      0x20006d5d6a10000 120000ms
      0x2004360ecca0000 120000ms
      ...
      Sessions with Ephemerals (4):
      0x20006d5d6a10000:
      /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
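
      The comparison of dumps can be automated. The following is a minimal sketch that assumes the dump text has the line shapes shown above (`<session id> <timeout>ms` for sessions, `<session id>:` followed by paths for ephemerals). The function names and the two sample dumps are hypothetical; the samples are shortened stand-ins, not the real dumps, and the session `0x10000000000000a` and the `/example/...` paths are made up.

```python
import re

SESSION_RE = re.compile(r"^(0x[0-9a-fA-F]+)\s+\d+ms$")
OWNER_RE = re.compile(r"^(0x[0-9a-fA-F]+):$")


def parse_dump(text):
    """Return (sessions, ephemerals) parsed from dump text.

    sessions is a set of session IDs; ephemerals maps owner ID -> set of paths.
    """
    sessions, ephemerals, owner = set(), {}, None
    for raw in text.splitlines():
        line = raw.strip()
        match = SESSION_RE.match(line)
        if match:
            sessions.add(match.group(1))
            continue
        match = OWNER_RE.match(line)
        if match:
            owner = match.group(1)
            ephemerals.setdefault(owner, set())
            continue
        if line.startswith("/") and owner is not None:
            ephemerals[owner].add(line)
    return sessions, ephemerals


def extras(reference, other):
    """Sessions and (owner, path) pairs present in reference but not in other."""
    ref_sessions, ref_ephemerals = reference
    other_sessions, other_ephemerals = other
    extra_sessions = ref_sessions - other_sessions
    extra_paths = {
        (owner, path)
        for owner, paths in ref_ephemerals.items()
        for path in paths
        if path not in other_ephemerals.get(owner, set())
    }
    return extra_sessions, extra_paths


# Hypothetical, shortened sample dumps for illustration only.
SERVER1 = """\
Global Sessions(3):
0x10000000000000a 120000ms
0x20006d5d6a10000 120000ms
0x2004360ecca0000 120000ms
Sessions with Ephemerals (2):
0x10000000000000a:
/example/live
0x20006d5d6a10000:
/example/lock2/stale-lock
"""
SERVER2 = """\
Global Sessions(1):
0x10000000000000a 120000ms
Sessions with Ephemerals (1):
0x10000000000000a:
/example/live
"""

extra_sessions, extra_paths = extras(parse_dump(SERVER1), parse_dump(SERVER2))
print(sorted(extra_sessions))
print(sorted(extra_paths))
```

      Running the same comparison in both directions for each pair of servers would show whether the divergence is limited to extra entries on one server, as it was here.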
      


            People

              Assignee: Unassigned
              Reporter: Håkon Hallingstad