Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Duplicate
-
None
-
None
Description
We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the following znode on server 1:
stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982 cZxid = 0xa240000381f ctime = Tue May 10 17:17:27 UTC 2022 mZxid = 0xa240000381f mtime = Tue May 10 17:17:27 UTC 2022 pZxid = 0xa240000381f cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x20006d5d6a10000 dataLength = 14 numChildren = 0
This znode was absent on server 2 and 3. A delete on the node failed everywhere.
delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982 Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
This makes sense as a mutable operation goes to the leader, which was server 3 and where the node is absent. Restarting server 1 did not fix the issue. Stopping server 1, removing the zookeeper database and version-2 directory, and starting the server fixed the issue.
Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially connected to server 2. The client later closed the session at around the same time as the then-leader server 2 was stopped:
2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
That both were closed/shutdown at the same time is something we do on our side. Perhaps there is a race in the handling of closing of sessions and the shutdown of the leader?
We did a dump of server 1-3, and server 1 had two sessions that did not exist in server 2 and 3, and there was one ephemeral node reported on 1 that was not reported on server 2 and 3. The ephemeral owner matched one of the two extraneous sessions. The dump from server 1 with the extra entries:
Global Sessions(8): ... 0x20006d5d6a10000 120000ms 0x2004360ecca0000 120000ms ... Sessions with Ephemerals (4): 0x20006d5d6a10000: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
Attachments
Issue Links
- is superceded by
-
ZOOKEEPER-4785 Txn loss due to race condition in Learner.syncWithLeader() during DIFF sync
- Resolved
-
ZOOKEEPER-4712 Follower.shutdown() and Observer.shutdown() do not correctly shutdown the syncProcessor, which may lead to data inconsistency
- Closed
-
ZOOKEEPER-4394 Learner.syncWithLeader got NullPointerException
- Closed
- links to