Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.2.0
-
None
Description
The monitor-to-monitor socket communication has no reconnect logic to handle a network reset or transient network errors.
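As an illustration of the kind of reconnect logic that is missing, here is a minimal sketch of a bounded retry loop. It is not taken from the monitor source: `reconnectWithRetry`, `tryConnect`, `maxAttempts`, and `delay` are hypothetical names, and `tryConnect` stands in for whatever code would reopen the socket to the peer monitor.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not from the monitor source): retry a connection
// attempt a bounded number of times, waiting a fixed delay between tries.
// tryConnect is an assumed callback that returns true once the peer socket
// has been reopened successfully.
bool reconnectWithRetry(const std::function<bool()>& tryConnect,
                        int maxAttempts,
                        std::chrono::milliseconds delay)
{
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (tryConnect()) {
            return true;  // connection reestablished
        }
        if (attempt < maxAttempts) {
            std::this_thread::sleep_for(delay);  // wait before the next try
        }
    }
    return false;  // peer still unreachable after all attempts
}
```

With a loop like this, a caller would treat the peer as down only when the function returns false, so an outage shorter than roughly `maxAttempts * delay` would not be reported as a node failure.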
Analysis:
• During a network reset window of about 20 seconds, the open sockets detect no errors.
o The open sockets are dead, but the TCP/IP stack gives no indication that a socket is in an error condition.
• Once the network is restored, the ZooKeeper client library reports a CONNECTIONLOSS.
o However, the reconnect logic reestablishes the connection with the quorum.
• At EPOLL expiration time, the EPOLL logic reports “Not heard from peer=n” and treats the peer as Node Down.
o The node down logic deletes the corresponding znode, CZClient::WatchNodeDelete().
o All monitor processes continually check for expired znodes for each node in the cluster, including their own znode.
An expired znode is handled as a down node.
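The “Not heard from peer=n” check described above can be pictured as a last-heard timestamp per peer, compared against a timeout. The sketch below is only an illustration under that assumption: `PeerLiveness`, `heardFrom`, and `silentPeers` are hypothetical names, not the monitor's actual data structures, and the timeout value is chosen by the caller.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical tracker (not the monitor's real implementation): remembers
// when each peer was last heard from and reports the peers that have been
// silent for longer than the configured timeout.
class PeerLiveness {
public:
    explicit PeerLiveness(std::chrono::seconds timeout) : timeout_(timeout) {
        assert(timeout.count() > 0);
    }

    // Record that traffic arrived from `peer` at time `now`.
    void heardFrom(int peer, Clock::time_point now) { lastHeard_[peer] = now; }

    // Return the peers whose last message is older than the timeout.
    std::vector<int> silentPeers(Clock::time_point now) const {
        std::vector<int> silent;
        for (const auto& entry : lastHeard_) {
            if (now - entry.second > timeout_) {
                silent.push_back(entry.first);
            }
        }
        return silent;
    }

private:
    std::chrono::seconds timeout_;
    std::map<int, Clock::time_point> lastHeard_;
};
```

A silent peer is not necessarily a dead one: during a short network reset every peer can look silent. One option is to attempt a reconnect to each peer returned by `silentPeers` before its znode is deleted.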