We were doing some systematic fault injection testing on the latest ZooKeeper stable release 3.6.2 and found an untimely network issue can cause ZooKeeper followers to hang: clients connected to this follower get stuck in their requests, while the follower would not rejoin the quorum for a long time.
Our overall experience through the fault injection testing is that ZooKeeper is robust to tolerate network issues (delay or packet loss) in various places. If a thread is doing a socket write and hangs in this operation due to the network fault, this thread will get stuck. In general, ZooKeeper can handle the issue correctly even though this thread hangs. For example, in a leader, if the `LearnerHandler` thread hangs in this way, the `QuorumPeer` (which is running the Leader#lead method) is able to confirm the stale PING state and abandon the problematic `QuorumPeer`.
However, in the latest ZooKeeper stable release 3.6.2, we found that if a network issue happens to occur while the `writePacket` method of the `Learner` class is executing, the entire follower can get stuck. In particular, the whole `QuorumPeer` thread would be blocked because it is calling the `writePacket` method. Unlike the other situations in which the fault could be tolerated by ZooKeeper, QuorumPeer itself and other threads did not detect or handle this issue. Therefore, this follower hangs in the sense that it is not able to communicate with the leader, and the leader will abandon this follower once it does not reply `PING` packets in time, although this follower still believes it's a follower.
Steps to reproduce are as follows:
- Start a cluster with 1 leader and 2 followers.
- Manually create some datanodes, and do some reads and writes.
- Inject network fault, either using a tool like `tcconfig` or the attached Byteman scripts.
- Once stuck, you may observe new requests to this follower would also get stuck.
The scripts for reproduction are provided in https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4.
We confirmed the issue also occurs in the latest master branch version.
The `writePacket` method can be invoked by 3 threads: FollowerRequestProcessor, SyncRequestProcessor, and QuorumPeer. In particular, we have these possible stack traces in the attached `stacktrace.md` in https://gist.github.com/functioner/ad44b5e457c8cb22eac5fc861f56d0d4.
There are two key issues. First, the network I/O is performed inside a synchronization block. Second, unlike the socket connection or read operations that are protected by timeouts in ZooKeeper, the socket (and OutputArchive with the socket OutputStream) write would not throw the timeout exception when the write is stuck (until the network issue is resolved). In this case, while the `QuorumPeer#readPacket` method contains a socket timeout, the reason that the follower (QuorumPeer) did not initiate the rejoin is because the QuorumPeer is blocked in writePacket and would not proceed to the receiving stage (`readPacket`), so no timeout exception is thrown to trigger the error handling.
When we are preparing to submit this bug report, we find that the
ZOOKEEPER-3575 has already proposed a fix to move the packet sending in learner to a separate thread. But the reason that we are still able to expose the symptom in the `master` branch is because the fix is somehow disabled by default with a configuration parameter `learner.asyncSending` that is not documented. We tried the fault injection testing on the `master` branch version with this parameter set to be true (`-Dlearner.asyncSending=true`) and found the symptom would be gone. Specifically, even though with the network issue, the packet writes to the leader would still be stuck, now the `QuorumPeer` thread would not be blocked and it can detect the issue during the phase of receiving packet from the leader thanks to the timeout in the socket read. As a result, the follower would be able to quickly go back to `LOOKING` and then `FOLLOWING` state again, while the problematic `LearnerSender` would be abandoned and recreated.
It seems that the fix in
ZOOKEEPER-3575 was not enabled by default is perhaps because it was not clear whether the issue could really occur.
We would like to first confirm this is an issue and hope the attached reproducing scripts can be helpful. Also, our testing shows this issue occurs not only in the shutdown phase as pointed out in
ZOOKEEPER-3575 but also in regular requests handling, which can be serious.
In addition, we would like to propose making the parameter `learner.asyncSending` default to be true so the fix can be enabled by default. A related improvement is to add a description of this parameter in the documentation. Otherwise, administrators would have to read the source code and to be lucky enough to stumble upon the beginning section of Learner.java to realize there is a parameter to fix the behavior.
Lastly, the async packet sending only appears in the `master` branch (and 3.7.x). There is no such fix or parameter in the latest stable release (3.6.2). We are wondering if this fix should be backported to the 3.6.x branch?
P.S. We may need to add a fault injection tool (as ZOOKEEPER-3601 suggests) so that we can provide a test for this issue.