We ran into a situation where the cluster ended up with split brain when the log disk filled up on a node.
The ZK cluster(3 node) in question was being used as the metadata store for Pulsar. There was an outage in the Pulsar cluster, where two the ZK nodes filled up there log disks, causing the cluster to lose quorum. Once we rectified the full disk situation and restarted the nodes everything seemed to work, but we started getting a lot of log messages about UpdateMetadataLoop retrying. UpdateMetadataLoop is used to update bookkeeper ledger metadata. If it sees a write conflict it rereads the znode, checks whether the update needs to happen, applies it and writes. These retries were flooding the log on a subset of the brokers. It turned out that it was reading a znode with version 0, but when it tried the setData with version set to 0 it was failing because the znode had a version of 2 (there were many instances of this). After investigating this, we saw that the znode had a different stat and value on ZK-1 to that on ZK-0 & ZK-2.
We resolved the situation by deleting the log and snapshots from ZK-1 and restarting, at which point everything went back to normal. Had ZK-1 managed to become leader we would have been in a lot of trouble, but thankfully this didn't happen.
For the sequence of events that led to split brain, I'll refer to the following code.
The sequence of events that lead to the failure are:
|2020-01-04 01:56:56Z||ZK-2 fails to write to its transaction log due to disk full. ZK-2 is still participating in leader election. ZK-2 becomes a follower of ZK-1. ZK-1 sends TRUNC to ZK-2. truncLog.truncate on ZK-2 throws an exception because of the disk being full, and leaves the process in a broken state.|
|2020-01-04 02:35:23Z||ZK-2 removes 9 transaction logs from disk (bringing it from 100% to 19%). It doesn't recover because its in a broken state.|
|2020-01-09 08:57:33Z||ZK-1 fails to write to its transaction log due to disk full. Restarts as follower. Goes into loop of dropping from quorum (because it can't update transaction log)|
|2020-01-09 08:59:33Z||ZK-1 receives snapshot from leader (ZK-0) (at 1e00000000). ZK-1 persists snapshot, but fails to add subsequent transations to log due to lack of space. ZK-1 drops from quorum.|
|2020-01-09 09:00:12Z||ZK-1 joins quorum as follower. 1e00000000 is close enough to leader to receive TRUNC(1d0000001d). TRUNC fails because txnLog can't flush on close() in trunateLog. ZK-1 goes into loop, dropping and joining quorum.|
|2020-01-09 09:39:00Z||ZK-1 runs purgeTxnLog. Process doesn't recover due to truncation exception having broken FileTxnSnapLog.|
|2020-01-09 19:28:37Z||ZK-1 is restarted. ZK-1 joins quorum as follower. ZK-1 receives TRUNC(1d0000001d). In this case, txnLog.close() can succeed because there's nothing to flush. snapLog is closed. truncLog.truncate fails with "java.io.IOException: No log files found to truncate! This could happen if you still have snapshots from an old setup or log files were deleted accidentally or dataLogDir was changed in zoo.cfg.". It's true that there are no log files to truncate because the snapshot is at 1e00000000 which was received from the leader at 08:59 and nothing has been logged since. In any case, FileTxnSnapLog is in another inconsistent state. snapLog is closed. txnLog is closed, but nothing was ever written to it, so it looks like brand new.|
|2020-01-09 19:29:04Z||ZK-2 is restarted. ZK-2 & ZK-0 are now in a good state, so they can make progress. Transactions start to be logged.|
|2020-01-09 19:33:16Z||ZK-1 joins the quorum. As progress has been made, it receives a SNAP from the leader at 6b30001183a. It writes a snapshot, which ultimately calls FileSnap#serialize. Nothing hits the snapshot disk, because FileSnap is in closed state since 19:28. ZK-1 starts logging transactions to its log disk.|
|2020-01-09 19:42:00Z||We do a rolling restart of the cluster.|
|2020-01-09 19:45:11Z||ZK-1 loads the last snapshot that has been persisted to disk (1e00000000), and applies all log entries with zxid greater than the snapshot (6b30001183a onwards).|
|2020-01-09 19:47:35Z||ZK-2 & ZK-1 form a quorum, ZK-2 leading. ZK-1 reports its lastZxid as 6b30001b32f and gets a DIFF from ZK-2.|
From this point on, the cluster has split brain. ZK-1 is missing all transaction between 1e00000000 and 6bf0001183a.
There's a couple of failures in the code that could stop this problem.
- An exception in truncateLog should nuke the process. Even without the split brain occurring, the processes limped on in a broken state for days and required human intervention to get going again.
- snapLog and txnLog should be defensively nulled after they're closed.
- FileSnap#serialize should not fail silently if close=true. This is really bad. It should at least throw an exception.
The issue occurred with 3.4.13 running on a kubernetes cluster. The bad code paths still exist on master.