Description
I found that in the current KRaft implementation, when a network partition happens between the current controller leader and the other controller nodes, a "split brain" issue can occur: two leaders exist in the controller cluster, and two inconsistent sets of metadata are returned to clients.
Root cause
In KIP-595, we said a voter will begin a new election under three conditions (a sketch of these triggers follows the list):
1. If it fails to receive a FetchResponse from the current leader before expiration of quorum.fetch.timeout.ms
2. If it receives an EndQuorumEpoch request from the current leader
3. If it fails to receive a majority of votes before expiration of quorum.election.timeout.ms after declaring itself a candidate.
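For illustration, here is a minimal sketch of these three triggers on the voter side. The class, method, and field names are hypothetical and not the actual KafkaRaftClient internals; this only shows the rule structure described above.
{code:java}
/**
 * Illustrative sketch of the three election triggers from KIP-595.
 * Names are hypothetical; this is not the real KRaft event loop.
 */
class VoterElectionTriggers {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;
    private int epoch = 0;
    private long lastFetchResponseMs;
    private long candidateSinceMs;
    private final long quorumFetchTimeoutMs;    // quorum.fetch.timeout.ms
    private final long quorumElectionTimeoutMs; // quorum.election.timeout.ms

    VoterElectionTriggers(long fetchTimeoutMs, long electionTimeoutMs) {
        this.quorumFetchTimeoutMs = fetchTimeoutMs;
        this.quorumElectionTimeoutMs = electionTimeoutMs;
    }

    // Record a successful FetchResponse from the current leader.
    void onFetchResponse(long nowMs) {
        lastFetchResponseMs = nowMs;
    }

    // Condition 2: the current leader asked us to start an election.
    void onEndQuorumEpoch(long nowMs) {
        becomeCandidate(nowMs);
    }

    // Conditions 1 and 3, checked on every pass of the event loop.
    void poll(long nowMs) {
        if (role == Role.FOLLOWER && nowMs - lastFetchResponseMs >= quorumFetchTimeoutMs) {
            becomeCandidate(nowMs); // condition 1: fetch timed out
        } else if (role == Role.CANDIDATE && nowMs - candidateSinceMs >= quorumElectionTimeoutMs) {
            becomeCandidate(nowMs); // condition 3: election timed out, retry in a new epoch
        }
    }

    private void becomeCandidate(long nowMs) {
        role = Role.CANDIDATE;
        epoch++;                  // bump the epoch and send VoteRequests to the voters (omitted)
        candidateSinceMs = nowMs;
    }
}
{code}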
And that is exactly what the current KRaft implementation does.
However, when the leader is isolated by a network partition, it has no way to resign from leadership and start a new election. The old leader remains leader even though it can no longer reach any other node, and this makes the split brain issue possible.
Reading further in KIP-595, I found we did consider this situation and proposed a solution for it. In this section, it says:
In the pull-based model, however, say a new leader has been elected with a new epoch and everyone has learned about it except the old leader (e.g. that leader was not in the voters anymore and hence not receiving the BeginQuorumEpoch as well), then that old leader would not be notified by anyone about the new leader / epoch and become a pure "zombie leader", as there is no regular heartbeats being pushed from leader to the follower. This could lead to stale information being served to the observers and clients inside the cluster.
To resolve this issue, we will piggy-back on the "quorum.fetch.timeout.ms" config, such that if the leader did not receive Fetch requests from a majority of the quorum for that amount of time, it would begin a new election and start sending VoteRequest to voter nodes in the cluster to understand the latest quorum. If it couldn't connect to any known voter, the old leader shall keep starting new elections and bump the epoch.
But we missed this part in the current KRaft implementation.
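For illustration, here is a minimal sketch of the missing check-quorum rule quoted above. CheckQuorumTracker and its methods are hypothetical names, not KRaft internals; it assumes the leader records the last Fetch time per voter.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative check-quorum timer: the leader records when each voter last
 * fetched, and resigns if a majority of the quorum has not fetched within
 * the fetch timeout. A sketch of the rule quoted from KIP-595, not real code.
 */
class CheckQuorumTracker {
    private final Set<Integer> voters;   // all voter ids, including the leader
    private final long fetchTimeoutMs;   // piggy-backs on quorum.fetch.timeout.ms
    private final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();

    CheckQuorumTracker(Set<Integer> voters, long fetchTimeoutMs) {
        this.voters = voters;
        this.fetchTimeoutMs = fetchTimeoutMs;
    }

    // Called by the leader whenever it handles a Fetch request from a voter.
    void recordFetch(int voterId, long nowMs) {
        lastFetchTimeMs.put(voterId, nowMs);
    }

    // Called periodically by the leader; true means it should resign and
    // start a new election to discover the latest quorum state.
    boolean shouldResign(int leaderId, long nowMs) {
        int reachable = 1; // the leader always counts itself
        for (int voter : voters) {
            if (voter == leaderId) continue;
            Long last = lastFetchTimeMs.get(voter);
            if (last != null && nowMs - last < fetchTimeoutMs) reachable++;
        }
        return reachable <= voters.size() / 2; // no majority has fetched recently
    }
}
{code}
The key point is that the leader counts itself toward the majority, so in a 3-node quorum it resigns once neither follower has fetched within the timeout.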
The failure flow is like this (a repro sketch using the Admin client follows the list):
1. There are 3 controller nodes: A (leader), B (follower), C (follower).
2. A network partition occurs between [A] and [B, C].
3. B and C start a new election, since their fetch timeout expires before they receive a fetch response from leader A.
4. B (or C) is elected leader in a new epoch, while A is still the leader in the old epoch.
5. Broker D creates a topic "new", and the update goes to the new leader B.
6. Broker E describes topic "new" but gets nothing, because it is connected to the old leader A.
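To make steps 5 and 6 concrete, here is a hedged repro sketch using the Java Admin client. The addresses broker-d:9092 and broker-e:9092 are hypothetical placeholders for brokers D and E, assumed to be served metadata by leaders B and A respectively.
{code:java}
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SplitBrainRepro {
    public static void main(String[] args) throws Exception {
        // Create the topic through broker D, whose metadata comes from the new leader B.
        Properties viaD = new Properties();
        viaD.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-d:9092"); // hypothetical address
        try (Admin admin = Admin.create(viaD)) {
            admin.createTopics(List.of(new NewTopic("new", 1, (short) 1))).all().get();
        }

        // List topics through broker E, whose metadata comes from the zombie leader A.
        Properties viaE = new Properties();
        viaE.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-e:9092"); // hypothetical address
        try (Admin admin = Admin.create(viaE)) {
            // With the bug present, "new" may be missing here even though the create succeeded.
            System.out.println(admin.listTopics().names().get());
        }
    }
}
{code}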
Issue Links
- is duplicated by KAFKA-13621: Resign leader on network partition (Resolved)