[KAFKA-14077] KRaft should support recovery from failed disk - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 2.8.0
Fix Version/s: None
Component/s: kraft
Labels:
None

Description

If one of the nodes in the metadata quorum has a disk failure, there is no way currently to safely bring the node back into the quorum. When we lose disk state, we are at risk of losing committed data even if the failure only affects a minority of the cluster.

Here is an example. Suppose that a metadata quorum has 3 members: v1, v2, and v3. Initially, v1 is the leader and writes a record at offset 1. After v2 acknowledges replication of the record, it becomes committed. Suppose that v1 fails before v3 has a chance to replicate this record. As long as v1 remains down, the raft protocol guarantees that only v2 can become leader, so the record cannot be lost. The raft protocol expects that when v1 returns, it will still have that record, but what if there is a disk failure, the state cannot be recovered and v1 participates in leader election? Then we would have committed data on a minority of the voters. The main problem here concerns how we recover from this impaired state without risking the loss of this data.

Consider a naive solution which brings v1 back with an empty disk. Since the node has lost is prior knowledge of the state of the quorum, it will vote for any candidate that comes along. If v3 becomes a candidate, then it will vote for itself and it just needs the vote from v1 to become leader. If that happens, then the committed data on v2 will become lost.

This is just one scenario. In general, the invariants that the raft protocol is designed to preserve go out the window when disk state is lost. For example, it is also possible to contrive a scenario where the loss of disk state leads to multiple leaders. There is a good reason why raft requires that any vote cast by a voter is written to disk since otherwise the voter may vote for different candidates in the same epoch.

Many systems solve this problem with a unique identifier which is generated automatically and stored on disk. This identifier is then committed to the raft log. If a disk changes, we would see a new identifier and we can prevent the node from breaking raft invariants. Then recovery from a failed disk requires a quorum reconfiguration. We need something like this in KRaft to make disk recovery possible.

Attachments

Issue Links

is related to

KAFKA-14142 Improve information returned about the cluster metadata partition

Resolved

links to

KIP-853: KRaft Voter Changes

Activity

People

Assignee:: Jose Armando Garcia Sancio

Reporter:: Jason Gustafson

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 14/Jul/22 21:48

Updated:: 14/Oct/23 18:08