Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
This is a continuation of KAFKA-16171.
It turns out that the active KRaftMigrationDriver can get a stale read from ZK after becoming the active controller in ZK (i.e., writing to "/controller").
Because ZooKeeper only guarantees linearizability for writes, it is possible to get a stale read of the "/migration" ZNode even after writing to "/controller" (and "/controller_epoch") when becoming active.
The history looks like this:
- Node B becomes leader in the Raft layer. KRaftLeaderEvents are enqueued on every KRaftMigrationDriver
- Node A writes some state to ZK, updates "/migration", and checks "/controller_epoch" in one transaction. This happens before B claims controller leadership in ZK. The "/migration" state is updated from X to Y
- Node B claims leadership by updating "/controller" and "/controller_epoch". Leader B then reads "/migration" and sees the stale state X rather than Y
- Node A tries to write some state and fails on the "/controller_epoch" check op
- Node A processes the new leader and becomes inactive
This does not violate the consistency guarantees made by ZooKeeper. From the ZooKeeper documentation:
> Write operations in ZooKeeper are linearizable. In other words, each write will appear to take effect atomically at some point between when the client issues the request and receives the corresponding response.
and
> Read operations in ZooKeeper are not linearizable since they can return potentially stale data. This is because a read in ZooKeeper is not a quorum operation and a server will respond immediately to a client that is performing a read.
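For illustration, here is a minimal sketch of that read-after-write pattern using the plain ZooKeeper Java client rather than Kafka's own ZK wrapper; the class and method names are hypothetical and not part of the KRaftMigrationDriver. It shows why a read issued right after claiming leadership can be stale, and how a sync() on the path forces the serving node to catch up with the ZooKeeper leader before the next read:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative only: shows the ZooKeeper read semantics, not Kafka's actual code.
public final class StaleReadSketch {
    static byte[] readMigrationState(ZooKeeper zk)
            throws KeeperException, InterruptedException {
        Stat stat = new Stat();

        // A plain read may be served by a follower that has not yet applied the
        // latest committed writes, so both the data and stat.getVersion() can be stale,
        // even though this client just completed a write to another ZNode.
        byte[] possiblyStale = zk.getData("/migration", false, stat);

        // sync() asks the serving node to catch up with the ZooKeeper leader.
        // It is asynchronous, so wait for the callback before reading again.
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync("/migration", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        // This read reflects every write committed before the sync() completed.
        return zk.getData("/migration", false, stat);
    }
}
```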
---
The impact of this stale read is the same as in KAFKA-16171: the KRaftMigrationDriver never gets past SYNC_KRAFT_TO_ZK because it holds a stale zkVersion for the "/migration" ZNode. As a result, brokers never learn about the new controller and cannot update any partition state.
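To make the failure mode concrete, the sketch below shows the conditional-write pattern described above as a ZooKeeper multi-op; the class, method, and parameter names are hypothetical, not the actual KRaftMigrationDriver code. The transaction checks the "/controller_epoch" version and updates "/migration" with an expected zkVersion, so a stale zkVersion makes the setData op fail with a BadVersion error on every retry, which is why the driver cannot progress out of SYNC_KRAFT_TO_ZK:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: demonstrates the conditional write pattern, not Kafka's code.
public final class MigrationWriteSketch {
    static void syncKraftToZk(ZooKeeper zk,
                              int cachedMigrationZkVersion,
                              int controllerEpochZkVersion) throws InterruptedException {
        byte[] newState = "{...updated migration state...}".getBytes(StandardCharsets.UTF_8);
        try {
            zk.multi(Arrays.asList(
                // Guard: abort the whole transaction if another controller has
                // since bumped "/controller_epoch".
                Op.check("/controller_epoch", controllerEpochZkVersion),
                // Conditional update: only succeeds if "/migration" still has the
                // zkVersion this driver read when it became active.
                Op.setData("/migration", newState, cachedMigrationZkVersion)));
        } catch (KeeperException e) {
            // With a stale cached zkVersion the setData op fails with a BadVersion
            // error; retrying with the same cached version can never succeed.
        }
    }
}
```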
The workaround for this bug is to re-elect the controller by shutting down the active KRaft controller.
This bug was found during a migration where the KRaft controller was rapidly failing over due to an excess of metadata.