Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
There are two ways that partition state can be updated in the zk world: one is through `LeaderAndIsr` requests and one is through `AlterPartition` responses. All changes made to partition state result in new LeaderAndIsr requests, but replicas will ignore them if the leader epoch is less than or equal to the current known leader epoch. Basically it works like this:
- Changes made by the leader are done through AlterPartition requests. These changes bump the partition epoch (or zk version), but leave the leader epoch unchanged. The controller still sends LeaderAndIsr requests for these changes, but replicas ignore them because the leader epoch has not increased. Partition state is instead only updated when the AlterPartition response is received.
- Changes made by the controller are made directly by the controller and always result in a leader epoch bump. These changes are sent to replicas through LeaderAndIsr requests and are applied by replicas.
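The two zk-side update paths can be sketched as a small state model. This is a simplified illustration, not Kafka's actual code: the class `ReplicaPartitionState` and its method names are hypothetical, and it only captures the epoch rules described above.

```java
/**
 * Hypothetical, simplified model of one replica's view of a partition.
 * It only illustrates the epoch rules described above; it is not the
 * real kafka.cluster.Partition.
 */
final class ReplicaPartitionState {
    private int leaderEpoch;
    private int partitionEpoch;

    ReplicaPartitionState(int leaderEpoch, int partitionEpoch) {
        this.leaderEpoch = leaderEpoch;
        this.partitionEpoch = partitionEpoch;
    }

    /**
     * LeaderAndIsr path (controller-initiated changes): the request is
     * ignored unless its leader epoch is strictly greater than the
     * current known leader epoch.
     */
    boolean onLeaderAndIsr(int requestLeaderEpoch, int requestPartitionEpoch) {
        if (requestLeaderEpoch <= leaderEpoch) {
            return false; // no leader epoch bump: ignored
        }
        leaderEpoch = requestLeaderEpoch;
        partitionEpoch = requestPartitionEpoch;
        return true;
    }

    /**
     * AlterPartition response path (leader-initiated changes): the
     * partition epoch moves forward, the leader epoch does not change.
     */
    void onAlterPartitionResponse(int newPartitionEpoch) {
        partitionEpoch = newPartitionEpoch;
    }

    int leaderEpoch() { return leaderEpoch; }

    int partitionEpoch() { return partitionEpoch; }
}
```

In this model, a LeaderAndIsr request that follows a leader-initiated ISR change carries the same leader epoch and is dropped, so the only code path that ever reaches leader-transition logic is one where the leader epoch really has increased.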
The code in `kafka.server.ReplicaManager` and `kafka.cluster.Partition` is built on top of these assumptions. The logic in `makeLeader`, for example, assumes that the leader epoch has indeed been bumped. Specifically, follower state gets reset and a new entry is written to the leader epoch cache.
In KRaft, we also have two paths to update partition state. One is AlterPartition, just like in the zk world. The second is updates received from the metadata log. These follow the same path as LeaderAndIsr requests for the most part, but a big difference is that all changes are sent down to `kafka.cluster.Partition`, even those which do not have a bumped leader epoch. This breaks the assumptions mentioned above in `makeLeader`, which could result in leader epoch cache inconsistency. Another side effect of this on the follower side is that replica fetchers for updated partitions get unnecessarily restarted. There may be others as well.
We need to either replicate in KRaft the same logic used on the zookeeper side (ignoring updates that do not bump the leader epoch) or make the logic robust to all updates, including those without a leader epoch bump.
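As a rough illustration of the second option, the sketch below shows a leader transition that tolerates updates without a leader epoch bump. It is an assumption-laden model rather than the actual fix: `LeaderTransitionSketch`, its fields, and the shape of `makeLeader` here are hypothetical stand-ins for the real logic in `kafka.cluster.Partition`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of a makeLeader that is robust to same-epoch
 * updates. Names and structure are illustrative only.
 */
final class LeaderTransitionSketch {
    private int leaderEpoch = -1;
    private int partitionEpoch = -1;
    private int followerStateResets = 0;
    // Stand-in for the leader epoch cache: one entry per leader epoch.
    private final List<Integer> epochCache = new ArrayList<>();

    /** Returns true only when a new leader epoch was applied. */
    boolean makeLeader(int newLeaderEpoch, int newPartitionEpoch) {
        if (newLeaderEpoch < leaderEpoch) {
            return false; // stale update: ignore entirely
        }
        boolean isNewLeaderEpoch = newLeaderEpoch > leaderEpoch;
        if (isNewLeaderEpoch) {
            // Only a real leader epoch bump resets follower state and
            // writes a new entry to the epoch cache.
            leaderEpoch = newLeaderEpoch;
            followerStateResets++;
            epochCache.add(newLeaderEpoch);
        }
        // Same-epoch updates may still advance the partition epoch.
        partitionEpoch = Math.max(partitionEpoch, newPartitionEpoch);
        return isNewLeaderEpoch;
    }

    int leaderEpoch() { return leaderEpoch; }

    int partitionEpoch() { return partitionEpoch; }

    int followerStateResets() { return followerStateResets; }

    List<Integer> epochCache() { return epochCache; }
}
```

A caller on the follower side could use the same kind of "is this a new leader epoch" result to decide whether replica fetchers need to be restarted at all.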