During an upgrade of the message format, there is a short time during which the configured message format version is not consistent across all replicas of a partition. As long as all brokers are using the same binary version (i.e. all have been updated to the latest code), this typically does not cause any problems. Followers will take whatever message format is used by the leader. However, it is possible for leadership to change several times between brokers which support the new format and those which support the old format. This can cause the version used in the log to flap between the different formats until the upgrade is complete.
For example, suppose broker 1 has been updated to use v2 and broker 2 is still on v1. When broker 1 is the leader, all new messages will be written in the v2 format. When broker 2 is leader, v1 will be used. If there is any instability in the cluster or if completion of the update is delayed, then the log will be seen to switch back and forth between v1 and v2. Once the update is completed and broker 1 begins using v2, then the message format will stabilize and everything will generally be ok.
Downgrades of the message format are problematic, even if they are just temporary. There are basically two issues:
1. We use the configured message format version to tell whether down-conversion is needed. We assume that the this is always the maximum version used in the log, but that assumption fails in the case of a downgrade. In the worst case, old clients will see the new format and likely fail.
2. The logic we use for finding the truncation offset during the become follower transition does not handle flapping between message formats. When the new format is used by the leader, then the epoch cache will be updated correctly. When the old format is in use, the epoch cache won't be updated. This can lead to an incorrect result to OffsetsForLeaderEpoch queries.
We have actually observed the second problem. The scenario went something like this. Broker 1 is the leader of epoch 0 and writes some messages to the log using the v2 message format. Broker 2 then becomes the leader for epoch 1 and writes some messages in the v2 format. On broker 2, the last entry in the epoch cache is epoch 0. No entry is written for epoch 1 because it uses the old format. When broker 1 became a follower, it send an OffsetsForLeaderEpoch query to broker 2 for epoch 0. Since epoch 0 was the last entry in the cache, the log end offset was returned. This resulted in localized log divergence.
There are a few options to fix this problem. From a high level, we can either be stricter about preventing downgrades of the message format, or we can add additional logic to make downgrades safe.
(Disallow downgrades): As an example of the first approach, the leader could always use the maximum of the last version written to the log and the configured message format version.
(Allow downgrades): If we want to allow downgrades, then it make makes sense to invalidate and remove all entries in the epoch cache following the message format downgrade. This would basically force us to revert to truncation to the high watermark, which is what you'd expect from a downgrade. We would also need a solution for the problem of detecting when down-conversion is needed for a fetch request. One option I've been thinking about is enforcing the invariant that each segment uses only one message format version. Whenever the message format changes, we need to roll a new segment. Then we can simply remember which format is in use by each segment to tell whether down-conversion is needed for a given fetch request.