I've been looking at some MM2 scenarios and identified a situation where consumers can miss data in the event of a failover.
When a consumer subscribes to a topic for the first time and commits offsets, the offsets for every existing partition of that topic will be saved to the cluster's __consumer_offsets topic. Even if a partition is completely empty, the offset 0 will still be saved for the consumer's consumer group.
When MM2 replicates the checkpoints to the remote cluster, though, it ignores any checkpoint whose offset is equal to zero, so offsets are replicated only for partitions that contain data.
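The filtering described above can be sketched roughly as follows. This is a simplified model and not MM2's actual code: `replicate_checkpoints` and the dictionary layout are hypothetical names used only for illustration, and offset translation between clusters is assumed to be the identity.

```python
# Simplified model of checkpoint replication; NOT MM2's real implementation.
# Keys are (topic, partition) pairs, values are committed offsets.
# The function name and data layout are hypothetical.

def replicate_checkpoints(source_offsets):
    """Return the offsets that would reach the target cluster.

    Mirrors the behaviour described in this report: entries whose
    committed offset is zero are skipped.
    """
    return {tp: off for tp, off in source_offsets.items() if off != 0}


source_group_offsets = {
    ("orders", 0): 42,  # partition with data; the group has consumed 42 records
    ("orders", 1): 0,   # empty partition; offset 0 is still committed on the source
}

target_group_offsets = replicate_checkpoints(source_group_offsets)
print(target_group_offsets)  # {('orders', 0): 42}
```

The group ends up with no committed offset at all for partition 1 on the target cluster, which is what matters in the scenario below.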
This can lead to a gap in the data consumed by consumers in the following scenario:
- Topic is created on the source cluster.
- MM2 is configured to replicate the topic and consumer groups.
- Producer starts to produce data to the source topic but for some reason some partitions do not get data initially, while others do (skewed keyed messages or bad luck)
- Consumers start to consume data from that topic and their consumer groups' offsets are replicated to the target cluster, but only for partitions that contain data. The consumers are using the default setting auto.offset.reset = latest.
- A consumer failover to the second cluster is performed (for whatever reason), and the offset translation steps are completed. The consumers are not restarted yet.
- The producers continue to produce data to the source cluster topic and now produce data to the partitions that were empty before.
- After the producers start producing data, consumers are started on the target cluster and start consuming.
For the partitions that already had data before the failover, everything works fine. The consumer offsets will have been translated correctly and the consumers will start consuming from the correct position.
For the partitions that were empty before the failover, though, any data written by the producers to those partitions after the failover but before the consumers start will be missed entirely. Because no zero offset is stored for those partitions on the target cluster, the consumers fall back to auto.offset.reset = latest and jump straight to the latest offset when they start.
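A toy simulation of the scenario, to make the gap concrete. All names here are hypothetical, the record counts are made up for illustration, and target offsets are assumed to equal source offsets (identity translation), which is a simplification.

```python
# Toy simulation of the failover scenario; all names and numbers are hypothetical.

def start_position(committed, log_end, auto_offset_reset="latest"):
    """Where a consumer starts reading a partition.

    A committed offset wins; otherwise auto.offset.reset decides.
    """
    if committed is not None:
        return committed
    return log_end if auto_offset_reset == "latest" else 0


# Source cluster before the failover:
# partition 0 has 10 consumed records, partition 1 is empty (offset 0 committed).
source_committed = {0: 10, 1: 0}

# Checkpoints that reach the target: zero offsets are dropped.
target_committed = {p: off for p, off in source_committed.items() if off != 0}

# After the failover, but before the consumers restart, 5 more records are
# written to each partition and mirrored to the target cluster.
target_log_end = {0: 15, 1: 5}

for partition, log_end in sorted(target_log_end.items()):
    start = start_position(target_committed.get(partition), log_end)
    missed = start - source_committed[partition]
    print(f"partition {partition}: starts at {start}, misses {missed} records")
# partition 0: starts at 10, misses 0 records
# partition 1: starts at 5, misses 5 records
```

In this model, had the zero offset been replicated for partition 1, `start_position(0, 5)` would return 0 and nothing would be skipped.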