Details
- Improvement
- Status: Resolved
- Major
- Resolution: Fixed
- 3.6.0
- None
Description
Executive summary: When the MirrorCheckpointTask restarts, it loses the state of checkpointsPerConsumerGroup, which limits offset translation to records mirrored after the latest restart.
For example, if 1000 records are mirrored and the OffsetSyncs are read by MirrorCheckpointTask, the emitted checkpoints are cached, and translation can happen at the ~500th record. If MirrorCheckpointTask restarts, and 1000 more records are mirrored, translation can happen at the ~1500th record, but no longer at the ~500th record.
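The scenario above can be reproduced with a toy model. The sketch below is not the real OffsetSyncStore: the class `LifetimeSyncStore` and its methods are hypothetical, and it assumes a simplified rule in which syncs read before initialization collapse to the latest one, while syncs read afterwards are all retained. It also returns the downstream offset of the matching sync directly, which glosses over the exact arithmetic MM2 uses.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical toy model for illustration only; not the MM2 implementation.
public class LifetimeSyncStore {
    // Retained syncs: upstream offset -> downstream offset.
    private final TreeMap<Long, Long> syncs = new TreeMap<>();
    private boolean initialized = false;

    // Called for every offset sync read from the syncs topic.
    public void onSync(long upstream, long downstream) {
        if (!initialized) {
            // Assumption: history replayed during startup collapses to the latest sync.
            syncs.clear();
        }
        syncs.put(upstream, downstream);
    }

    // Called once the store has caught up to the end of the syncs topic.
    public void markInitialized() {
        initialized = true;
    }

    // Translate using the closest retained sync at or before the upstream offset.
    public OptionalLong translate(long upstreamOffset) {
        if (!initialized) {
            return OptionalLong.empty();
        }
        Map.Entry<Long, Long> entry = syncs.floorEntry(upstreamOffset);
        return entry == null ? OptionalLong.empty() : OptionalLong.of(entry.getValue());
    }
}
```

In this model, a store that saw the syncs at offsets 0, 500, and 1000 after initialization can translate offset 500. A restarted store that replays those three syncs before initialization, and then reads syncs at 1500 and 2000, can translate offset 1500 but has no retained sync at or before 500.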
Context:
Before KAFKA-13659, MM2 made translation decisions based on the incompletely-initialized OffsetSyncStore, and the checkpoint could appear to go backwards temporarily during restarts. To fix this, we forced the OffsetSyncStore to initialize completely before translation could take place, ensuring that the latest OffsetSync had been read, and thus providing the most accurate translation.
Before KAFKA-14666, MM2 translated offsets using only the latest OffsetSync. Afterwards, an in-memory sparse cache of historical OffsetSyncs was kept, to allow translation of earlier offsets. This came with the caveat that the cache's sparseness allowed translations to go backwards permanently. To prevent this behavior, a cache of the latest Checkpoints was kept in the MirrorCheckpointTask#checkpointsPerConsumerGroup variable, and offset translation remained restricted to the fully-initialized OffsetSyncStore.
Effectively, the MirrorCheckpointTask translates only from OffsetSyncs emitted during its own lifetime, which guarantees that no previous MirrorCheckpointTask emitted a checkpoint based on a later sync. If we can read the checkpoints emitted by previous generations of MirrorCheckpointTask, we can still ensure that checkpoints are monotonic, while allowing translation further back in history.
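A minimal sketch of that idea follows, assuming the previously emitted checkpoints can be loaded into a map at startup. The class `MonotonicCheckpoints`, its methods, and the string key format are hypothetical and are not taken from MirrorCheckpointTask; the sketch only shows how a restored cache lets a new task reject checkpoints that would rewind a consumer group.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical illustration; not the actual MirrorCheckpointTask code.
public class MonotonicCheckpoints {
    // Last emitted downstream offset per "group/topic-partition" key.
    private final Map<String, Long> lastEmitted = new HashMap<>();

    // Seed the cache with checkpoints emitted by earlier task generations.
    public void restore(Map<String, Long> previouslyEmitted) {
        lastEmitted.putAll(previouslyEmitted);
    }

    // Emit the candidate only if it advances past the last emitted checkpoint.
    public OptionalLong maybeEmit(String key, long candidateDownstreamOffset) {
        Long previous = lastEmitted.get(key);
        if (previous != null && candidateDownstreamOffset <= previous) {
            // Emitting this would repeat or rewind the checkpoint, so skip it.
            return OptionalLong.empty();
        }
        lastEmitted.put(key, candidateDownstreamOffset);
        return OptionalLong.of(candidateDownstreamOffset);
    }
}
```

With the cache restored, a translation derived from an older sync can be compared against what earlier generations already emitted, so the task no longer needs to discard pre-restart syncs to stay monotonic.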
Issue Links
- is related to
  - KAFKA-12468 Initial offsets are copied from source to target cluster (Resolved)
  - KAFKA-14666 MM2 should translate consumer group offsets behind replication flow (Resolved)
  - KAFKA-13659 MM2 should read all offset syncs at start up (Resolved)
- relates to
  - KAFKA-16364 MM2 High-Resolution Offset Translation (Open)