We occasionally see unexpected offset reset errors, as well as occasional replays of messages without an offset reset error.
The cause seems to be a failed commit on rebalance, leaving an old value in the hashMap used to store the latest processed offset for a partition. This old value is then re-read and re-committed across rebalances in certain situations.
Our relevant configuration details are:
It seems that when the KafkaConsumer hits an exception (CommitFailedException) while committing the offset during a rebalance, the old offset value is left in the lastProcessedOffset hashMap.
If a later rebalance assigns the same partition back to the same consumer, and yet another rebalance follows before that consumer has successfully processed any message from the partition, the old offset is committed again. (Successfully processing a message would overwrite the invalid value and self-correct the problem.) The offset may be very old if many rebalances have occurred since the original rebalance that failed to commit its offset.
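The sequence can be reproduced in a small simulation. This is only a sketch of the suspected mechanism, not the component's actual code: no Kafka client is involved, every name other than lastProcessedOffset is hypothetical, and it assumes the map entry is cleared only when the commit succeeds. For simplicity it commits the stored offset as-is rather than the next offset to read.

```java
import java.util.HashMap;
import java.util.Map;

// Simulation only: no Kafka client is involved. Every name except
// lastProcessedOffset is hypothetical.
final class StaleOffsetSketch {
    // partition -> last processed offset, mirroring the map named in the report
    final Map<Integer, Long> lastProcessedOffset = new HashMap<>();
    // stand-in for the offsets committed to the broker
    final Map<Integer, Long> committed = new HashMap<>();

    void process(int partition, long offset) {
        lastProcessedOffset.put(partition, offset);
    }

    // Assumed rebalance handling: commit the stored offset and clear the
    // entry only if the commit succeeds.
    void onRevoked(int partition, boolean commitFails) {
        Long offset = lastProcessedOffset.get(partition);
        if (offset == null) {
            return; // nothing processed, nothing to commit
        }
        if (commitFails) {
            return; // stands in for CommitFailedException; the entry stays behind
        }
        committed.put(partition, offset);
        lastProcessedOffset.remove(partition);
    }

    // Replays the sequence from the report and returns the final committed offset.
    static long demo() {
        StaleOffsetSketch consumer = new StaleOffsetSketch();
        consumer.process(0, 100L);
        consumer.onRevoked(0, true);       // commit fails, 100 stays in the map
        consumer.committed.put(0, 5000L);  // other consumers make progress meanwhile
        // partition 0 is assigned back, then revoked before any message is processed
        consumer.onRevoked(0, false);      // the stale 100 is committed again
        return consumer.committed.get(0);
    }
}
```

In this sketch `StaleOffsetSketch.demo()` returns 100: the committed offset for partition 0 is rewound from 5000 to the stale value.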
If the old offset is beyond the retention period and the message is no longer available, the outcome is an offset reset error. If the offset is within the retention period, all messages are replayed from that offset without an error being thrown.
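The two outcomes can be expressed as a comparison between the stale committed offset and the range of offsets the partition still holds. The following is a hedged sketch with hypothetical names (`StaleCommitOutcome`, `classify`, `logStartOffset`, `logEndOffset`); it models the broker's behaviour in plain Java and is not taken from our code or from the Kafka client.

```java
// Hypothetical helper: classifies what a consumer would observe after a
// stale offset has been committed for a partition.
final class StaleCommitOutcome {
    enum Outcome { OFFSET_RESET, REPLAY, NONE }

    // logStartOffset: oldest offset still retained by the partition
    // logEndOffset:   offset that the next produced message would receive
    static Outcome classify(long staleOffset, long logStartOffset, long logEndOffset) {
        if (staleOffset < logStartOffset || staleOffset > logEndOffset) {
            // the offset is out of range, so the reset policy takes over
            return Outcome.OFFSET_RESET;
        }
        if (staleOffset == logEndOffset) {
            return Outcome.NONE; // nothing between the offset and the log end
        }
        // still retained: everything from staleOffset onward is consumed again
        return Outcome.REPLAY;
    }
}
```

For example, with retained offsets 4000 to 6000, a stale offset of 100 is classified as `OFFSET_RESET`, while a stale offset of 4500 is classified as `REPLAY`.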