On a cluster of 6 nodes, we recently saw a case where a single under-replicated partition suddenly appeared, replication lag spiked, and network I/O spiked. The cluster eventually appeared to recover on its own.
Looking at the logs, the failure was confined to partition 7 of the topic background_queue. Its ISR was [1,4,3] and its leader at the time was node 3. Here are the interesting log lines:
On node 3 (the leader):
Note that both replicas suddenly asked for an offset beyond the range of offsets available on the leader.
And on nodes 1 and 4 (the replicas), many occurrences of the following:
Based on my reading, it looks like the replicas somehow got ahead of the leader, asked for an invalid offset, got an out-of-range error, and re-replicated the entire topic from scratch to recover. This matches our network graphs, which show node 3 sending a large amount of data to nodes 1 and 4.
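To make sure I'm describing the behavior I suspect clearly, here is a minimal sketch (not actual Kafka source; all class and method names are made up for illustration) of a follower that, on an out-of-range fetch, throws away its local log and re-fetches everything from the leader's start offset. A recovery path like this would explain the burst of traffic from the leader to both replicas:

```python
class OffsetOutOfRange(Exception):
    """Raised when a fetch asks for an offset past the leader's log end."""


class Leader:
    def __init__(self, log):
        self.log = log  # list of records; list index == offset

    def fetch(self, offset):
        if offset > len(self.log):
            raise OffsetOutOfRange(offset)
        return self.log[offset:]

    def log_start_offset(self):
        return 0


class Follower:
    def __init__(self, log):
        self.log = log
        self.fetch_offset = len(self.log)  # next offset to request

    def replicate_from(self, leader):
        try:
            records = leader.fetch(self.fetch_offset)
        except OffsetOutOfRange:
            # The follower is "ahead" of the leader: discard the local
            # log and restart replication from the leader's start
            # offset, re-fetching the entire partition.
            self.log = []
            self.fetch_offset = leader.log_start_offset()
            records = leader.fetch(self.fetch_offset)
        self.log.extend(records)
        self.fetch_offset = len(self.log)


# The follower has somehow ended up ahead of the leader
# (next fetch offset 12, but the leader's log ends at 10):
leader = Leader(log=[f"m{i}" for i in range(10)])
follower = Follower(log=[f"m{i}" for i in range(12)])
follower.replicate_from(leader)
assert follower.log == leader.log  # full re-replication, heavy network I/O
```

Again, this is just my mental model of the recovery, inferred from the logs and the network graphs, not a claim about how the broker actually implements it.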
Taking a stab in the dark at the cause: could there be a race condition where replicas learn about a new offset before the leader has committed it and is ready to serve fetches for it?