Affects Version/s: None
Fix Version/s: None
Let's suppose the following starting point:
- 1 topic
- 1 partition
- 1 reader
- 24h retention period
- leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x reader + 1x slack = total outbound)
In this scenario, when replica fails and needs to be brought back from scratch, you can catch up at 2x inbound bandwidth (1x regular replication + 1x slack used).
2x catch-up speed means replica will be at the point where leader is now in 24h / 2x = 12h. However, in 12h the oldest 12h of the topic will fall out of retention cliff and will be deleted. There's absolutely no use for this data, it will never be read from the replica in any scenario. And this not even including the fact that we still need to replicate 12h more of data that accumulated since the time we started.
My suggestion is to refill sufficiently out of sync replicas backwards from the tip: newest segments first, oldest segments last. Then we can stop when we hit retention cliff and replicate far less data. The lower the ratio of catch-up bandwidth to inbound bandwidth, the higher the returns would be. This will also set a hard cap on retention time: it will be no higher than retention period if catch-up speed if >1x (if it's less, you're forever out of ISR anyway).
What exactly "sufficiently out of sync" means in terms of lag is a topic for a debate. The default segment size is 1GiB, I'd say that being >1 full segments behind probably warrants this.
As of now, the solution for slow recovery appears to be to reduce retention to speed up recovery, which doesn't seem very friendly.