Affects Version/s: None
Fix Version/s: 2.1.0
Currently a replica will be moved out of ISR if follower has not fetched from leader for 10 sec (default replica.lag.time.max.ms). This cases problem in the following scenario:
Say follower's ReplicaFetchThread needs to fetch 2k partitions from the leader broker. Only 100 out of 2k partitions are actively being produced to and therefore the total bytes in rate for those 2k partitions are small. The following will happen:
1) The follower's ReplicaFetcherThread sends FetchRequest for those 2k partitions.
2) Because the total bytes-in-rate for those 2k partitions is very small, follower is able to catch up and leader broker adds these 2k partitions to ISR. Follower's lastCaughtUpTimeMs for all partitions are updated to the current time T0.
3) Since follower has caught up for all 2k partitions, leader updates 2k partition znodes to include the follower in the ISR. It may take 20 seconds to write 2k partition znodes if each znode write operation takes 10 ms.
4) At T0 + 15, maybeShrinkIsr() is invoked on leader broker. Since there is no FetchRequet from the follower for more than 10 seconds after T0, all those 2k partitions will be considered as out of syn and the follower will be removed from ISR.
5) The follower receives FetchResponse at least 20 seconds after T0. That means the next FetchRequest from follower to leader will be after T0 + 20.
The sequence of events described above will loop over time. There will be constant churn of URP in the cluster even if follower can catch up with leader's byte-in-rate. This reduces the cluster availability.
In order to address this problem, one simple approach is to keep follower in the ISR as long as follower's LEO equals leader's LEO regardless of follower's lastCaughtUpTimeMs. This is particularly useful if there are a lot of inactive partitions in the cluster.