Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Just filing a ticket to list some thoughts I had on optimizing https://issues.apache.org/jira/browse/KAFKA-12486.
The idea here is to trigger a rebalance upon detecting that some task is corrupted. That task may have had a large amount of state that had to be wiped out under eos, so we might be able to avoid a long downtime due to restoration if we can use the HA TaskAssignor to temporarily move the active task to another node that already has some state for it (e.g. one that hosted a standby task for it).
Right now, we trigger that rebalance under the condition that (a) eos is enabled, and (b) at least one of the corrupted tasks was an active task. This is a pretty safe bet, but it's worth jotting down some potential optimizations of this condition so we can trim down the occurrences of unnecessary rebalances that wouldn't have helped. For example:
1) Don't kick off a rebalance if the corrupted task is in CREATED or RESTORING and is not within acceptable.recovery.lag of the end of the changelog. If the task wasn't caught up on this host but was assigned to it anyway, that indicates there wasn't any other host with enough state for this task, and therefore no one to temporarily take it over.
2) Only trigger a rebalance if standbys are configured, and/or parse the standby host info to verify whether this task has a standby copy on another live client. It's still possible to have a copy of this task's state on another host even without standbys, but the odds are greatly reduced.
3) If we want to get really fancy (and I'm not quite sure we do), we could have the assignor report not just the names but also the lag of each standby task on another host, and then trigger the rebalance only if this task has a hot standby within the acceptable.recovery.lag.
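To make the comparison concrete, here is a rough sketch of the current condition next to one that layers on the three ideas above. It is not taken from the Streams codebase: the class, the CorruptedTask fields, the UNKNOWN_LAG sentinel and both method names are made up for illustration, and it assumes we can somehow learn the task's lag before the wipe and (for idea 3) the lag of the best standby on another live client. Idea 2 is simplified to a plain "are standbys configured" check.

```java
import java.util.List;

final class CorruptionRebalanceSketch {

    // Hypothetical subset of task states, named after the ones mentioned in the ticket.
    enum State { CREATED, RESTORING, RUNNING }

    // Sentinel meaning "the assignor did not report a standby lag for this task" (assumption).
    static final long UNKNOWN_LAG = -1L;

    // Hypothetical summary of one corrupted task; none of these fields exist under these names in Streams.
    static final class CorruptedTask {
        final boolean active;
        final State state;
        final long lagBeforeWipe;        // offsets behind the changelog end before the state was wiped
        final long bestRemoteStandbyLag; // lag of the most caught-up standby elsewhere, or UNKNOWN_LAG

        CorruptedTask(boolean active, State state, long lagBeforeWipe, long bestRemoteStandbyLag) {
            this.active = active;
            this.state = state;
            this.lagBeforeWipe = lagBeforeWipe;
            this.bestRemoteStandbyLag = bestRemoteStandbyLag;
        }
    }

    // Condition as described in the ticket: eos enabled and at least one corrupted active task.
    static boolean currentCondition(boolean eosEnabled, List<CorruptedTask> corrupted) {
        return eosEnabled && corrupted.stream().anyMatch(t -> t.active);
    }

    // Same condition, narrowed by ideas 1-3. Rebalance if any corrupted active task passes all checks.
    static boolean refinedCondition(boolean eosEnabled,
                                    int numStandbyReplicas,
                                    long acceptableRecoveryLag,
                                    List<CorruptedTask> corrupted) {
        if (!eosEnabled) {
            return false;
        }
        for (CorruptedTask t : corrupted) {
            if (!t.active) {
                continue;
            }
            // Idea 1: a task that was still far behind here was likely assigned only
            // because no other host had enough state, so nobody can take it over.
            boolean stillCatchingUp = t.state == State.CREATED || t.state == State.RESTORING;
            if (stillCatchingUp && t.lagBeforeWipe > acceptableRecoveryLag) {
                continue;
            }
            // Idea 2 (simplified): without standbys, a warm copy elsewhere is unlikely.
            if (numStandbyReplicas == 0) {
                continue;
            }
            // Idea 3: if a standby lag was reported, require it to be within the acceptable lag.
            // If it is unknown, fall through and rebalance, matching ideas 1 and 2 alone.
            if (t.bestRemoteStandbyLag != UNKNOWN_LAG && t.bestRemoteStandbyLag > acceptableRecoveryLag) {
                continue;
            }
            return true;
        }
        return false;
    }
}
```

With this shape, a corrupted active task that was RESTORING and far behind the changelog end still satisfies the current condition but no longer triggers a rebalance under the refined one. Treating an unknown standby lag as "go ahead and rebalance" keeps idea 3 strictly optional.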