Kafka / KAFKA-13333

Optimize condition for triggering rebalance after wiping out corrupted task

Details

    Description

      Just filing a ticket to list some thoughts I had on optimizing https://issues.apache.org/jira/browse/KAFKA-12486

      The idea here is to trigger a rebalance upon detecting corruption of some task. This task may have had a large amount of state that had to be wiped out under EOS, so we might be able to avoid a long downtime due to restoration if we can use the HA TaskAssignor to temporarily move that active task to another node that already has some state for it (e.g. one that had a standby task for it).

      Right now, we trigger that rebalance under the condition that (a) EOS is enabled, and (b) at least one of the corrupted tasks was an active task. This is a pretty safe bet, but it's worth jotting down some potential optimizations of this condition so we can cut down on unnecessary rebalances that wouldn't have helped. For example:

      1) Don't kick off a rebalance if the corrupted task is in CREATED or RESTORING and is not within the acceptable.recovery.lag of the end of the changelog. If the task wasn't caught up on this host but was assigned to it anyway, that indicates there was no other host with enough state for this task, and therefore no one to temporarily take it over.

      2) Only trigger a rebalance if standbys are configured, and/or parse the standby host info to verify whether this task has a standby copy on another live client. It's still possible to have a copy of this task's state on another host even without standbys, but the odds are greatly reduced.

      3) If we want to get really fancy (and I'm not quite sure we do), we could have the assignor report not just the names but also the lag of each standby task on another host, and then trigger the rebalance depending on whether this task has a hot standby within the acceptable.recovery.lag.
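
      To make the three ideas above concrete, here is a rough sketch of how the current condition and a tightened one might compare. Everything in it is hypothetical: CorruptedTaskRebalancePolicy, CorruptedTask, TaskState, NO_STANDBY and the requireHotStandby flag are illustration-only names and not existing Kafka Streams classes, and it assumes the per-task changelog lag and the remote standby lag are somehow available to the caller (the latter is exactly what idea 3 would require the assignor to report). It also reads the "and/or" in idea 2 as "and", which is only one possible interpretation.

```java
import java.util.List;

// Hypothetical sketch only: none of these types exist in Kafka Streams.
final class CorruptedTaskRebalancePolicy {

    enum TaskState { CREATED, RESTORING, RUNNING }

    // Sentinel meaning "no standby copy known on any other live client".
    static final long NO_STANDBY = -1L;

    // Minimal, assumed view of a corrupted task at the time corruption is detected.
    static final class CorruptedTask {
        final boolean active;
        final TaskState state;
        final long changelogLag;      // offsets behind the changelog end before the wipe
        final long remoteStandbyLag;  // lag of the best remote standby, or NO_STANDBY

        CorruptedTask(boolean active, TaskState state, long changelogLag, long remoteStandbyLag) {
            this.active = active;
            this.state = state;
            this.changelogLag = changelogLag;
            this.remoteStandbyLag = remoteStandbyLag;
        }
    }

    // Condition as described in the ticket: EOS enabled and at least one corrupted active task.
    static boolean currentCondition(boolean eosEnabled, List<CorruptedTask> corrupted) {
        if (!eosEnabled) {
            return false;
        }
        for (CorruptedTask task : corrupted) {
            if (task.active) {
                return true;
            }
        }
        return false;
    }

    // Tightened condition combining ideas 1 and 2, with idea 3 behind requireHotStandby.
    static boolean optimizedCondition(boolean eosEnabled,
                                      int numStandbyReplicas,
                                      long acceptableRecoveryLag,
                                      boolean requireHotStandby,
                                      List<CorruptedTask> corrupted) {
        // Idea 2 (first half): without configured standbys a remote copy is unlikely.
        if (!eosEnabled || numStandbyReplicas <= 0) {
            return false;
        }
        for (CorruptedTask task : corrupted) {
            if (worthMoving(task, acceptableRecoveryLag, requireHotStandby)) {
                return true;
            }
        }
        return false;
    }

    private static boolean worthMoving(CorruptedTask task,
                                       long acceptableRecoveryLag,
                                       boolean requireHotStandby) {
        if (!task.active) {
            return false;
        }
        // Idea 1: a task that was not caught up here was probably not caught up anywhere.
        boolean stillCatchingUp =
            (task.state == TaskState.CREATED || task.state == TaskState.RESTORING)
                && task.changelogLag > acceptableRecoveryLag;
        if (stillCatchingUp) {
            return false;
        }
        // Idea 2 (second half): require a standby copy on another live client.
        if (task.remoteStandbyLag == NO_STANDBY) {
            return false;
        }
        // Idea 3: optionally require that standby to be within acceptable.recovery.lag.
        return !requireHotStandby || task.remoteStandbyLag <= acceptableRecoveryLag;
    }
}
```

      With this sketch, a RESTORING active task that is far behind and has no remote standby would still trigger a rebalance under currentCondition but not under optimizedCondition, which is the kind of unnecessary rebalance the ticket is hoping to trim.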

          People

            Assignee: Unassigned
            Reporter: A. Sophie Blee-Goldman (ableegoldman)