  1. Kafka
  2. KAFKA-9178

restoredPartitions is not cleared until the last restoring task completes



    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • 2.4.0
    • 2.4.0
    • None


      We check that the `active` set is empty during closeLostTasks(). However, we currently don't properly clear the restoredPartitions set in some edge cases:

      We only remove partitions from restoredPartitions in two cases: a) when all tasks are done restoring, at which point we clear it entirely (in AssignedStreamTasks#updateRestored), or b) one task at a time, when a task that is still restoring is closed.

      Say a rebalance occurs while some partitions are still restoring and others have completed and transitioned to running. The still-restoring tasks are all revoked and closed immediately, and their partitions are removed from restoredPartitions. We also suspend and revoke some running tasks that have finished restoring, and remove them from running/runningByPartition.

      Now we have only running tasks left, so in TaskManager#updateNewAndRestoringTasks we never even call AssignedStreamTasks#updateRestored and therefore never get to clear restoredPartitions. We then close each of the currently running tasks and remove their partitions from everything, but we never removed or cleared the partitions of the running tasks that we revoked previously.
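
      The sequence above can be reproduced in a toy model. This is only a sketch: the class, its fields, and its method bodies are hypothetical stand-ins written for illustration, not the real AssignedStreamTasks or TaskManager code, and the task ids and partition names are made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical, heavily simplified model of the bookkeeping described in this issue.
class RestoredPartitionsLeakSketch {
    final Map<String, Set<String>> restoring = new HashMap<>(); // task id -> partitions
    final Map<String, Set<String>> running = new HashMap<>();   // task id -> partitions
    final Set<String> restoredPartitions = new HashSet<>();

    void addRestoringTask(String taskId, String... partitions) {
        restoring.put(taskId, new HashSet<>(Arrays.asList(partitions)));
    }

    // Case (a): record restored partitions, promote finished tasks,
    // and clear the whole set only once nothing is restoring anymore.
    void updateRestored(Collection<String> restored) {
        restoredPartitions.addAll(restored);
        Iterator<Map.Entry<String, Set<String>>> it = restoring.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            if (restoredPartitions.containsAll(entry.getValue())) {
                running.put(entry.getKey(), entry.getValue());
                it.remove();
            }
        }
        if (restoring.isEmpty()) {
            restoredPartitions.clear();
        }
    }

    // Assumption: updateRestored is skipped when no task is restoring.
    void updateNewAndRestoringTasks(Collection<String> restored) {
        if (!restoring.isEmpty()) {
            updateRestored(restored);
        }
    }

    // Case (b): closing a restoring task removes only that task's partitions.
    void closeRestoringTask(String taskId) {
        Set<String> partitions = restoring.remove(taskId);
        if (partitions != null) {
            restoredPartitions.removeAll(partitions);
        }
    }

    // Behavior described in this issue: restoredPartitions is left untouched.
    void suspendRunningTask(String taskId) {
        running.remove(taskId);
    }

    static RestoredPartitionsLeakSketch runScenario() {
        RestoredPartitionsLeakSketch tasks = new RestoredPartitionsLeakSketch();
        tasks.addRestoringTask("0_0", "topic-0");
        tasks.addRestoringTask("0_1", "topic-1");
        tasks.updateNewAndRestoringTasks(Arrays.asList("topic-0")); // 0_0 starts running
        tasks.closeRestoringTask("0_1");  // rebalance: still-restoring task revoked
        tasks.suspendRunningTask("0_0");  // rebalance: running task revoked
        tasks.updateNewAndRestoringTasks(Collections.<String>emptyList()); // skipped
        return tasks;
    }

    public static void main(String[] args) {
        RestoredPartitionsLeakSketch tasks = runScenario();
        System.out.println("running = " + tasks.running.keySet());
        System.out.println("restoredPartitions = " + tasks.restoredPartitions);
    }
}
```

      In this model no task is left, yet restoredPartitions still contains topic-0, which is the stale state the issue describes.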

      It turns out we can't just rely on removing from restoredPartitions upon completion, since the partitions will just be added back to it during the next loop (blocked by KAFKA-9177). For now, we should also remove partitions from restoredPartitions when closing or suspending running tasks.
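
      A minimal sketch of the proposed interim fix, under the assumption that running tasks are tracked in a map from task id to partitions. The class and method names here are hypothetical and do not reflect the actual patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model: only the cleanup step on close/suspend is shown.
class RunningTaskCleanupSketch {
    final Map<String, Set<String>> running = new HashMap<>(); // task id -> partitions
    final Set<String> restoredPartitions = new HashSet<>();

    // Closing or suspending a running task also drops its partitions
    // from restoredPartitions, so no stale entries survive a rebalance.
    void closeOrSuspendRunningTask(String taskId) {
        Set<String> partitions = running.remove(taskId);
        if (partitions != null) {
            restoredPartitions.removeAll(partitions);
        }
    }

    static RunningTaskCleanupSketch runScenario() {
        RunningTaskCleanupSketch tasks = new RunningTaskCleanupSketch();
        tasks.running.put("0_0", new HashSet<>(Collections.singletonList("topic-0")));
        tasks.restoredPartitions.add("topic-0"); // restored by the running task 0_0
        tasks.restoredPartitions.add("topic-1"); // restored partition of some other task
        tasks.closeOrSuspendRunningTask("0_0");
        return tasks;
    }

    public static void main(String[] args) {
        RunningTaskCleanupSketch tasks = runScenario();
        System.out.println("restoredPartitions = " + tasks.restoredPartitions);
    }
}
```

      Only the partitions of the closed task are removed; the entry for topic-1 stays, since it belongs to a task that this call did not touch.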




            bchen225242 Boyang Chen
            bchen225242 Boyang Chen