Description
Along KIP-429 we did a lot of refactoring of the task management classes, including the TaskManager and AssignedTasks (and children). While hopefully easier to reason about there's still significant opportunity for further cleanup including safer state tracking. Some potential improvements:
1) Verify that no tasks are ever in more than one state at once. One possibility is to just check that the suspended, created, restoring, and running maps are all disjoint, but this begs the question of when and where to do those checks, and how often. Another idea might be to put all tasks into a single map and just track their state on a per-task basis. Whatever it is should be aware that some methods are on the critical code path, and should not be burdened with excessive safety checks (ie AssignedStreamTasks#process). Alternatively, it seems to make sense to just make each state its own type. We can then do some cleanup of the AbstractTask and StreamTask classes, which currently contain a number of methods specific to only one type/state of task. For example
- only active running tasks ever need to be suspendable, yet every task does through suspend then closeSuspended during close.
- as the name suggests, closeSuspended should technically only ever apply to suspended tasks
- the code paths needed to perform certain actions such as closing or committing a task vary widely between the different states. A restoring task need only close its state manager, but skipping the task.close call and calling only closeStateManager has lead to confusion and time wasted trying to remember or ask someone why that is sufficient
2) Cleanup of closing and/or shutdown logic – there are some potential improvements to be made here as well, for example AssignedTasks currently implements a closeZombieTask method despite the fact that standby tasks are never zombies.
3) The StoreChangelogReader also interacts with (only) the AssignedStreamsTasks class, through the TaskManager. It can be difficult to reason about these interactions and the state of the changelog reader.
4) All 4 classes and their state have very strict consistency requirements that currently are almost impossible to verify, which has already resulted in several bugs that we were lucky to catch in time. We should tighten up how these classes manage their own state, and how the overall state is managed between them, so that it is easy to make changes without introducing new bugs because one class updated its own state without knowing it needed to tell another class to also update its
Attachments
Issue Links
- contains
-
KAFKA-9552 Stream should handle OutOfSequence exception thrown from Producer
- Resolved
- links to