Details
-
Sub-task
-
Status: Open
-
Major
-
Resolution: Unresolved
-
None
-
None
-
None
Description
Once we have KAFKA-12740, we can close the loop on EOS by checkpointing not only those state stores which are attached to processors that the record has successfully passed, but also any remaining state stores further downstream in the subtopology that aren't connected to the processor where the error occurred.
At this point, outside of a hard crash (eg process is killed) or dropping out of the group, we’ll only ever need to restore state stores from scratch if the exception came from the specific processor node they’re attached to. Which is pretty darn cool.
Note: we may need to first do some follow-up work to KAFKA-12740, depending on where we land on the open question in that ticket: whether to just disable the partial-topology commit for EOS or fully implement the logic to only perform the partial-commit iff the task remains assigned to that same client. If we end up just doing the former in KAFKA-12740 then we'll need to implement the latter before enabling this for EOS, and as a prerequisite to the work in this ticket