[KAFKA-12740] 3. Resume processing from last-cleared processor after soft crash - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: streams
Labels: None

Description

Building off of that, we can go one step further and avoid duplicate work within the subtopology itself. Whenever a record is only partially processed through a subtopology before hitting an error, every processor it already passed through is applied again when the record is reprocessed. If we keep track of how far along the subtopology a record got, we can resume reprocessing from the last processor it cleared before the error. We’ll also need to make sure that everything up to that point in the subtopology is committed and/or flushed.
This proposal is the one most likely to benefit from letting a StreamThread recover after an unexpected exception, rather than letting it die and starting up a new one, since we would not need to worry about handing anything off from the dying thread to its replacement.
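
To make the idea concrete, here is a minimal sketch of tracking the last-cleared processor and resuming from it. It is only an illustration under stated assumptions: `ResumableSubtopology`, its fields, and its methods are hypothetical names and not part of Kafka Streams, processors are modelled as plain string functions, the "checkpoint" is an in-memory field standing in for whatever commit/flush would really be required, and the caller is assumed to retry the same record on the same thread.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names exist in Kafka Streams.
final class ResumableSubtopology {
    private final List<UnaryOperator<String>> processors;
    // How many times each processor has been invoked (for illustration only).
    private final int[] invocations;
    // Index of the first processor the in-flight record has NOT yet cleared.
    private int nextProcessor = 0;
    // Output of the last cleared processor; stands in for committed/flushed state.
    private String checkpointedValue = null;

    ResumableSubtopology(List<UnaryOperator<String>> processors) {
        this.processors = processors;
        this.invocations = new int[processors.size()];
    }

    // Processes the record, or resumes it if a previous attempt failed part-way.
    // Assumes a retry always passes the same record that failed.
    String process(String record) {
        String value = (nextProcessor == 0) ? record : checkpointedValue;
        for (int i = nextProcessor; i < processors.size(); i++) {
            invocations[i]++;
            value = processors.get(i).apply(value); // may throw: progress is kept
            // Processor i cleared: record the progress before moving on.
            checkpointedValue = value;
            nextProcessor = i + 1;
        }
        // Record fully processed: reset so the next record starts at the source.
        nextProcessor = 0;
        checkpointedValue = null;
        return value;
    }

    int nextProcessor() { return nextProcessor; }

    int[] invocations() { return invocations.clone(); }
}
```

With three processors where the second one throws on its first invocation, the first call to `process` fails with `nextProcessor` left at 1, and the retry re-runs only the second and third processors; the first processor is never applied a second time. Because the progress lives in the same object across the failure, this sketch also shows why recovering the existing thread is simpler than handing the progress marker to a replacement.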

Note: we should consider whether to (A) allow this only if we can be sure the task is reassigned or still assigned to the same client (i.e. the user does not select SHUTDOWN_CLIENT), or (B) allow it for ALOS and skip it under EOS until we can implement (A), at which point we can enable it for EOS as well. Also note that disabling the partial-topology commit unless the task remains assigned to the same client will further improve ALOS semantics, especially if the same mechanism lets us actively prevent those partial updates from being committed to the changelog unless we want them to be.
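
The two options in the note can be read as a simple gating rule. The sketch below is one possible encoding, not a proposed implementation: `PartialResumePolicy`, `Variant`, `Guarantee`, and `mayResume` are hypothetical names, and `taskStaysOnSameClient` is assumed to be something the runtime could determine (for example, false when the user selects SHUTDOWN_CLIENT).

```java
// Hypothetical sketch: these types are illustrative and not Kafka Streams API.
final class PartialResumePolicy {
    enum Guarantee { AT_LEAST_ONCE, EXACTLY_ONCE }

    // (A) resume only when the task stays on the same client;
    // (B) resume under ALOS, skip under EOS.
    enum Variant { A_SAME_CLIENT_ONLY, B_ALOS_ONLY }

    static boolean mayResume(Variant variant,
                             Guarantee guarantee,
                             boolean taskStaysOnSameClient) {
        switch (variant) {
            case A_SAME_CLIENT_ONLY:
                return taskStaysOnSameClient;
            case B_ALOS_ONLY:
                return guarantee == Guarantee.AT_LEAST_ONCE;
            default:
                throw new AssertionError("unknown variant: " + variant);
        }
    }
}
```

Under this reading, (B) never consults the assignment at all, which is what makes it the easier interim step before (A) exists.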

People

Assignee: Unassigned

Reporter: A. Sophie Blee-Goldman

Votes: 0

Watchers: 2

Dates

Created: 01/May/21 01:32

Updated: 03/May/21 17:29