Details
Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Description
Within a task, there will typically be a number of records that have been successfully processed through the subtopology but not yet committed. If the next record to be picked up hits an unexpected exception, we close the entire task dirty and essentially throw away all the work we did on those previous records. Instead, we should be able to drop only the corrupted record and commit the offsets up to that point.
Again, for some exceptions such as de/serialization or user code errors, this can be straightforward, since the thread/task is otherwise in a healthy state. Other cases, such as an error in the Producer, will need to be tackled separately, since a Producer error cannot be isolated to a single task.
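To make the proposal above concrete, here is a minimal sketch of the intended behaviour for the "straightforward" case. It is not Kafka Streams code: `SkipCorruptedSketch`, `InputRecord`, `Result`, `RecordLocalException` and `processBatch` are all hypothetical names introduced for illustration. The sketch assumes that an error local to one record can be recognised by its exception type, and that processing continues after the dropped record. Any other exception is left to propagate, standing in for the existing close-dirty path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; none of these are Kafka Streams classes.
final class SkipCorruptedSketch {

    /** A buffered input record, reduced to its offset and payload. */
    static final class InputRecord {
        final long offset;
        final String value;

        InputRecord(long offset, String value) {
            this.offset = offset;
            this.value = value;
        }
    }

    /** What a pass over the buffered records produced. */
    static final class Result {
        final long committableOffset;     // -1 if nothing can be committed yet
        final List<Long> droppedOffsets;  // offsets of records that were skipped

        Result(long committableOffset, List<Long> droppedOffsets) {
            this.committableOffset = committableOffset;
            this.droppedOffsets = droppedOffsets;
        }
    }

    /** Assumed marker for errors isolated to one record (de/serialization, user code). */
    static final class RecordLocalException extends RuntimeException {
        RecordLocalException(String message) {
            super(message);
        }
    }

    /**
     * Processes records in order. A record-local failure drops only that record;
     * the work done on earlier records stays committable. Any other exception
     * propagates to the caller unchanged.
     */
    static Result processBatch(List<InputRecord> records, Consumer<InputRecord> processor) {
        long committable = -1L;
        List<Long> dropped = new ArrayList<>();
        for (InputRecord record : records) {
            try {
                processor.accept(record);
            } catch (RecordLocalException e) {
                // Drop the corrupted record instead of closing the task dirty.
                dropped.add(record.offset);
            }
            // Kafka convention: the committed offset is the offset of the next record to read,
            // so a dropped record is also skipped on restart.
            committable = record.offset + 1;
        }
        return new Result(committable, dropped);
    }
}
```

With records at offsets 0, 1 and 2, where the record at offset 1 throws `RecordLocalException`, the two healthy records are processed, offset 1 is reported as dropped, and the committable offset is 3.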
The challenge here will be handling records sent to the changelog while processing the record that hits an error. We may need to buffer those records so they aren't sent to the RecordCollector until a record has been fully processed; otherwise they will be committed and break EOS semantics (unless we can immediately implement KAFKA-12740).
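One possible shape for the buffering idea is sketched below. This is an illustration under stated assumptions, not the actual RecordCollector API: `BufferedChangelogSketch`, `ChangelogSink`, `BufferingCollector` and `processOne` are hypothetical names, and `ChangelogSink` merely stands in for whatever component really sends changelog records. Changelog writes made while one input record is being processed are held in memory and forwarded only once that record completes; if processing throws, they are discarded. The sketch does not address rolling back the local state store itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; this is not the Kafka Streams RecordCollector.
final class BufferedChangelogSketch {

    /** Stand-in for the component that actually sends changelog records. */
    interface ChangelogSink {
        void send(String key, String value);
    }

    /** One pending changelog write. */
    static final class ChangelogEntry {
        final String key;
        final String value;

        ChangelogEntry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Holds changelog writes until the current input record is fully processed. */
    static final class BufferingCollector {
        private final ChangelogSink downstream;
        private final List<ChangelogEntry> pending = new ArrayList<>();

        BufferingCollector(ChangelogSink downstream) {
            this.downstream = downstream;
        }

        /** Called by state stores during processing; nothing reaches the sink yet. */
        void send(String key, String value) {
            pending.add(new ChangelogEntry(key, value));
        }

        /** The input record succeeded: forward its changelog writes in order. */
        void flushPending() {
            for (ChangelogEntry entry : pending) {
                downstream.send(entry.key, entry.value);
            }
            pending.clear();
        }

        /** The input record failed: its changelog writes never reach the sink. */
        void discardPending() {
            pending.clear();
        }
    }

    /** Runs the processing of one input record; returns true if it was fully processed. */
    static boolean processOne(BufferingCollector collector, Runnable processing) {
        try {
            processing.run();
            collector.flushPending();
            return true;
        } catch (RuntimeException e) {
            collector.discardPending();
            return false;
        }
    }
}
```

The trade-off in this design is that every changelog write is held until its input record finishes, in exchange for never forwarding a write that belongs to a dropped record.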