Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 1.6.4, 1.7.2, 1.8.3, 1.9.2, 1.10.0
Description
This bug also affects the 1.5.x branch.
As described in point 1 here: https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
setTolerableCheckpointFailureNumber(...) and its deprecated predecessor setFailTaskOnCheckpointError(...) are implemented incorrectly. Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can leave operators, and especially sinks with external state, in an inconsistent state. This is true even if they are not used, because of another issue: FLINK-17351.
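For context, this is how the affected option is typically configured on a job (a sketch assuming Flink 1.9+, where setTolerableCheckpointFailureNumber is available on CheckpointConfig; the interval value is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // checkpoint every 60 seconds

// Tolerate up to 3 consecutive checkpoint failures before failing the job.
// Per this issue, tolerated failures can leave a transactional sink
// in an inconsistent state instead of triggering a recovery.
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
```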
If this is combined with an intermittent external system failure, the sink reports an exception, the transaction is lost/aborted, and the sink is left in a failed state. If, by a happy coincidence, the sink then still manages to accept further records, that exception can be lost, and all of the records in those failed checkpoints will be lost forever as well.
For details please check FLINK-17327.
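The failure scenario above can be modeled with a minimal sketch. This is not Flink code; it is a hypothetical model of the tolerable-failure counting semantics, showing that every tolerated checkpoint failure corresponds to an aborted sink transaction whose records are silently dropped, because the job only fails over once the threshold is exceeded:

```java
// Hypothetical model (not actual Flink classes) of tolerable-failure counting:
// the job is only failed once consecutive checkpoint failures exceed the
// threshold passed to setTolerableCheckpointFailureNumber(...).
public class TolerableFailureModel {
    private final int tolerable;        // configured tolerable failure number
    private int consecutiveFailures = 0;
    private int lostTransactions = 0;   // aborted sink transactions whose records are gone

    public TolerableFailureModel(int tolerable) {
        this.tolerable = tolerable;
    }

    /**
     * Returns true if the failure triggers a job failover (state is restored),
     * false if the failure is tolerated and silently swallowed.
     */
    public boolean onCheckpointFailure() {
        consecutiveFailures++;
        if (consecutiveFailures > tolerable) {
            consecutiveFailures = 0;
            return true; // job fails -> operators restart from the last successful checkpoint
        }
        // Tolerated: the sink's in-flight transaction for this checkpoint was
        // aborted, so its buffered records are lost with no failover to replay them.
        lostTransactions++;
        return false;
    }

    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    public int lostTransactions() {
        return lostTransactions;
    }
}
```

With a threshold of 2, the first two failures are tolerated and two transactions' worth of records are dropped before any failover happens, which is the inconsistency this issue describes.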
Issue Links
- causes: FLINK-17327 Kafka unavailability could cause Flink TM shutdown (Closed)
- is caused by: FLINK-4809 Operators should tolerate checkpoint failures (Closed)
- relates to: FLINK-17351 CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts (Closed)