Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 1.6.4, 1.7.2, 1.8.3, 1.9.2, 1.10.0
Description
This bug also affects the 1.5.x branch.
As described in point 1 here: https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
setTolerableCheckpointFailureNumber(...) and its deprecated predecessor setFailTaskOnCheckpointError(...) are implemented incorrectly. Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can leave operators, and especially sinks with external state, in an inconsistent state. This is true even if they are not used, because of another issue: FLINK-17351.
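For context, this is how the affected option is typically configured on a job (a sketch assuming Flink 1.9+, where setTolerableCheckpointFailureNumber is available on CheckpointConfig; the interval value is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // checkpoint every 60 seconds

// Tolerate up to 3 consecutive checkpoint failures before failing the job.
// Per this issue, tolerated failures can leave a transactional sink
// in an inconsistent state instead of triggering a recovery.
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
```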
If this is combined with an intermittent external system failure, the sink reports an exception, the transaction is lost/aborted, and the sink is left in a failed state. If, by a happy coincidence, the sink then still manages to accept further records, that exception can be lost, and all of the records in those failed checkpoints will be lost forever as well.
For details please check FLINK-17327.
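The failure scenario above can be modeled with a minimal sketch. This is not Flink code; it is a hypothetical model of the tolerable-failure counting semantics, showing that every tolerated checkpoint failure corresponds to an aborted sink transaction whose records are silently dropped, because the job only fails over once the threshold is exceeded:

```java
// Hypothetical model (not actual Flink classes) of tolerable-failure counting:
// the job is only failed once consecutive checkpoint failures exceed the
// threshold passed to setTolerableCheckpointFailureNumber(...).
public class TolerableFailureModel {
    private final int tolerable;        // configured tolerable failure number
    private int consecutiveFailures = 0;
    private int lostTransactions = 0;   // aborted sink transactions whose records are gone

    public TolerableFailureModel(int tolerable) {
        this.tolerable = tolerable;
    }

    /**
     * Returns true if the failure triggers a job failover (state is restored),
     * false if the failure is tolerated and silently swallowed.
     */
    public boolean onCheckpointFailure() {
        consecutiveFailures++;
        if (consecutiveFailures > tolerable) {
            consecutiveFailures = 0;
            return true; // job fails -> operators restart from the last successful checkpoint
        }
        // Tolerated: the sink's in-flight transaction for this checkpoint was
        // aborted, so its buffered records are lost with no failover to replay them.
        lostTransactions++;
        return false;
    }

    public void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    public int lostTransactions() {
        return lostTransactions;
    }
}
```

With a threshold of 2, the first two failures are tolerated and two transactions' worth of records are dropped before any failover happens, which is the inconsistency this issue describes.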
Issue Links
- causes: FLINK-17327 Kafka unavailability could cause Flink TM shutdown (Closed)
- is caused by: FLINK-4809 Operators should tolerate checkpoint failures (Closed)
- relates to: FLINK-17351 CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts (Closed)