Details
Bug
Status: Resolved
Critical
Resolution: Cannot Reproduce
1.9.1, 1.10.0
None
Description
Currently, when a job is stopped, in-progress checkpoints are aborted and a synchronous savepoint is started afterwards.
Since the number of tolerable checkpoint failures is 0 by default (see org.apache.flink.streaming.api.environment.CheckpointConfig#getTolerableCheckpointFailureNumber), this triggers a restart of the job if there are any ongoing checkpoints.
As a consequence, if there is an ongoing checkpoint (or savepoint), the stop call only triggers a failover of the job instead of stopping it.
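The interaction described above can be illustrated with a toy model. This is only a sketch of the reported behaviour, not Flink's actual code: the class `StopFailoverModel` and its method are hypothetical names, and the model assumes that every checkpoint aborted by the stop call is counted as one failure and that the job fails over once the count exceeds the tolerable number.

```java
// Toy model of the reported behaviour. NOT Flink's implementation;
// all names in this class are hypothetical.
final class StopFailoverModel {
    // Stands in for the tolerable checkpoint failure number (0 by default).
    private final int tolerableFailures;
    private int failures = 0;

    StopFailoverModel(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Simulates a stop call that first aborts all in-progress checkpoints.
     * Returns true if the aborts push the job into a failover
     * instead of letting the stop proceed.
     */
    boolean stopTriggersFailover(int inProgressCheckpoints) {
        for (int i = 0; i < inProgressCheckpoints; i++) {
            failures++; // assumption: each aborted checkpoint counts as a failure
            if (failures > tolerableFailures) {
                return true; // threshold exceeded: job restarts
            }
        }
        return false; // nothing exceeded the threshold: stop can continue
    }

    public static void main(String[] args) {
        // No checkpoint in flight: the stop goes through.
        System.out.println(new StopFailoverModel(0).stopTriggersFailover(0)); // false
        // One checkpoint in flight with the default tolerance of 0: failover.
        System.out.println(new StopFailoverModel(0).stopTriggersFailover(1)); // true
    }
}
```

With a tolerance of 0, a single in-flight checkpoint is enough to turn the stop into a restart in this model, which matches the symptom reported here.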
Possible options I see are:
a) change the default number of tolerable checkpoint failures to at least the maximum number of concurrent checkpoints
b) do not count checkpoint failures caused by the stop action when checking against the tolerable checkpoint failures
c) do not abort pending checkpoints when stopping a job, but queue the synchronous savepoint after all current in-progress checkpoints
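To make option (b) more concrete, here is a minimal sketch of a failure counter that ignores aborts caused by the stop action. It is an illustration under assumptions, not a proposed patch: `StopAwareFailureCounter`, the `AbortReason` enum and its constants are hypothetical and do not correspond to existing Flink classes.

```java
// Sketch of option (b). All names here are hypothetical, not Flink API.
final class StopAwareFailureCounter {
    // Hypothetical reasons for which a checkpoint may be aborted.
    enum AbortReason { CHECKPOINT_DECLINED, CHECKPOINT_EXPIRED, JOB_STOPPING }

    private final int tolerableFailures;
    private int countedFailures = 0;

    StopAwareFailureCounter(int tolerableFailures) {
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records an aborted checkpoint.
     * Returns true if the job should fail over as a result.
     */
    boolean onCheckpointAborted(AbortReason reason) {
        if (reason == AbortReason.JOB_STOPPING) {
            return false; // aborts caused by the stop action are not counted
        }
        countedFailures++;
        return countedFailures > tolerableFailures;
    }

    int countedFailures() {
        return countedFailures;
    }

    public static void main(String[] args) {
        StopAwareFailureCounter counter = new StopAwareFailureCounter(0);
        // An abort caused by stopping the job does not trigger a failover.
        System.out.println(counter.onCheckpointAborted(AbortReason.JOB_STOPPING)); // false
        // A genuine failure still does, given a tolerance of 0.
        System.out.println(counter.onCheckpointAborted(AbortReason.CHECKPOINT_EXPIRED)); // true
    }
}
```

Option (a), by contrast, would need no new logic in user code: a job can already raise the threshold itself through the existing `CheckpointConfig#setTolerableCheckpointFailureNumber` setter, although that also makes the job more tolerant of genuine checkpoint failures.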