Currently, if Flink cannot complete a checkpoint, the job fails and goes through recovery.
To reduce the impact of less stable storage infrastructure on Flink's performance, Flink should be able to tolerate a certain number of failed checkpoints and simply keep executing.
This should be controllable via a parameter, for example:
A value of -1 could indicate that Flink tolerates an unlimited number of checkpoint failures.
The default value should still be 0, to stay compatible with the existing behavior.
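
The intended semantics could be sketched roughly as follows. This is only an illustration, not existing Flink code: the class name `CheckpointFailureTolerance` and its methods are hypothetical. It also assumes that only consecutive failures are counted and that a successful checkpoint resets the counter, which the proposal above does not specify.

```java
/**
 * Hypothetical sketch of the proposed tolerance logic (not part of Flink).
 * Assumption: failures are counted consecutively and reset on success.
 */
final class CheckpointFailureTolerance {

    /** Proposed sentinel value: tolerate any number of failed checkpoints. */
    static final int UNLIMITED = -1;

    private final int tolerableFailures;
    private int consecutiveFailures = 0;

    CheckpointFailureTolerance(int tolerableFailures) {
        if (tolerableFailures < UNLIMITED) {
            throw new IllegalArgumentException(
                "tolerableFailures must be -1 (unlimited) or >= 0, got " + tolerableFailures);
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records one failed checkpoint.
     *
     * @return true if the job should fail and recover, false if it should keep executing
     */
    boolean onCheckpointFailure() {
        if (tolerableFailures == UNLIMITED) {
            return false; // never fail the job because of checkpoint failures
        }
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    /** Records one completed checkpoint and resets the failure counter. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }
}
```

With a value of 0, the first failed checkpoint already triggers failure and recovery, which matches the current behavior. With a value of 2, the job would keep running through two failed checkpoints in a row and fail on the third.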