FLINK-17350

StreamTask should always fail immediately on failures in synchronous part of a checkpoint


Details

    Failures in the synchronous part of checkpointing (such as an exception thrown by an operator) will fail the Task (and the job) immediately, regardless of the configuration parameters. Since Flink 1.5, such failures could be ignored by setting `setTolerableCheckpointFailureNumber(...)` or its deprecated predecessor `setFailTaskOnCheckpointError(...)`. Now both options affect only asynchronous failures.
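
The rule in this note can be modelled in a few lines. This is a minimal sketch, not Flink's implementation: `CheckpointFailurePolicy`, `Phase`, and `shouldFailTask` are hypothetical names, and counting consecutive asynchronous failures against the tolerable number is an assumption of the sketch.

```java
/** Hypothetical model of the failure-handling rule; not a Flink class. */
final class CheckpointFailurePolicy {

    /** The part of a checkpoint in which a failure occurred. */
    enum Phase { SYNCHRONOUS, ASYNCHRONOUS }

    private final int tolerableFailures;
    private int consecutiveAsyncFailures = 0;

    CheckpointFailurePolicy(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Returns true if the task must fail after a checkpoint failure in the given phase. */
    boolean shouldFailTask(Phase phase) {
        if (phase == Phase.SYNCHRONOUS) {
            // Synchronous failures are never tolerated, whatever the configuration says.
            return true;
        }
        // Asynchronous failures are counted against the configured tolerance
        // (assumption of this sketch: only consecutive failures are counted).
        consecutiveAsyncFailures++;
        return consecutiveAsyncFailures > tolerableFailures;
    }

    /** A successful checkpoint resets the asynchronous failure counter. */
    void onCheckpointSuccess() {
        consecutiveAsyncFailures = 0;
    }
}
```

With a tolerance of 2, a synchronous failure fails the task at once, while the task only fails on the third consecutive asynchronous failure.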

    Description

      This bug also affects the 1.5.x branch.

      As described in point 1 here: https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576

      setTolerableCheckpointFailureNumber(...) and its deprecated predecessor setFailTaskOnCheckpointError(...) are implemented incorrectly. Since Flink 1.5 (https://issues.apache.org/jira/browse/FLINK-4809) they can cause operators (and especially sinks with an external state) to end up in an inconsistent state. That is true even if they are not used, because of another issue: FLINK-17351.

      Consider what happens when this is combined with an intermittent failure of an external system. The sink reports an exception, the transaction is lost or aborted, and the sink is left in a failed state. If the sink then happens to accept further records, the exception can be lost, and all of the records in the failed checkpoints are lost forever as well.
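
The scenario above can be reproduced with a toy model. This is a sketch under stated assumptions, not Flink code: `FlakyTransactionalSink`, `LostRecordsDemo`, and every method on them are hypothetical, and the external system is simulated with a boolean flag.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sink that buffers records in a transaction against a flaky external system. */
final class FlakyTransactionalSink {
    private final List<String> openTransaction = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    private boolean externalSystemDown = false;

    void setExternalSystemDown(boolean down) {
        externalSystemDown = down;
    }

    void write(String record) {
        openTransaction.add(record);
    }

    /** Models the synchronous part of a checkpoint: commit the open transaction. */
    void snapshot() {
        if (externalSystemDown) {
            // The external transaction is aborted, so its records are gone.
            openTransaction.clear();
            throw new IllegalStateException("external system unavailable, transaction aborted");
        }
        committed.addAll(openTransaction);
        openTransaction.clear();
    }

    List<String> committed() {
        return new ArrayList<>(committed);
    }
}

final class LostRecordsDemo {
    /** Runs two checkpoints; the external system is down during the first one. */
    static List<String> run(boolean failImmediately) {
        FlakyTransactionalSink sink = new FlakyTransactionalSink();
        sink.write("a");
        sink.setExternalSystemDown(true);
        try {
            sink.snapshot();
        } catch (IllegalStateException e) {
            if (failImmediately) {
                // The task fails here; a restart from the last good checkpoint could replay "a".
                throw e;
            }
            // Otherwise the exception is swallowed and processing continues.
        }
        sink.setExternalSystemDown(false);
        sink.write("b");
        sink.snapshot();
        return sink.committed();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [b]: record "a" was silently lost
    }
}
```

When the failure is tolerated, the second checkpoint succeeds and nothing signals that record "a" was dropped. When the exception propagates immediately, the loss cannot go unnoticed.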

      For details please check FLINK-17327.

            People

              pnowojski Piotr Nowojski