Flink / FLINK-23189

Count and fail the task when the JobManager disk has an error


Details

      In previous versions, IOExceptions thrown from the JobManager would not fail the entire job. We changed the way we track those exceptions, and they now increase the number of checkpoint failures.

      The number of tolerable checkpoint failures can be adjusted or disabled via org.apache.flink.streaming.api.environment.CheckpointConfig#setTolerableCheckpointFailureNumber (which is set to 0 by default).
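
      As a minimal sketch of how the setting above might be applied in a job, the snippet below raises the tolerable failure count on a streaming environment. The class name, the 60-second checkpoint interval, and the value 3 are assumptions chosen for illustration, not recommendations from this issue; it needs the Flink streaming dependency on the classpath.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical example class; only the Flink calls themselves come from the Flink API.
public class TolerableFailuresExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (interval picked for illustration only).
        env.enableCheckpointing(60_000L);

        // Allow up to 3 checkpoint failures before the job is failed
        // (3 is an illustrative value; the release note states the default is 0).
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setTolerableCheckpointFailureNumber(3);
    }
}
```

      With the default of 0, the first counted failure, including a JobManager IOException after this change, would fail the job, so jobs that should ride out transient disk problems may want a higher value.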

    Description

      When the JobManager disk has an error, triggerCheckpoint throws an IOException and fails. This causes a TRIGGER_CHECKPOINT_FAILURE, but that failure does not fail the job. Users can hardly notice the error unless they look at the JobManager logs. To avoid this, I propose that we identify these IOException cases and increase the failureCounter, which can eventually fail the job.
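
      The proposal can be illustrated with a small, self-contained sketch. This is not Flink's actual failure-handling code: the class CheckpointFailureCounter, its method names, and the reset-on-success behavior are assumptions made for illustration. It only shows the idea of counting IOException trigger failures and signalling a job failure once the count exceeds the tolerable number.

```java
import java.io.IOException;

/** Hypothetical, simplified stand-in for a checkpoint failure counter; not Flink's real class. */
public class CheckpointFailureCounter {
    private final int tolerableFailures;
    private int continuousFailures = 0;

    public CheckpointFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a failed checkpoint trigger. Only IOExceptions are counted here,
     * mirroring the proposal. Returns true when the job should be failed.
     */
    public boolean onTriggerFailure(Exception cause) {
        if (cause instanceof IOException) {
            continuousFailures++;
        }
        return continuousFailures > tolerableFailures;
    }

    /** Assumption: a successful checkpoint resets the counter. */
    public void onCheckpointSuccess() {
        continuousFailures = 0;
    }

    public int getContinuousFailures() {
        return continuousFailures;
    }

    public static void main(String[] args) {
        CheckpointFailureCounter counter = new CheckpointFailureCounter(1);
        // First disk error is tolerated, the second one exceeds the limit.
        System.out.println(counter.onTriggerFailure(new IOException("disk error"))); // false
        System.out.println(counter.onTriggerFailure(new IOException("disk error"))); // true
    }
}
```

      With a tolerable count of 0, the very first counted IOException returns true, which makes the disk error visible as a job failure instead of only a log entry.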

      Attachments

        1. exception.txt
          6 kB
          zlzhang0122

            People

              zlzhang0122
              Votes: 0
              Watchers: 7
