Flink / FLINK-23189

Count and fail the task when a disk error occurs on the JobManager


Details

      In previous versions, IOExceptions thrown from the JobManager would not fail the entire job. We changed how these exceptions are tracked, and they now count toward the number of checkpoint failures.

      The number of tolerable checkpoint failures can be adjusted or disabled via org.apache.flink.streaming.api.environment.CheckpointConfig#setTolerableCheckpointFailureNumber (which is set to 0 by default).
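      As an illustration of the setting named above, a job might raise the tolerance roughly as follows. This is a minimal sketch that assumes the Flink streaming API is on the classpath; the class name, the 60-second checkpoint interval, and the value 3 are arbitrary choices for the example, not recommendations from this issue.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableFailureConfigExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (interval chosen only for illustration).
        env.enableCheckpointing(60_000L);

        // Tolerate up to 3 checkpoint failures before the job is failed.
        // The release note above states that the default is 0.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
```

      With the default of 0, a single counted checkpoint failure is enough to fail the job, which now includes the JobManager-side IOExceptions described in this issue.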

    Description

      When the JobManager disk fails, triggerCheckpoint throws an IOException and the checkpoint fails with TRIGGER_CHECKPOINT_FAILURE, but this failure does not cause the job to fail. Users can hardly notice the error unless they read the JobManager logs. To avoid this, I propose that we identify these IOException cases and increase the failureCounter, which can eventually fail the job.
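      The proposed counting behavior could look roughly like the sketch below. It is not Flink's actual implementation: the class TolerableFailureCounter and its methods are hypothetical names introduced here, and resetting the counter after a successful checkpoint is an assumption made for the example.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: counts IOException trigger failures against a tolerable limit. */
public class TolerableFailureCounter {
    private final int tolerableFailures;
    private final AtomicInteger continuousFailures = new AtomicInteger(0);

    public TolerableFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a checkpoint trigger failure.
     * Returns true when the job should be failed, i.e. the number of
     * counted failures exceeds the tolerable limit.
     */
    public boolean onTriggerFailure(Throwable cause) {
        // This sketch counts only IOExceptions, such as a JobManager disk error.
        if (!(cause instanceof IOException)) {
            return false;
        }
        return continuousFailures.incrementAndGet() > tolerableFailures;
    }

    /** Assumption: a successful checkpoint resets the counter. */
    public void onCheckpointSuccess() {
        continuousFailures.set(0);
    }

    public int getContinuousFailures() {
        return continuousFailures.get();
    }
}
```

      With a limit of 0, the first IOException makes onTriggerFailure return true, so the disk error surfaces as a job failure instead of staying hidden in the JobManager logs.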


          People

            zlzhang0122