[FLINK-23189] Count and fail the task when the disk is error on JobManager - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.2, 1.13.1
Fix Version/s: 1.14.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Release Note:

In previous versions, IOExceptions thrown from the JobManager would not fail the entire job. We changed how these exceptions are tracked, and they now count toward the number of checkpoint failures.

The number of tolerable checkpoint failures can be adjusted or disabled via org.apache.flink.streaming.api.environment.CheckpointConfig#setTolerableCheckpointFailureNumber (set to 0 by default).
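
As a rough illustration of the setting named above, the following sketch raises the tolerable failure number on a streaming job. It assumes the Flink streaming Java API is on the classpath; the class name, the 60-second checkpoint interval, and the value 3 are arbitrary choices for the example, not recommendations from this issue.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical example class; only the CheckpointConfig setter comes from the release note.
final class TolerableFailuresExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (illustrative interval).
        env.enableCheckpointing(60_000L);

        // Allow up to 3 checkpoint failures before the job is failed (illustrative value).
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // Define sources, transformations, and sinks here, then call env.execute(...).
    }
}
```

With the default of 0, the first counted checkpoint failure is enough to fail the job, so a job that should ride out transient JobManager disk errors needs a value above 0.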

Description

When the JobManager's disk fails, triggerCheckpoint throws an IOException and the checkpoint fails with TRIGGER_CHECKPOINT_FAILURE, but this failure does not cause the job to fail. Users are unlikely to notice the error unless they check the JobManager logs. To avoid this, I propose identifying these IOException cases and incrementing the failureCounter, so that the job eventually fails.
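
The proposal can be pictured with a small, self-contained model. This is a simplified sketch and not Flink's actual implementation: the class and method names are hypothetical, and it assumes that failures are counted consecutively, that a successful checkpoint resets the count, and that the job fails once the count exceeds the tolerable number.

```java
import java.io.IOException;

// Hypothetical, simplified model of a checkpoint failure counter (not Flink's class).
final class CheckpointFailureCounter {
    private final int tolerableFailures;
    private int consecutiveFailures;

    CheckpointFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    // Records a failed checkpoint trigger and returns true when the job should fail.
    // In this sketch only IOExceptions are counted, mirroring the case in this issue.
    boolean onCheckpointFailure(Exception cause) {
        if (cause instanceof IOException) {
            consecutiveFailures++;
        }
        return consecutiveFailures > tolerableFailures;
    }

    // Assumption: a successful checkpoint resets the consecutive-failure count.
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }
}

final class FailureCounterDemo {
    public static void main(String[] args) {
        CheckpointFailureCounter counter = new CheckpointFailureCounter(1);
        // First disk error is tolerated, the second one exceeds the limit.
        System.out.println(counter.onCheckpointFailure(new IOException("disk error"))); // false
        System.out.println(counter.onCheckpointFailure(new IOException("disk error"))); // true
    }
}
```

Under this model a tolerable number of 0 fails the job on the first counted IOException, which makes the disk problem visible without reading the JobManager logs.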