[FLINK-23189] Count and fail the task when the disk is error on JobManager - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.12.2, 1.13.1
Fix Version/s: 1.14.0
Component/s: Runtime / Checkpointing
Labels:
- pull-request-available

Release Note:

In previous versions, IOExceptions thrown from the JobManager would not fail the entire job. We changed how those exceptions are tracked, and they now count toward the number of checkpoint failures.

The number of tolerable checkpoint failures can be adjusted or disabled via: org.apache.flink.streaming.api.environment.CheckpointConfig#setTolerableCheckpointFailureNumber (which is set to 0 by default).
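As an illustration of the setting named above, the following is a minimal sketch of a job that raises the tolerance, assuming `flink-streaming-java` is on the classpath. The class name, the 60-second checkpoint interval, the tolerance of 3, and the toy pipeline are illustrative assumptions, not values taken from this issue.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical example class; only the CheckpointConfig setter comes from the release note.
public class TolerableFailuresExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 seconds (illustrative interval).
        env.enableCheckpointing(60_000L);

        // Allow up to 3 checkpoint failures before the job is failed
        // (illustrative value; the release note states the default is 0).
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setTolerableCheckpointFailureNumber(3);

        // Toy pipeline so the job has something to run.
        env.fromElements(1, 2, 3).print();
        env.execute("tolerable-failures-example");
    }
}
```

With the default of 0, the behavior described in this release note means a single counted IOException on the JobManager can fail the job, so jobs that previously survived such errors may need an explicit tolerance.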


Description

When the JobManager's disk fails, triggerCheckpoint throws an IOException and the checkpoint fails with TRIGGER_CHECKPOINT_FAILURE, but this failure does not cause the job to fail. Users are unlikely to notice the error unless they read the JobManager logs. To avoid this, I propose that we identify these IOException cases and increment the failureCounter, so that the job eventually fails.
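The proposal can be pictured with a small, self-contained sketch. This is not Flink's actual implementation: `TolerableFailureCounter` and its methods are hypothetical names, and the rule that only IOExceptions are counted is a simplifying assumption made for this illustration.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class FailureCounterSketch {

    /** Hypothetical stand-in for the JobManager-side failure bookkeeping. */
    static final class TolerableFailureCounter {
        private final int tolerableFailures;
        private final AtomicInteger continuousFailures = new AtomicInteger();

        TolerableFailureCounter(int tolerableFailures) {
            if (tolerableFailures < 0) {
                throw new IllegalArgumentException("tolerableFailures must be >= 0");
            }
            this.tolerableFailures = tolerableFailures;
        }

        /** Records a failed checkpoint trigger; returns true once the job should be failed. */
        boolean onTriggerFailure(Throwable cause) {
            // Assumption for this sketch: only I/O errors are counted.
            if (!(cause instanceof IOException)) {
                return false;
            }
            return continuousFailures.incrementAndGet() > tolerableFailures;
        }

        /** A successful checkpoint resets the run of failures. */
        void onCheckpointSuccess() {
            continuousFailures.set(0);
        }

        int currentFailures() {
            return continuousFailures.get();
        }
    }

    public static void main(String[] args) {
        // Tolerate two failures; the third one should fail the job.
        TolerableFailureCounter counter = new TolerableFailureCounter(2);
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean failJob = counter.onTriggerFailure(new IOException("disk error"));
            System.out.println("attempt " + attempt + " -> fail job: " + failJob);
        }
    }
}
```

With a tolerance of 0, the first counted IOException already exceeds the limit, which is how a broken JobManager disk would surface as a job failure instead of only appearing in the logs.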

Attachments

exception.txt
07/Jul/21 02:20
6 kB
zlzhang0122

Issue Links

is duplicated by

FLINK-24249 login from keytab fail when disk damage

Closed

is related to

FLINK-24344 Handling of IOExceptions when triggering checkpoints doesn't cause job failover

Closed

links to

GitHub Pull Request #16637

GitHub Pull Request #16829

People

Assignee: zlzhang0122

Reporter: zlzhang0122

Votes: 0

Watchers: 7

Dates

Created: 30/Jun/21 05:18

Updated: 22/Sep/21 13:52

Resolved: 14/Aug/21 20:37