[FLINK-21846] Rethink whether failure of ExecutionGraph creation in Adaptive Scheduler should directly fail the job - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Not a Priority
Resolution: Unresolved
Affects Version/s: 1.13.0
Fix Version/s: None
Component/s: Runtime / Coordination
Labels:

Description

Currently, the AdaptiveScheduler fails the job if the ExecutionGraph creation fails. This can be problematic because the failure may stem from a transient problem (e.g. the filesystem is temporarily unavailable). With a transient problem, a job rescaling could therefore lead to a job failure, which would be surprising for users. Instead, I would expect Flink to retry the ExecutionGraph creation.

One idea would be to ask the restart policy how to treat the failure and whether to retry the ExecutionGraph creation.
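To make the restart-policy idea concrete, here is a minimal sketch of a creation step that consults a policy after each failure. All names (GraphCreationRetrySketch, RetryPolicy, fixedAttempts, createWithRetry) are hypothetical and are not Flink APIs; a real scheduler would presumably go back to waiting for resources with some backoff rather than loop synchronously.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are Flink APIs. */
final class GraphCreationRetrySketch {

    /** Stand-in for a restart policy: decides whether another attempt is allowed. */
    interface RetryPolicy {
        boolean canRetry(int failedAttempts, Throwable cause);
    }

    /** Assumed example policy: allow up to maxRetries retries after the first failure. */
    static RetryPolicy fixedAttempts(int maxRetries) {
        return (failedAttempts, cause) -> failedAttempts <= maxRetries;
    }

    /** Runs the factory until it succeeds or the policy refuses another attempt. */
    static <T> T createWithRetry(Supplier<T> factory, RetryPolicy policy) {
        int failedAttempts = 0;
        while (true) {
            try {
                return factory.get();
            } catch (RuntimeException e) {
                failedAttempts++;
                if (!policy.canRetry(failedAttempts, e)) {
                    // The policy gave up: surface the last failure to the caller.
                    throw e;
                }
            }
        }
    }
}
```

With fixedAttempts(2), a factory that fails twice and then succeeds returns its result on the third call, while a factory that keeps failing has its last exception rethrown after the third attempt.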

One thing to keep in mind, though, is that some failures might be permanent (e.g. a wrongly specified savepoint path). In such a case we would ideally fail immediately. One way to address this problem could be to try to restore the savepoint when we create the AdaptiveScheduler.
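The eager-restore idea could look roughly like the following sketch, in which the savepoint is loaded exactly once in the constructor so that a permanently broken path fails at scheduler creation rather than during a later rescaling. EagerRestoreSketch, its savepointLoader parameter, and the string-based createExecutionGraph are hypothetical placeholders, not the actual AdaptiveScheduler code.

```java
import java.util.function.Supplier;

/** Hypothetical sketch; not the real AdaptiveScheduler. */
final class EagerRestoreSketch<S> {

    // State restored from the savepoint, loaded once and reused afterwards.
    private final S restoredState;

    EagerRestoreSketch(Supplier<S> savepointLoader) {
        // Load eagerly: a wrongly specified savepoint path fails here,
        // at construction time, instead of at a later graph creation.
        this.restoredState = savepointLoader.get();
    }

    /** Placeholder for graph creation; it only reuses the already restored state. */
    String createExecutionGraph(int parallelism) {
        return "graph(parallelism=" + parallelism + ", state=" + restoredState + ")";
    }
}
```

Under this assumption, later graph creations (for example after a rescaling) no longer touch the savepoint location, so any remaining creation failures are more plausibly transient and can be handed to the retry logic.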

Issue Links

is related to: FLINK-21075 FLIP-160: Adaptive scheduler (Closed)

People

Assignee: Unassigned
Reporter: Till Rohrmann
Votes: 0
Watchers: 2

Dates

Created: 17/Mar/21 12:04
Updated: 19/Dec/21 10:39