[FLINK-24063] Reconsider the behavior of ClusterEntrypoint#startCluster failure handler - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: Runtime / Coordination
Labels: None

Description

If runCluster fails, it triggers the STOP_APPLICATION behavior. Consider the following case:

A job has been running for a long time.
The JobManager then encounters a fatal error, such as a network problem, which may bring the JobManager process down.
A new process is started by the resource framework, such as YARN or Kubernetes, but it fails in ClusterEntrypoint#startCluster because of the same network problem.
The job then transitions to the FAILED status.

This means a streaming job stops running for good because of a single fatal error, which is fragile. I think we should add a retry mechanism so that the job does not fail fast a second time. That would let it ride out external errors that persist for a period of time.
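To make the proposal concrete, here is a minimal sketch of what such a retry could look like. It is not Flink code. The class name StartupRetry, the method runWithRetry, and its parameters are all hypothetical, and it only assumes that the startup step can be wrapped in a Callable. It retries a failing action a bounded number of times with capped exponential backoff, and rethrows the last failure once the attempts are used up.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of Flink: retries a startup action with capped exponential backoff. */
final class StartupRetry {

    private StartupRetry() {}

    /**
     * Calls the action up to maxAttempts times. After each failed attempt except the last,
     * it sleeps for the current delay, then doubles the delay up to maxDelay.
     * If every attempt fails, the exception from the last attempt is rethrown.
     */
    static <T> T runWithRetry(
            Callable<T> action, int maxAttempts, Duration initialDelay, Duration maxDelay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Duration delay = initialDelay;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption: restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delay.toMillis());
                delay = delay.multipliedBy(2);
                if (delay.compareTo(maxDelay) > 0) {
                    delay = maxDelay;
                }
            }
        }
        throw lastFailure;
    }
}
```

With something like this, the startup path could be wrapped as StartupRetry.runWithRetry(startupAction, 5, Duration.ofSeconds(1), Duration.ofSeconds(30)), where startupAction stands for whatever part of the startup is safe to repeat. Only after the last attempt fails would the existing failure handling take over. The attempt count and delays are placeholders and would presumably need to be configurable.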

People

Assignee: Unassigned

Reporter: WenJun Min

Votes: 0

Watchers: 3

Dates

Created: 31/Aug/21 05:38

Updated: 03/Sep/21 05:35