Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Description
If runCluster fails, it triggers the STOP_APPLICATION behavior. But consider the following case:
- A job has been running for a long time.
- The JobManager then encounters a fatal error, such as a network problem, which brings the JobManager process down.
- The resource framework (e.g. YARN or Kubernetes) starts a new process, but it fails in ClusterEntrypoint#startCluster because of the same network problem.
- The job then transitions to the FAILED status.
This means a streaming job stops running for good after a single fatal error, which is fragile. I think we should add a retry mechanism so that the job does not fail fast a second time during startup, which would let it ride out external errors that persist for a period of time.
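
To make the proposal concrete, here is a minimal sketch of a bounded retry around a startup action. It is only an illustration: `StartupRetry` and `runWithRetry` are hypothetical names, not existing Flink APIs, and the attempt count and delay are assumed to come from some new configuration option that does not exist today.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Flink. */
final class StartupRetry {

    private StartupRetry() {}

    /**
     * Runs the startup action, waiting a fixed delay between failed attempts.
     * Rethrows the last failure once maxAttempts is exhausted.
     */
    static <T> T runWithRetry(Callable<T> startup, int maxAttempts, Duration delay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return startup.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Give a transient external problem (e.g. network) time to clear.
                    Thread.sleep(delay.toMillis());
                }
            }
        }
        throw last;
    }
}
```

In this sketch the startup step would be wrapped as something like `StartupRetry.runWithRetry(() -> { /* start the cluster */ return null; }, 5, Duration.ofSeconds(30))`, and the existing STOP_APPLICATION path would only be taken after the last attempt fails. Whether a fixed delay or a backoff is more appropriate is open for discussion.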