Details
- Type: Sub-task
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Currently, Vortex treats all failures the same way, through FailedEvaluatorHandler. We should handle different types of failures differently.
Type 1: Resource preemption
We react according to a configured policy (e.g. re-request indefinitely). If needed, we can even add a new event handler to the REEF Driver, named PreemptedEvaluatorHandler, just for this type (as a separate JIRA issue outside of the Vortex umbrella JIRA).
Type 2: Internal Vortex code failure
This can happen nondeterministically and can even result in an infinite loop of resource releases and requests. In such a case, we should probably shut down the Driver immediately, both for ease of debugging and to prevent it from interfering with other jobs in the cluster.
Type 3: Other types of failures
If the failure is caused by an issue such as OOM, we should treat that case differently as well.
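The three-way split above could be sketched roughly as follows. This is only an illustration in plain Java, not existing Vortex or REEF code: FailureType, Action, PreemptionPolicy and decide are all hypothetical names, the sketch assumes the failure has already been classified (how to detect each type is not decided in this issue), and shutting down once a bounded policy is exhausted is an assumption as well.

```java
import java.util.Objects;

/** Hypothetical sketch: none of these types exist in REEF or Vortex. */
public final class FailureDispatchSketch {

  /** The three failure categories listed in the description. */
  enum FailureType { PREEMPTION, INTERNAL_VORTEX_ERROR, OTHER }

  /** Possible Driver reactions (hypothetical names). */
  enum Action { RE_REQUEST_EVALUATOR, SHUT_DOWN_DRIVER, HANDLE_RESOURCE_FAILURE }

  /** Configured policy deciding whether a preempted resource is requested again. */
  interface PreemptionPolicy {
    boolean shouldReRequest(int preemptionsSoFar);
  }

  /** The "re-request infinitely" policy mentioned under Type 1. */
  static final PreemptionPolicy RE_REQUEST_INFINITELY = n -> true;

  /** A bounded alternative, included only to show that the policy is pluggable. */
  static PreemptionPolicy upTo(final int maxPreemptions) {
    return n -> n < maxPreemptions;
  }

  /** Maps an already-classified failure to a Driver reaction. */
  static Action decide(final FailureType type,
                       final PreemptionPolicy policy,
                       final int preemptionsSoFar) {
    Objects.requireNonNull(type, "type");
    Objects.requireNonNull(policy, "policy");
    switch (type) {
      case PREEMPTION:
        // Type 1: follow the configured policy.
        // Assumption: once the policy gives up, the Driver shuts down.
        return policy.shouldReRequest(preemptionsSoFar)
            ? Action.RE_REQUEST_EVALUATOR
            : Action.SHUT_DOWN_DRIVER;
      case INTERNAL_VORTEX_ERROR:
        // Type 2: fail fast to avoid an endless release+request loop.
        return Action.SHUT_DOWN_DRIVER;
      default:
        // Type 3: e.g. OOM; handled separately from the two cases above.
        return Action.HANDLE_RESOURCE_FAILURE;
    }
  }

  public static void main(final String[] args) {
    System.out.println(decide(FailureType.PREEMPTION, RE_REQUEST_INFINITELY, 3));
    System.out.println(decide(FailureType.INTERNAL_VORTEX_ERROR, RE_REQUEST_INFINITELY, 0));
    System.out.println(decide(FailureType.OTHER, RE_REQUEST_INFINITELY, 0));
  }
}
```

In a real Driver, the decision would presumably be made inside the handler that receives the failed evaluator, with a dedicated PreemptedEvaluatorHandler taking over the Type 1 branch if it is ever added.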
Issue Links
- is related to REEF-836 Add Preemption API (Open)