[REEF-1511] timeout for Task Shutdown during IMRU recovery - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.16
Component/s: IMRU
Labels:
- FT

Description

This is related to the fault tolerance implementation in PR-1251.
Currently, the recovery logic in the IMRU driver waits for all tasks to reach a final state (failed or completed) before restarting the job (see the AreAllTasksInFinalState() check in the TryRecovery() method).
We've seen the driver hang for a long time waiting for the last few tasks to finalize.
Aborting tasks should be quick, so there is a bug there, but we can also add logic to the driver so that it does not wait for all tasks to complete.
For instance: if 5% of tasks have not reported a final state within the expected period, release the corresponding evaluators and proceed with a new job retry.
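
The proposed rule can be sketched roughly as follows. This is an illustration only, not the REEF IMRU driver code (which is .NET). Every name here (`TaskState`, `RecoveryDecision`, `decide_recovery`, `timeout_seconds`, `max_straggler_fraction`) is hypothetical, the 60-second default is arbitrary, and "5% of tasks" is read as "at most 5%".

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    """Hypothetical task states; only COMPLETED and FAILED are final."""
    RUNNING = "running"
    WAITING_TO_CLOSE = "waiting_to_close"
    COMPLETED = "completed"
    FAILED = "failed"


FINAL_STATES = {TaskState.COMPLETED, TaskState.FAILED}


@dataclass
class RecoveryDecision:
    proceed: bool                                   # start the next job retry now?
    stragglers: list = field(default_factory=list)  # tasks whose evaluators to release


def decide_recovery(task_states, elapsed_seconds,
                    timeout_seconds=60.0, max_straggler_fraction=0.05):
    """Decide whether recovery may proceed without waiting for every task.

    task_states maps a task id to its TaskState. elapsed_seconds is the time
    since shutdown of the tasks was requested. Both thresholds are assumed
    values, not taken from the REEF code base.
    """
    stragglers = sorted(tid for tid, state in task_states.items()
                        if state not in FINAL_STATES)
    if not stragglers:
        # Existing behaviour: every task is in a final state.
        return RecoveryDecision(True, [])

    timed_out = elapsed_seconds >= timeout_seconds
    few_enough = len(stragglers) <= max_straggler_fraction * len(task_states)
    if timed_out and few_enough:
        # Proposed behaviour: give up on the last few tasks.
        return RecoveryDecision(True, stragglers)

    return RecoveryDecision(False, [])
```

In this sketch the driver would call `decide_recovery` periodically while recovering, release the evaluators of the tasks listed in `stragglers` when `proceed` is true, and then start the retry. If more than the allowed fraction of tasks is still running after the timeout, it keeps waiting, since that points to a larger problem than a few slow shutdowns.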

Issue Links

Is contained by

REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)

People

Assignee: Julia Wang

Reporter: Andrey

Votes: 0

Watchers: 2

Dates

Created: 09/Aug/16 17:13

Updated: 16/Dec/16 19:41

Resolved: 16/Dec/16 19:41