Description
App error
a. Any transient app error should be handled by retry logic inside the code. If the operation still fails after the retries, restarting the server won't help.
b. Any expected app exception should be treated as not recoverable.
c. Unexpected app exceptions should also be treated as not recoverable.
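The retry rule in the list above can be sketched roughly as follows. This is a minimal illustration in Python, not the IMRU implementation: `TransientError`, `UnrecoverableAppError`, `run_with_retry`, and the `max_attempts` and `delay_seconds` parameters are all hypothetical names introduced for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for app errors that are worth retrying."""


class UnrecoverableAppError(Exception):
    """Hypothetical signal that a restart would not help."""


def run_with_retry(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation(), retrying only when it raises TransientError."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError as error:
            if attempt == max_attempts:
                # Retries are exhausted, so the failure is reported as
                # not recoverable instead of triggering a restart.
                raise UnrecoverableAppError(
                    f"still failing after {max_attempts} attempts"
                ) from error
            time.sleep(delay_seconds)
        except Exception as error:
            # Expected and unexpected app exceptions are not retried.
            raise UnrecoverableAppError("non-transient app exception") from error
```

The point of the sketch is that the retry happens inside the task code, so anything that escapes as `UnrecoverableAppError` has already been retried or was never retryable.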
Resource issue
a. The Evaluator is killed by the RM (resource manager). We should respond to this case.
System Error
a. System issue causing a machine crash
b. Other system errors we encountered during the 10 months of data testing: what exact events were received?
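One way to picture how these categories could drive a restart decision is a small lookup table. The sketch below is an assumption-heavy illustration in Python: only the "app errors are not recoverable" entry comes from the description, while restarting on resource and system failures is an assumption suggested by the parent issue's title. `FailureCategory`, `RESTART_EVALUATOR`, and `should_restart` are hypothetical names.

```python
from enum import Enum


class FailureCategory(Enum):
    """Hypothetical labels for the failure categories in the description."""
    APP_ERROR = "app error"
    RESOURCE_ISSUE = "resource issue"
    SYSTEM_ERROR = "system error"


# APP_ERROR -> False follows the description (not recoverable).
# The other two entries are assumptions made for this sketch only.
RESTART_EVALUATOR = {
    FailureCategory.APP_ERROR: False,
    FailureCategory.RESOURCE_ISSUE: True,
    FailureCategory.SYSTEM_ERROR: True,
}


def should_restart(category):
    """Return True if a failed evaluator in this category would be restarted."""
    return RESTART_EVALUATOR[category]
```

Keeping the policy in one table makes it easy to revise once the exact events received for each system error are known.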
Issue Links
- Is contained by: REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)