Description
Assuming there are no bugs in the REEF/IMRU system, we would like to catch all known exceptions and convert them into one of three categories: IMRUTaskAppException (any exception caused by user code; not recoverable), IMRUTaskGroupCommunicationException (exceptions caused by group communications; recoverable), and IMRUTaskSystemException (any other system or transient error; recoverable).
Because new exception types may appear later, any exception that does not match a known category will be treated as recoverable by default.
Most of the code is already implemented in REEF-1251. What remains is a refactor and cleanup.
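The classification rule described above can be sketched as follows. This is only an illustration in Python, not the REEF.NET code from REEF-1251: the three IMRUTask* names come from this issue, while `UserCodeError`, `GroupCommunicationError`, the `recoverable` attribute, and `classify` are hypothetical stand-ins assumed for the example.

```python
# Illustrative sketch only; not the actual REEF/IMRU implementation.

class IMRUTaskAppException(Exception):
    """Failure caused by user code; not recoverable."""
    recoverable = False


class IMRUTaskGroupCommunicationException(Exception):
    """Failure caused by group communications; recoverable."""
    recoverable = True


class IMRUTaskSystemException(Exception):
    """Any other system or transient failure; recoverable."""
    recoverable = True


# Hypothetical stand-ins for the "known" exceptions raised inside a task.
class UserCodeError(Exception):
    pass


class GroupCommunicationError(Exception):
    pass


_CATEGORIES = (
    IMRUTaskAppException,
    IMRUTaskGroupCommunicationException,
    IMRUTaskSystemException,
)


def classify(exc):
    """Map an arbitrary exception to one of the three IMRU task categories."""
    if isinstance(exc, _CATEGORIES):
        return exc  # already classified, pass through unchanged
    if isinstance(exc, UserCodeError):
        wrapped = IMRUTaskAppException(str(exc))
    elif isinstance(exc, GroupCommunicationError):
        wrapped = IMRUTaskGroupCommunicationException(str(exc))
    else:
        # Unknown or new exception types default to the recoverable category.
        wrapped = IMRUTaskSystemException(str(exc))
    wrapped.__cause__ = exc  # keep the original exception for diagnostics
    return wrapped
```

With this shape, a caller would only need to inspect `classify(e).recoverable` to decide whether a retry is worthwhile, and an exception type nobody anticipated still ends up on the recoverable path.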
Issue Links
- Is contained by: REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)