Description
In IMRUDriver, we keep track of the list of failed evaluator ids and use its length to determine how many evaluators have failed. However, if an evaluator fails in the WaitingForEvaluator state, we immediately remove it from this list and request another evaluator, effectively forgetting about the failure. Thus, even if many evaluators fail at this stage, we never hit the MaximumNumberOfEvaluatorFailures limit and keep requesting new evaluators indefinitely.
I think we should just remove this list and replace it with a single counter that is never decremented, with the additional benefit of reduced memory consumption. We only use the values in the list for sanity checks.
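To illustrate the difference, here is a minimal sketch in Python rather than the driver's own language. It is not the IMRUDriver code: the class names, the `on_failure` and `limit_reached` methods, the `MAX_FAILURES` value, and the `>=` comparison against the limit are all assumptions made for illustration. It only models the behavior described above, where a failure in the WaitingForEvaluator state is removed from the list again.

```python
# Hypothetical stand-in for MaximumNumberOfEvaluatorFailures.
MAX_FAILURES = 3


class ListBasedTracker:
    """Models the described behavior: failures are kept as a list of ids."""

    def __init__(self):
        self.failed_ids = []

    def on_failure(self, evaluator_id, waiting_for_evaluator):
        self.failed_ids.append(evaluator_id)
        if waiting_for_evaluator:
            # The id is removed again when a replacement is requested,
            # so this failure no longer counts toward the limit.
            self.failed_ids.remove(evaluator_id)

    def limit_reached(self):
        # Assumption: the limit is hit once the count reaches MAX_FAILURES.
        return len(self.failed_ids) >= MAX_FAILURES


class CounterBasedTracker:
    """Models the proposal: a single counter that is never decremented."""

    def __init__(self):
        self.failure_count = 0

    def on_failure(self, evaluator_id, waiting_for_evaluator):
        # Every failure counts, regardless of the driver state.
        self.failure_count += 1

    def limit_reached(self):
        return self.failure_count >= MAX_FAILURES


def simulate(tracker, failures):
    """Feed `failures` failures, all in the WaitingForEvaluator state."""
    for i in range(failures):
        tracker.on_failure(f"eval-{i}", waiting_for_evaluator=True)
    return tracker.limit_reached()
```

With ten consecutive failures in the waiting state, the list-based tracker never reports the limit as reached, while the counter-based one does.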
Issue Links
- Is contained by: REEF-1223 IMRU Fault Tolerance - restart failed evaluators (Resolved)