Details
- Type: Bug
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- None
- None
- None
Description
There is a theoretical issue in EvaluatorManager which we should elevate to critical if it starts hitting us:
EvaluatorManager joins two streams of information: It receives heartbeats from the Evaluator itself as well as container status events from the resource manager. Usually, the relevant events arrive in the EvaluatorManager in their canonical order:
1. The resource manager indicates container launch.
2. The first heartbeat is received from the Evaluator.
3. The Evaluator sends heartbeats with status messages etc.
4. The Evaluator sends its last heartbeat.
5. The resource manager indicates container exit.
The 4th message might never be sent or received in catastrophic failure scenarios. That is why EvaluatorManager declares a FailedEvaluator when it receives the 5th message for an Evaluator whose last heartbeat still indicated a RUNNING state.
This is where the race condition occurs: If the last heartbeat from an Evaluator arrives after the container exit from the resource manager, the application experiences a FailedEvaluator where a CompletedEvaluator would have been in order.
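The race described above can be reproduced with a small model. This is only a sketch: the class `EvaluatorLifecycleSketch`, its enums, and its method names are hypothetical and do not mirror the real EvaluatorManager code. It assumes the outcome is decided once, at the moment the container exit arrives, based on the last heartbeat seen so far.

```java
/** Hypothetical model of the decision logic; not REEF's actual EvaluatorManager. */
final class EvaluatorLifecycleSketch {
    enum ReportedState { RUNNING, DONE }
    enum Outcome { COMPLETED, FAILED }

    // State carried by the most recent heartbeat from the Evaluator.
    private ReportedState lastReported = ReportedState.RUNNING;
    // Stays null until the resource manager reports the container exit.
    private Outcome outcome;

    /** Stream 1: heartbeat from the Evaluator. */
    void onHeartbeat(ReportedState state) {
        lastReported = state;
    }

    /** Stream 2: container exit from the resource manager. Decides immediately. */
    void onContainerExit() {
        if (outcome == null) {
            outcome = (lastReported == ReportedState.DONE)
                ? Outcome.COMPLETED
                : Outcome.FAILED;
        }
    }

    Outcome outcome() {
        return outcome;
    }
}
```

In this model, calling `onHeartbeat(DONE)` and then `onContainerExit()` yields `COMPLETED`, while the same two events in the opposite order yield `FAILED`, because the late heartbeat can no longer change a decision that was already made.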
A first idea to fix this would be to add a small time window after receiving the container exit before deciding whether this is indeed a failure. Think 100ms or so. That way, we allow for some slack in the arrival of the last heartbeat. The obvious downside of such an approach is that we introduce latency in the cases where the Evaluator really failed. Even worse, we add a magic constant to the code.
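The grace-window idea could look roughly like the following. Again, this is a hedged sketch under assumptions: `GraceWindowSketch`, its boolean `done` flag, and the `CompletableFuture`-based result are all hypothetical names chosen for illustration, not REEF APIs. The window length is a constructor parameter here so that the magic constant is at least visible in one place.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the grace-window idea; not REEF's actual EvaluatorManager. */
final class GraceWindowSketch {
    enum Outcome { COMPLETED, FAILED }

    // Single daemon thread so a pending timer never keeps the JVM alive.
    private static final ScheduledExecutorService TIMER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "grace-window-timer");
            t.setDaemon(true);
            return t;
        });

    private final long graceMillis;
    private final CompletableFuture<Outcome> outcome = new CompletableFuture<>();
    private boolean lastHeartbeatDone = false;
    private boolean containerExited = false;

    GraceWindowSketch(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Heartbeat from the Evaluator; done is true for the final heartbeat. */
    synchronized void onHeartbeat(boolean done) {
        lastHeartbeatDone = done;
        if (done && containerExited) {
            // Late final heartbeat arrived inside the window.
            outcome.complete(Outcome.COMPLETED);
        }
    }

    /** Container exit from the resource manager. */
    synchronized void onContainerExit() {
        containerExited = true;
        if (lastHeartbeatDone) {
            outcome.complete(Outcome.COMPLETED);
        } else {
            // Defer the failure decision to allow for a late last heartbeat.
            TIMER.schedule(this::decideAfterGrace, graceMillis, TimeUnit.MILLISECONDS);
        }
    }

    private synchronized void decideAfterGrace() {
        // No effect if a late heartbeat already completed the future.
        outcome.complete(lastHeartbeatDone ? Outcome.COMPLETED : Outcome.FAILED);
    }

    CompletableFuture<Outcome> outcome() {
        return outcome;
    }
}
```

The sketch makes the trade-off concrete: an Evaluator that really failed is only reported as `FAILED` after `graceMillis` has elapsed, which is exactly the added latency mentioned above.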
This used to be #964
Issue Links
- is related to: REEF-726 Race condition with completed Containers (Resolved)