[REEF-61] Possible race condition in EvaluatorManager - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: REEF-Common
Labels: None

Description

There is a theoretical race condition in EvaluatorManager, which we should elevate to Critical if it starts hitting us in practice:

EvaluatorManager joins two streams of information: it receives heartbeats from the Evaluator itself as well as container status events from the resource manager. Usually, the relevant events arrive at the EvaluatorManager in their canonical order:

1. The resource manager indicates container launch.
2. The first heartbeat is received from the Evaluator.
3. The Evaluator sends heartbeats with status messages etc.
4. The Evaluator sends its last heartbeat.
5. The resource manager indicates container exit.

The 4th message might never be sent or received in catastrophic failure scenarios. That is why EvaluatorManager declares a FailedEvaluator when it receives the 5th message for an Evaluator whose last heartbeat still indicated a RUNNING state.

This is where the race condition occurs: If the last heartbeat from an Evaluator arrives after the container exit from the resource manager, the application experiences a FailedEvaluator where a CompletedEvaluator would have been in order.
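To make the ordering problem concrete, here is a minimal sketch of the decision rule described above. It is not REEF's actual code: the class `EvaluatorRaceSketch`, its `State` and `Outcome` enums, and the method names are hypothetical stand-ins, and the model assumes the verdict is issued once, at the moment the container exit is processed.

```java
/** Hypothetical, simplified model of the decision rule; not REEF's actual EvaluatorManager. */
final class EvaluatorRaceSketch {
  enum State { RUNNING, DONE }
  enum Outcome { COMPLETED_EVALUATOR, FAILED_EVALUATOR }

  private State lastReported = State.RUNNING;
  private Outcome outcome = null; // set once, when the container exit is processed

  /** Heartbeat from the Evaluator carrying its self-reported state. */
  synchronized void onHeartbeat(final State reported) {
    if (outcome == null) {
      lastReported = reported;
    }
    // Otherwise the verdict has already been issued and this heartbeat is too late.
  }

  /** Container exit event from the resource manager (message 5 in the list above). */
  synchronized void onContainerExit() {
    if (outcome == null) {
      outcome = (lastReported == State.DONE)
          ? Outcome.COMPLETED_EVALUATOR
          : Outcome.FAILED_EVALUATOR;
    }
  }

  synchronized Outcome getOutcome() {
    return outcome;
  }

  public static void main(final String[] args) {
    // Canonical order: the last heartbeat arrives before the container exit.
    final EvaluatorRaceSketch canonical = new EvaluatorRaceSketch();
    canonical.onHeartbeat(State.RUNNING);
    canonical.onHeartbeat(State.DONE);
    canonical.onContainerExit();
    System.out.println("canonical order: " + canonical.getOutcome());

    // Raced order: the container exit overtakes the last heartbeat.
    final EvaluatorRaceSketch raced = new EvaluatorRaceSketch();
    raced.onHeartbeat(State.RUNNING);
    raced.onContainerExit();
    raced.onHeartbeat(State.DONE);
    System.out.println("raced order: " + raced.getOutcome());
  }
}
```

In this model the same clean shutdown yields a different outcome depending only on which of the two final messages is processed first.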

A first idea for a fix would be to add a small time window, on the order of 100 ms, after receiving the container exit before deciding whether this is indeed a failure. That would allow some slack in the arrival of the last heartbeat. The obvious downside of this approach is that it introduces latency in the cases where the Evaluator really failed. Even worse, it adds a magic constant to the code.
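The grace-window idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed patch: `GraceWindowSketch`, its `Outcome` enum, the `handler` callback, and the `graceMillis` parameter are all hypothetical names, and the timer is a plain `java.util.concurrent.ScheduledExecutorService` rather than whatever clock or alarm mechanism REEF would actually use.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of the grace-window idea; not REEF's actual code. */
final class GraceWindowSketch {
  enum Outcome { COMPLETED_EVALUATOR, FAILED_EVALUATOR }

  private final ScheduledExecutorService timer;
  private final long graceMillis; // the "magic constant" the description warns about
  private final Consumer<Outcome> handler;
  private boolean doneReported = false;
  private boolean decided = false;
  private ScheduledFuture<?> pendingFailure = null;

  GraceWindowSketch(final ScheduledExecutorService timer, final long graceMillis,
                    final Consumer<Outcome> handler) {
    this.timer = timer;
    this.graceMillis = graceMillis;
    this.handler = handler;
  }

  /** The last heartbeat, in which the Evaluator reports a clean shutdown. */
  synchronized void onFinalHeartbeat() {
    doneReported = true;
    if (pendingFailure != null) {
      // The container exit overtook this heartbeat, but the window is still open.
      pendingFailure.cancel(false);
      decide(Outcome.COMPLETED_EVALUATOR);
    }
  }

  /** Container exit event from the resource manager. */
  synchronized void onContainerExit() {
    if (doneReported) {
      decide(Outcome.COMPLETED_EVALUATOR);
    } else if (pendingFailure == null) {
      // Defer the failure verdict to leave slack for a late final heartbeat.
      final Runnable expire = this::onGraceExpired;
      pendingFailure = timer.schedule(expire, graceMillis, TimeUnit.MILLISECONDS);
    }
  }

  private synchronized void onGraceExpired() {
    decide(Outcome.FAILED_EVALUATOR);
  }

  /** Issues the verdict at most once; callers hold the monitor. */
  private void decide(final Outcome outcome) {
    if (!decided) {
      decided = true;
      handler.accept(outcome);
    }
  }

  public static void main(final String[] args) {
    final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    final GraceWindowSketch manager =
        new GraceWindowSketch(timer, 100, o -> System.out.println("Outcome: " + o));
    manager.onContainerExit();  // exit arrives first
    manager.onFinalHeartbeat(); // late heartbeat, still inside the window
    timer.shutdown();
  }
}
```

The sketch also makes both downsides visible: a genuine failure is reported only after `graceMillis` has elapsed, and the right value for `graceMillis` remains a guess.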

This used to be #964

Issue Links

is related to: REEF-726 Race condition with completed Containers (Resolved)

People

Assignee: Unassigned

Reporter: Markus Weimer

Dates

Created: 03/Dec/14 18:41

Updated: 09/Sep/15 21:16