Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
There is a race condition in REEF-Local-Runtime, and it can happen as follows:
- The Evaluator sends the DONE message and exits its process.
- The RM detects that the Evaluator process has ended and sends a DONE message to the Driver.
- The Driver receives the DONE message from the RM before it reads the Evaluator's DONE message from its network queue.
- The Driver calls FailedEvaluatorHandler, even though the Evaluator shut down properly.
This can be fixed by requiring an ACK from the Driver prior to letting the Evaluator exit its process.
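The ordering problem and the effect of the proposed ACK can be illustrated with a small deterministic model. This is only a sketch and not REEF code: the class `DoneRaceSketch`, the enums, and the methods `classify` and `driverObservedOrder` are hypothetical names introduced here, and the model assumes the Driver treats whichever terminal event it handles first as authoritative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the Driver's view of one Evaluator; not REEF's actual classes. */
final class DoneRaceSketch {

  /** Events the Driver can observe for an Evaluator (illustrative names). */
  enum Event { EVALUATOR_DONE, RM_PROCESS_EXITED }

  /** What the Driver ends up reporting for the Evaluator. */
  enum Outcome { RUNNING, COMPLETED, FAILED }

  /**
   * Assumption: the first terminal event the Driver handles decides the outcome.
   * An RM exit notification handled before the Evaluator's own DONE looks like a
   * failure, because the Driver has not yet seen evidence of a clean shutdown.
   */
  static Outcome classify(List<Event> arrivalOrder) {
    for (Event e : arrivalOrder) {
      if (e == Event.EVALUATOR_DONE) {
        return Outcome.COMPLETED;
      }
      if (e == Event.RM_PROCESS_EXITED) {
        return Outcome.FAILED;
      }
    }
    return Outcome.RUNNING;
  }

  /**
   * Order in which the Driver handles the two events.
   * Without an ACK the Evaluator exits right after sending DONE, so the RM
   * notification may overtake DONE when the RM path is faster.
   * With an ACK the Evaluator exits only after the Driver has handled DONE,
   * so the RM notification can only come afterwards.
   */
  static List<Event> driverObservedOrder(boolean requireAck, boolean rmPathIsFaster) {
    List<Event> order = new ArrayList<>();
    if (!requireAck && rmPathIsFaster) {
      order.add(Event.RM_PROCESS_EXITED);
      order.add(Event.EVALUATOR_DONE);
    } else {
      order.add(Event.EVALUATOR_DONE);
      order.add(Event.RM_PROCESS_EXITED);
    }
    return order;
  }

  public static void main(String[] args) {
    // prints "without ACK: FAILED"
    System.out.println("without ACK: " + classify(driverObservedOrder(false, true)));
    // prints "with ACK: COMPLETED"
    System.out.println("with ACK: " + classify(driverObservedOrder(true, true)));
  }
}
```

In this model the ACK does not make either channel faster. It removes the bad interleaving by making the Evaluator's exit, and therefore the RM notification, causally dependent on the Driver having handled DONE.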
Issue Links
- is related to
  - REEF-1310 The Java Driver should ACK the Java Evaluator's DONE heartbeat (Resolved)
- relates to
  - REEF-347 Configure .NET tests to only listen on 127.0.0.1 (Resolved)
- supercedes
  - REEF-1302 BroadcastReduceTest fail intermitently (Resolved)
  - REEF-977 Fix TestBroadcastAndReduceOnLocalRuntime (Closed)
  - REEF-978 Fix PipelinedBroadcastReduceTest (Closed)
  - REEF-1291 Driver gets FailedEvaluator message even if evaluator shuts down properly (Resolved)
- links to
  1. Fix TestBroadcastAndReduceOnLocalRuntime | Closed | Unassigned
  2. Fix PipelinedBroadcastReduceTest | Closed | Unassigned