Description
During the pull request review of REEF-1312, we noticed a rare race condition during Evaluator shutdown. It surfaced in one out of 11 runs of the following test:
Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime
Expected number of contexts to close (4) differs from actual number of success indicators (8)
Expected: True
Actual: False
   at Org.Apache.REEF.Tests.Functional.ReefFunctionalTest.ValidateSuccessForLocalRuntime(Int32 numberOfContextsToClose, Int32 numberOfTasksToFail, Int32 numberOfEvaluatorsToFail, String testFolder) in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\ReefFunctionalTest.cs:line 179
   at Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime() in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\IMRUMapperCountTest.cs:line 38
The root cause of the test failure has been traced to the Evaluator ending up in a failed state after receiving a control message from the Driver once it was already done:
Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Error: 0 : 2016-04-13T10:24:03.5192372-07:00 0007 ERROR: evaluator Node-1-1460568240750 failed with exceptionencountered error [System.InvalidOperationException: Received a control message from Driver after Evaluator is done.] with mesage [Received a control message from Driver after Evaluator is done.] and stack trace []
Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-04-13T10:24:03.5197049-07:00 0007 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[FAILED], context_status=[], timestamp=[1460568243519], recoveryFlag =[False].
The complete runtime folder is available for download here
Issue Links
- is related to REEF-1312 Convert IMRU.Examples to test (Resolved)
- links to