REEF (Retired) / REEF-1338

Race condition in Evaluator shutdown

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.15
    • Component/s: REEF.NET
    • Labels: None

    Description

      During the pull request review of REEF-1312, we noticed a rare race condition during Evaluator shutdown. It surfaced in 1 of 11 runs of the following test:

      Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime
      
      Expected number of contexts to close (4) differs from actual number of success indicators (8)
      Expected: True
      Actual:   False
      
         at Org.Apache.REEF.Tests.Functional.ReefFunctionalTest.ValidateSuccessForLocalRuntime(Int32 numberOfContextsToClose, Int32 numberOfTasksToFail, Int32 numberOfEvaluatorsToFail, String testFolder) in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\ReefFunctionalTest.cs:line 179
         at Org.Apache.REEF.Tests.Functional.IMRU.IMRUMapperCountTest.TestIMRUMapperCountOnLocalRuntime() in D:\src\reef\lang\cs\Org.Apache.REEF.Tests\Functional\IMRU\IMRUMapperCountTest.cs:line 38
      

      The root cause of the test failure was traced to the Evaluator being in a bad state:

      Org.Apache.REEF.Common.Runtime.Evaluator.EvaluatorRuntime Error: 0 : 2016-04-13T10:24:03.5192372-07:00 0007 ERROR: evaluator Node-1-1460568240750 failed with exceptionencountered error [System.InvalidOperationException: Received a control message from Driver after Evaluator is done.] with mesage [Received a control message from Driver after Evaluator is done.] and stack trace []
      Org.Apache.REEF.Common.Runtime.Evaluator.HeartBeatManager Information: 0 : 2016-04-13T10:24:03.5197049-07:00 0007 INFO: Triggered a heartbeat: EvaluatorHeartbeatProto: task_id=[], task_status=[], task_message=[], evaluator_status=[FAILED], context_status=[], timestamp=[1460568243519], recoveryFlag =[False].
      

      The complete runtime folder is available for download here

          People

            afchung90 Andrew Chung
            markus.weimer Markus Weimer