REEF-1981: Evaluators fail to heartbeat to restarted driver


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: REEF Evaluator
    • Labels: None

    Description

      On driver failover, we are hitting the following exception:

      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
      WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
      WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
      org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
       at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
       at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
       at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
       at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
       at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
       at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
       at java.util.TimerThread.mainLoop(Timer.java:555)
       at java.util.TimerThread.run(Timer.java:505)
      

      However, according to the YARN ResourceManager (RM) logs, these containers had not failed at that point. We suspect that the evaluators are failing to heartbeat into the new driver.
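
      To make the suspected failure mode concrete: the stack trace shows the failure being reported from a java.util.Timer thread via DriverRestartManager.onDriverRestartCompleted, which suggests that evaluators are declared failed when a restart window closes. The sketch below is a simplified, hypothetical model of that behavior, not REEF code. The class RestartWindowSketch and its methods are invented for illustration, and the assumption that "no heartbeat before the timer fires" is what triggers the FAILED report is our reading of the trace, not something confirmed in the REEF source.

      ```java
      import java.util.Arrays;
      import java.util.Collection;
      import java.util.Set;
      import java.util.TreeSet;

      /** Hypothetical model of a driver-restart recovery window; NOT REEF code. */
      public class RestartWindowSketch {
        // Evaluators from the previous driver attempt that have not yet heartbeated.
        private final Set<String> pending = new TreeSet<>();
        private boolean windowClosed = false;

        public RestartWindowSketch(final Collection<String> expectedEvaluatorIds) {
          pending.addAll(expectedEvaluatorIds);
        }

        /** An evaluator heartbeats into the new driver; true if it counts as recovered. */
        public synchronized boolean onHeartbeat(final String evaluatorId) {
          if (windowClosed) {
            return false; // too late: it was already reported as failed
          }
          return pending.remove(evaluatorId);
        }

        /** Called once when the restart timer fires; returns the evaluators reported as failed. */
        public synchronized Set<String> closeWindow() {
          windowClosed = true;
          final Set<String> failed = new TreeSet<>(pending);
          pending.clear();
          return failed;
        }

        public static void main(final String[] args) {
          final RestartWindowSketch window =
              new RestartWindowSketch(Arrays.asList("container_A", "container_B"));
          window.onHeartbeat("container_A");   // A reaches the new driver in time
          final Set<String> failed = window.closeWindow();
          // B is reported as failed although its container may still be running.
          System.out.println("Reported as failed: " + failed);
          System.out.println("Late heartbeat accepted: " + window.onHeartbeat("container_B"));
        }
      }
      ```

      In this model, container_B is reported as failed purely because its heartbeat never arrived, regardless of what the RM knows about the container. That would be consistent with the RM logs showing the containers still alive, but it remains a hypothesis until the evaluator-side heartbeat logs are checked.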

          People

            Assignee: Unassigned
            Reporter: Sean Po (seanpo03)
