REEF (Retired) / REEF-1981

Evaluators fail to heartbeat to restarted driver

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: REEF Evaluator
    • Labels: None

    Description

      On driver failover, we are hitting the following exception:

      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
      WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
      WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
      org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
       at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
       at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
       at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
       at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
       at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
       at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
       at java.util.TimerThread.mainLoop(Timer.java:555)
       at java.util.TimerThread.run(Timer.java:505)
      

      However, according to the YARN ResourceManager (RM) logs, these containers had not failed at that time. We suspect that the evaluators are failing to heartbeat into the new Driver.
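
To make the suspected failure mode concrete: the stack trace shows the failure being raised from a timer thread at the end of the restart process (DriverRestartManager$1.run, then onDriverRestartCompleted, then informAboutEvaluatorFailures). The following is a minimal sketch, assuming that any container which has not heartbeated by the time that timer fires is the one reported as failed. The class and method names (RestartWindowSketch, onHeartbeat, closeWindow) are hypothetical and are not REEF APIs; this is an illustration of the hypothesis, not the actual DriverRestartManager logic.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of a driver-restart window; not REEF's implementation. */
final class RestartWindowSketch {
  // Containers the previous driver attempt knew about (assumed input).
  private final Set<String> expected;
  // Containers that have heartbeated into the restarted driver so far.
  private final Set<String> reported = new LinkedHashSet<>();
  private boolean closed = false;

  RestartWindowSketch(Set<String> expectedContainerIds) {
    this.expected = new LinkedHashSet<>(expectedContainerIds);
  }

  /** Records a heartbeat; returns false if it arrives late or from an unknown container. */
  synchronized boolean onHeartbeat(String containerId) {
    if (closed || !expected.contains(containerId)) {
      return false;
    }
    reported.add(containerId);
    return true;
  }

  /** Called once when the restart timer fires; returns containers that never reported. */
  synchronized Set<String> closeWindow() {
    closed = true;
    Set<String> missing = new LinkedHashSet<>(expected);
    missing.removeAll(reported);
    return Collections.unmodifiableSet(missing);
  }
}
```

Under this assumption, a container that YARN still considers running would end up in the missing set if its heartbeat reached the new Driver late or not at all, which would be consistent with the RM logs showing no container failure.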

          People

            Assignee: Unassigned
            Reporter: Sean Po (seanpo03)