REEF-1981

Evaluators fail to heartbeat to restarted driver

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: REEF Evaluator
    • Labels: None

      Description

      On driver failover, we are hitting the following exception:

      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager informAboutEvaluatorFailures
      WARNING: Container [container_e4838_1519690816115_0025_01_000005] has failed during driver restart process, FailedEvaluatorHandler will be triggered, but no additional evaluator can be requested due to YARN-2433.
      Feb 28, 2018 1:43:23 PM org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager onEvaluatorException
      WARNING: Failed evaluator: container_e4838_1519690816115_0025_01_000005
      org.apache.reef.exception.EvaluatorException: Evaluator [container_e4838_1519690816115_0025_01_000005] is assumed to be in state [ALLOCATED]. But the resource manager reports it to be in state [FAILED]. This most likely means that the Evaluator suffered a failure before being used.
       at org.apache.reef.runtime.common.driver.evaluator.EvaluatorManager.onResourceStatusMessage(EvaluatorManager.java:693)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:91)
       at org.apache.reef.runtime.common.driver.resourcemanager.ResourceStatusHandler.onNext(ResourceStatusHandler.java:38)
       at org.apache.reef.runtime.yarn.driver.REEFEventHandlers.onResourceStatus(REEFEventHandlers.java:93)
       at org.apache.reef.runtime.yarn.driver.YarnDriverRuntimeRestartManager.informAboutEvaluatorFailures(YarnDriverRuntimeRestartManager.java:230)
       at org.apache.reef.driver.restart.DriverRestartManager.onDriverRestartCompleted(DriverRestartManager.java:282)
       at org.apache.reef.driver.restart.DriverRestartManager.access$000(DriverRestartManager.java:47)
       at org.apache.reef.driver.restart.DriverRestartManager$1.run(DriverRestartManager.java:136)
       at java.util.TimerThread.mainLoop(Timer.java:555)
       at java.util.TimerThread.run(Timer.java:505)
      

      However, according to the YARN ResourceManager logs, these containers had not failed at that time. We suspect that the evaluators are failing to heartbeat to the restarted driver.
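
      To make the suspected failure mode concrete, here is a minimal sketch of the recovery logic that the stack trace hints at: a timer fires after the driver restarts, and every evaluator that has not reported back by then is treated as failed. This is a toy model written under that assumption, not REEF's actual implementation; the class name, method names, and evaluator IDs are all hypothetical.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of evaluator recovery after a driver restart. NOT REEF's actual API. */
public class RestartRecoverySketch {
    // Evaluators the previous driver attempt knew about (hypothetical bookkeeping).
    private final Set<String> expected = new TreeSet<>();
    // Evaluators whose heartbeat has reached the restarted driver.
    private final Set<String> recovered = new TreeSet<>();

    public RestartRecoverySketch(Set<String> expectedEvaluatorIds) {
        expected.addAll(expectedEvaluatorIds);
    }

    /** Called when an evaluator heartbeat reaches the new driver. */
    public void onHeartbeat(String evaluatorId) {
        if (expected.contains(evaluatorId)) {
            recovered.add(evaluatorId);
        }
    }

    /** Called when the recovery window expires: anything not recovered is presumed failed. */
    public Set<String> onRecoveryWindowExpired() {
        Set<String> presumedFailed = new TreeSet<>(expected);
        presumedFailed.removeAll(recovered);
        return presumedFailed;
    }

    public static void main(String[] args) {
        RestartRecoverySketch driver = new RestartRecoverySketch(
            new TreeSet<>(Arrays.asList("evaluator-A", "evaluator-B")));
        // Only evaluator-A manages to heartbeat to the restarted driver.
        driver.onHeartbeat("evaluator-A");
        // evaluator-B is reported as failed even if its container is still running.
        System.out.println(driver.onRecoveryWindowExpired());
    }
}
```

      In this model, a container that is healthy from YARN's point of view is still reported as failed if its heartbeat never reaches the new driver, which would be consistent with the mismatch between the driver log and the ResourceManager logs described above.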

            People

            • Assignee: Unassigned
            • Reporter: Sean Po (seanpo03)