REEF-563 (sub-task of REEF-345: Complete implementation for YARN AM HA)

Evaluators that are kept alive are not able to re-register with the driver


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13
    • Component/s: None

    Description

      This stems from 3 issues:
      1. The queryUri used to find the driver HTTP endpoint is missing a '/' after the application ID.
      2. YARN currently redirects the proxy call to the primary RM via a meta-refresh, which our recovery mechanism does not handle because it assumes there are no redirects. See YARN-2605.
      3. The driver does not recognize the evaluator trying to contact it and throws the following exception:

      java.lang.RuntimeException: Contact from unknown Evaluator with identifier 'container_e02_1438245500443_0040_01_000004' with state 'RUNNING'
      	at org.apache.reef.runtime.common.driver.evaluator.EvaluatorHeartbeatHandler.onNext(EvaluatorHeartbeatHandler.java:72)
      	at org.apache.reef.runtime.common.driver.evaluator.EvaluatorHeartbeatHandler.onNext(EvaluatorHeartbeatHandler.java:36)
      	at org.apache.reef.wake.remote.impl.HandlerContainer.onNext(HandlerContainer.java:146)
      	at org.apache.reef.wake.remote.impl.HandlerContainer.onNext(HandlerContainer.java:37)
      	at org.apache.reef.wake.remote.impl.OrderedPullEventHandler.onNext(OrderedRemoteReceiverStage.java:171)
      	at org.apache.reef.wake.remote.impl.OrderedPullEventHandler.onNext(OrderedRemoteReceiverStage.java:152)
      	at org.apache.reef.wake.impl.ThreadPoolStage$1.run(ThreadPoolStage.java:181)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      	at java.lang.Thread.run(Thread.java:745)
      

      Item 3 will be addressed as part of REEF-560 rather than in this issue.
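
Items 1 and 2 can be illustrated with a minimal sketch. It is not the actual REEF fix: the class name `DriverEndpointSketch`, both method names, the host names, and the `driver-endpoint` path are hypothetical, and the application ID is only an example value modeled on the container ID in the trace. The sketch assumes the proxy URI has the form `<proxy base>/<application ID>/<endpoint path>` and that the redirect arrives as an HTML `<meta http-equiv="refresh" ...>` tag carrying the target in its `url=` field.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of REEF; names and URL layout are assumptions. */
final class DriverEndpointSketch {

  // Matches <meta http-equiv="refresh" content="N; url=TARGET"> case-insensitively
  // and captures TARGET. The exact markup YARN emits is assumed, not verified.
  private static final Pattern META_REFRESH = Pattern.compile(
      "<meta\\s+http-equiv\\s*=\\s*[\"']?refresh[\"']?\\s+content\\s*=\\s*[\"']"
          + "\\s*\\d+\\s*;\\s*url\\s*=\\s*([^\"'>\\s]+)",
      Pattern.CASE_INSENSITIVE);

  private DriverEndpointSketch() {
  }

  /**
   * Item 1: joins the pieces so that exactly one '/' separates the proxy base,
   * the application ID, and the endpoint path.
   */
  static String buildQueryUri(final String proxyBase,
                              final String applicationId,
                              final String endpointPath) {
    final String base = proxyBase.endsWith("/") ? proxyBase : proxyBase + "/";
    final String path = endpointPath.startsWith("/") ? endpointPath.substring(1) : endpointPath;
    return base + applicationId + "/" + path;
  }

  /**
   * Item 2: returns the meta-refresh target found in an HTML body, or an empty
   * Optional when the body contains no such redirect.
   */
  static Optional<String> metaRefreshTarget(final String html) {
    final Matcher matcher = META_REFRESH.matcher(html);
    return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
  }

  public static void main(final String[] args) {
    // Example values only; none of these come from a real cluster.
    final String uri = buildQueryUri(
        "http://rm1.example.com:8088/proxy", "application_1438245500443_0040", "driver-endpoint");
    System.out.println(uri);

    final String body = "<html><head><meta http-equiv=\"refresh\" content=\"3; "
        + "url=http://rm2.example.com:8088/proxy/application_1438245500443_0040/\"></head></html>";
    System.out.println(metaRefreshTarget(body).orElse("no redirect"));
  }
}
```

A caller would fetch the query URI, check the response body with `metaRefreshTarget`, and repeat the request against the returned target if one is present. A real client should also cap the number of redirects it follows.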

            People

              Assignee: afchung90 Andrew Chung
              Reporter: afchung90 Andrew Chung
              Votes: 0
              Watchers: 1
