  Spark / SPARK-9439

ExternalShuffleService should be robust to NodeManager restarts in yarn

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.6.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      YARN applications should be robust to NodeManager (NM) restarts in general. However, with the external shuffle service enabled, you will see failures like the following after an NM restart:

      2015-07-22 18:30:18,212 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 5405054848584757735
      java.lang.RuntimeException: Executor is not registered (appId=application_1437612356649_0008, execId=73)
              at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.getBlockData(ExternalShuffleBlockManager.java:105)
      ...
      

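      This happens because when the NM restarts (and restarts the ExternalShuffleService with it), nothing calls ExternalShuffleBlockResolver#registerExecutor again, so the restarted service has no record of the executors that registered before the restart.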
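
      One way to picture the failure, and a possible remedy, is a registry that writes each registration to disk and reloads it on startup. The sketch below is only an illustration under stated assumptions: `PersistentExecutorRegistry`, its methods, and the properties-file format are hypothetical names invented for this example, not Spark's ExternalShuffleBlockResolver API and not necessarily how the issue was resolved.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical illustration only; this is NOT Spark's ExternalShuffleBlockResolver.
// It shows how registrations could survive a process restart by being persisted.
class PersistentExecutorRegistry {
    private final Path stateFile;
    private final Properties executors = new Properties();

    // On startup (including after a restart), reload any registrations saved earlier.
    PersistentExecutorRegistry(Path stateFile) throws IOException {
        this.stateFile = stateFile;
        if (Files.exists(stateFile)) {
            try (InputStream in = Files.newInputStream(stateFile)) {
                executors.load(in);
            }
        }
    }

    // Record the executor in memory and write the full registry back to disk.
    synchronized void registerExecutor(String appId, String execId, String localDir)
            throws IOException {
        executors.setProperty(key(appId, execId), localDir);
        try (OutputStream out = Files.newOutputStream(stateFile)) {
            executors.store(out, "registered executors");
        }
    }

    // Look up an executor; an unknown executor fails like the error in the log above.
    synchronized String getLocalDir(String appId, String execId) {
        String dir = executors.getProperty(key(appId, execId));
        if (dir == null) {
            throw new RuntimeException(
                "Executor is not registered (appId=" + appId + ", execId=" + execId + ")");
        }
        return dir;
    }

    private static String key(String appId, String execId) {
        return appId + "/" + execId;
    }
}
```

      Constructing a second `PersistentExecutorRegistry` on the same file stands in for the NM restart: executors registered through the first instance remain resolvable through the second. A purely in-memory map would instead throw the "Executor is not registered" error for every executor that registered before the restart.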

            People

              Assignee: Imran Rashid (irashid)
              Reporter: Imran Rashid (irashid)
              Votes: 0
              Watchers: 6
