  Spark / SPARK-9439

ExternalShuffleService should be robust to NodeManager restarts in yarn

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.1
    • Fix Version/s: 1.6.0
    • Component/s: Shuffle
    • Labels: None

      Description

      YARN applications should be robust to NodeManager (NM) restarts in general. However, if the external shuffle service is enabled, then after an NM restart you will observe failures like:

      2015-07-22 18:30:18,212 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 5405054848584757735
      java.lang.RuntimeException: Executor is not registered (appId=application_1437612356649_0008, execId=73)
              at org.apache.spark.network.shuffle.ExternalShuffleBlockManager.getBlockData(ExternalShuffleBlockManager.java:105)
      ...
      

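      This happens because when the NM restarts (and restarts the ExternalShuffleService with it), nothing calls ExternalShuffleBlockResolver#registerExecutor again, so the restarted service has no record of the executors that registered before the restart.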
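
The failure mode described above can be illustrated with a minimal Java sketch. `ShuffleRegistrySketch`, its `stateFile` parameter, and the properties-file persistence are hypothetical names chosen for illustration; this is not Spark's code and not necessarily how the issue was fixed. It assumes registrations are held only in memory, so a restarted service starts with an empty registry, and it shows one possible way to carry registrations across a restart.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the shuffle service's executor registry; not Spark code. */
public class ShuffleRegistrySketch {
    // "appId/execId" -> executor's local shuffle directory. Held in memory.
    private final Map<String, String> executors = new ConcurrentHashMap<>();
    // Optional file used to persist registrations; null means in-memory only.
    private final Path stateFile;

    public ShuffleRegistrySketch(Path stateFile) throws IOException {
        this.stateFile = stateFile;
        if (stateFile != null && Files.exists(stateFile)) {
            // Reload registrations written by a previous instance.
            Properties saved = new Properties();
            try (InputStream in = Files.newInputStream(stateFile)) {
                saved.load(in);
            }
            for (String key : saved.stringPropertyNames()) {
                executors.put(key, saved.getProperty(key));
            }
        }
    }

    public void registerExecutor(String appId, String execId, String localDir)
            throws IOException {
        executors.put(appId + "/" + execId, localDir);
        if (stateFile != null) {
            // Assumption: rewriting the whole file on each registration is acceptable.
            Properties toSave = new Properties();
            toSave.putAll(executors);
            try (OutputStream out = Files.newOutputStream(stateFile)) {
                toSave.store(out, "registered executors");
            }
        }
    }

    public String getBlockData(String appId, String execId, String blockId) {
        String localDir = executors.get(appId + "/" + execId);
        if (localDir == null) {
            // Same message shape as the error in the log above.
            throw new RuntimeException(String.format(
                "Executor is not registered (appId=%s, execId=%s)", appId, execId));
        }
        return localDir + "/" + blockId;
    }

    public static void main(String[] args) throws IOException {
        String app = "application_1437612356649_0008";

        // In-memory only: a "restart" (a fresh instance) forgets executor 73.
        new ShuffleRegistrySketch(null).registerExecutor(app, "73", "/tmp/shuffle");
        ShuffleRegistrySketch restarted = new ShuffleRegistrySketch(null);
        try {
            restarted.getBlockData(app, "73", "shuffle_0_0_0");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }

        // Persisted: a fresh instance reloads the registration from the file.
        Path state = Files.createTempFile("registry", ".properties");
        new ShuffleRegistrySketch(state).registerExecutor(app, "73", "/tmp/shuffle");
        System.out.println(
            new ShuffleRegistrySketch(state).getBlockData(app, "73", "shuffle_0_0_0"));
    }
}
```

In this sketch the first lookup fails with the "Executor is not registered" message because the new instance never saw the earlier `registerExecutor` call, while the second lookup succeeds because the registration was reloaded from the state file.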

              People

              • Assignee: Imran Rashid (irashid)
              • Reporter: Imran Rashid (irashid)
              • Votes: 0
              • Watchers: 6