SPARK-16505

YARN shuffle service should throw errors when it fails to start

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: YARN
    • Labels:
      None

      Description

      Right now the YARN shuffle service will swallow errors that happen during startup and just log them:

          try {
            blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
          } catch (Exception e) {
            logger.error("Failed to initialize external shuffle service", e);
          }
      

      This causes two undesirable things to happen:

      • Because blockHandler remains null when an error happens, every request to the shuffle service causes a NullPointerException (NPE).
      • Because the NodeManager (NM) is still running, containers may be assigned to that host, only to fail when they try to register with the shuffle service.
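
      The difference between swallowing the startup error and failing fast can be sketched in isolation. This is a minimal illustration, not the actual patch for this issue: BlockHandler, initSwallowing and initFailFast are hypothetical stand-ins and none of them are Spark or YARN classes.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only; these are not Spark classes.
final class ShuffleServiceStartupSketch {

    /** Stand-in for a handler whose constructor can fail during startup. */
    static final class BlockHandler {
        BlockHandler(boolean failOnInit) throws IOException {
            if (failOnInit) {
                throw new IOException("simulated failure while loading executor state");
            }
        }

        String handle(String request) {
            return "ok:" + request;
        }
    }

    /** Mirrors the reported behaviour: log the error and continue with a null handler. */
    static BlockHandler initSwallowing(boolean failOnInit) {
        BlockHandler handler = null;
        try {
            handler = new BlockHandler(failOnInit);
        } catch (Exception e) {
            System.err.println("Failed to initialize external shuffle service: " + e);
        }
        return handler;
    }

    /** Fail-fast alternative: rethrow so the caller learns that startup failed. */
    static BlockHandler initFailFast(boolean failOnInit) {
        try {
            return new BlockHandler(failOnInit);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to initialize external shuffle service", e);
        }
    }

    public static void main(String[] args) {
        // Swallowing: startup appears to succeed, but the first request dereferences null.
        BlockHandler handler = initSwallowing(true);
        try {
            handler.handle("fetch");
        } catch (NullPointerException npe) {
            System.out.println("swallowing: first request failed with an NPE");
        }

        // Fail-fast: the error surfaces at startup, before any request is served.
        try {
            initFailFast(true);
        } catch (IllegalStateException e) {
            System.out.println("fail-fast: startup aborted: " + e.getCause().getMessage());
        }
    }
}
```

      With the fail-fast variant the error surfaces where the service is initialized, so the hosting process can react to it instead of accepting work it cannot serve. Whether the host should then stop or keep running without the service is a separate policy decision that this sketch does not address.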

      Example of the first:

      2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
      java.lang.NullPointerException
      	at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
      	at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
      	at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
      	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
      	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
      	at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
      

      Example of the second:

      16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local external shuffle service.
      16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
      java.nio.channels.ClosedChannelException
      16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
      java.lang.RuntimeException: java.io.IOException: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
      	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
      	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
      

            People

            • Assignee:
              Marcelo Vanzin (vanzin)
            • Reporter:
              Marcelo Vanzin (vanzin)