Spark / SPARK-16505

YARN shuffle service should throw errors when it fails to start


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      Right now the YARN shuffle service will swallow errors that happen during startup and just log them:

          try {
            blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
          } catch (Exception e) {
            logger.error("Failed to initialize external shuffle service", e);
          }
      

      This causes two undesirable things to happen:

      • because blockHandler remains null when an error happens, every request to the shuffle service causes an NPE (NullPointerException).
      • because the NodeManager (NM) keeps running, containers may still be assigned to that host, only to fail to register with the shuffle service.

      Example of the first:

      2016-05-25 15:01:12,198  ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
      java.lang.NullPointerException
      	at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
      	at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
      	at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
      	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
      	at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
      	at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
      

      Example of the second:

      16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local external shuffle service.
      16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
      java.nio.channels.ClosedChannelException
      16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
      java.lang.RuntimeException: java.io.IOException: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
      	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
      	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
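
One possible shape for the requested behavior is sketched below in plain Java. It is an illustration under stated assumptions, not the change that resolved this issue: `FailFastInitSketch`, its nested `BlockHandler` stand-in, the `stopOnFailure` switch, and the `simulateFailure` parameter are all hypothetical names introduced here. The idea is that the catch block rethrows when the switch is on, so the hosting process sees the startup failure instead of continuing with a null handler.

```java
import java.io.IOException;

/** Sketch of an init path that either swallows or propagates startup errors. */
class FailFastInitSketch {

    /** Hypothetical stand-in for the real block handler; not Spark's class. */
    static final class BlockHandler {
        BlockHandler(boolean simulateFailure) throws IOException {
            if (simulateFailure) {
                throw new IOException("simulated failure during handler setup");
            }
        }
    }

    // Hypothetical switch: when true, startup errors are rethrown.
    private final boolean stopOnFailure;
    private BlockHandler blockHandler;

    FailFastInitSketch(boolean stopOnFailure) {
        this.stopOnFailure = stopOnFailure;
    }

    void init(boolean simulateFailure) {
        try {
            blockHandler = new BlockHandler(simulateFailure);
        } catch (Exception e) {
            if (stopOnFailure) {
                // Propagate so the caller can refuse to start the service.
                throw new IllegalStateException(
                    "Failed to initialize external shuffle service", e);
            }
            // Behavior described in the issue: log and carry on with a null handler.
            System.err.println("Failed to initialize external shuffle service: " + e);
        }
    }

    /** False when init swallowed an error and left the handler unset. */
    boolean isUsable() {
        return blockHandler != null;
    }

    public static void main(String[] args) {
        FailFastInitSketch lenient = new FailFastInitSketch(false);
        lenient.init(true);
        System.out.println("lenient usable: " + lenient.isUsable());

        FailFastInitSketch strict = new FailFastInitSketch(true);
        try {
            strict.init(true);
        } catch (IllegalStateException e) {
            System.out.println("strict init failed: " + e.getMessage());
        }
    }
}
```

With the switch off, the sketch reproduces the situation in the first example: the object exists but its handler is null, so later requests would fail. With the switch on, the error surfaces at startup, which would give the NM a chance to fail before containers are assigned to the host.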
      

          People

            Assignee: Marcelo Masiero Vanzin (vanzin)
            Reporter: Marcelo Masiero Vanzin (vanzin)