Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
2.0.0
None
Description
Right now the YARN shuffle service will swallow errors that happen during startup and just log them:
    try {
      blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile);
    } catch (Exception e) {
      logger.error("Failed to initialize external shuffle service", e);
    }
This causes two undesirable things to happen:
- because blockHandler remains null after a failed initialization, every request to the shuffle service causes a NullPointerException (NPE)
- because the NodeManager (NM) keeps running, containers may still be assigned to that host, only to fail to register with the shuffle service.
Example of the first:
    2016-05-25 15:01:12,198 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
    java.lang.NullPointerException
        at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
        at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
        at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
        at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
        at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
        at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
Example of the second:
    16/05/25 15:01:12 INFO storage.BlockManager: Registering executor with local external shuffle service.
    16/05/25 15:01:12 ERROR client.TransportClient: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
    java.nio.channels.ClosedChannelException
    16/05/25 15:01:12 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
    java.lang.RuntimeException: java.io.IOException: Failed to send RPC 5736508221708472525 to qxhddn01.ascap.com/10.6.41.31:7337: java.nio.channels.ClosedChannelException
        at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
        at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:272)
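The difference between swallowing the startup error and failing fast can be illustrated with a small standalone sketch. It does not use Spark or YARN classes: DemoBlockHandler, DemoShuffleService, initSwallowing, initFailFast, and serve are all hypothetical names invented for this illustration, and the fail-fast variant is only one possible approach, not necessarily the change that resolved this issue.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical stand-in for the real block handler; not a Spark class.
class DemoBlockHandler {
    DemoBlockHandler(File registeredExecutorFile) throws IOException {
        // Simulate a startup failure when the registry file is unavailable.
        if (registeredExecutorFile == null) {
            throw new IOException("cannot open executor registry");
        }
    }

    String handle(String request) {
        return "ok:" + request;
    }
}

// Hypothetical service with two initialization strategies.
class DemoShuffleService {
    private DemoBlockHandler blockHandler;

    // Pattern from the description: the error is only logged, so
    // initialization appears to succeed and blockHandler stays null.
    void initSwallowing(File registeredExecutorFile) {
        try {
            blockHandler = new DemoBlockHandler(registeredExecutorFile);
        } catch (Exception e) {
            System.err.println("Failed to initialize external shuffle service: " + e);
        }
    }

    // Fail-fast alternative (an assumption for illustration): the error is
    // rethrown so the caller learns at startup that the service is unusable.
    void initFailFast(File registeredExecutorFile) {
        try {
            blockHandler = new DemoBlockHandler(registeredExecutorFile);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to initialize external shuffle service", e);
        }
    }

    String serve(String request) {
        // Throws NullPointerException if initialization was swallowed.
        return blockHandler.handle(request);
    }
}

class ShuffleInitDemo {
    public static void main(String[] args) {
        DemoShuffleService swallowing = new DemoShuffleService();
        swallowing.initSwallowing(null); // failure is logged, not reported
        try {
            swallowing.serve("req");
        } catch (NullPointerException e) {
            System.out.println("swallowed init: NPE on first request");
        }

        DemoShuffleService failFast = new DemoShuffleService();
        try {
            failFast.initFailFast(null);
        } catch (IllegalStateException e) {
            System.out.println("fail-fast init: " + e.getMessage());
        }
    }
}
```

With the swallowing variant the failure only surfaces later, on the first request, which matches the NPE in the first log above. With the fail-fast variant the caller sees the failure during initialization and can decide whether the host should keep accepting work.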