Details
Type: Improvement
Status: In Progress
Priority: Major
Resolution: Unresolved
3.2.0
None
None
Description
In order to enable dynamic allocation with a customized remote shuffle service, the following configuration properties are set:
- spark.dynamicAllocation.enabled=true
- spark.dynamicAllocation.shuffleTracking.enabled=false
- spark.shuffle.service.enabled=true
- spark.shuffle.manager=org.apache.spark.SomeShuffleManager
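As a minimal sketch of how the properties above might be passed to a job, the snippet below turns them into `spark-submit --conf` arguments. The helper names `to_submit_args` and `to_command` and the application file `app.py` are hypothetical, and `org.apache.spark.SomeShuffleManager` is the placeholder class name from this report, not a real Spark class.

```python
import shlex

# Properties exactly as listed in the report. SomeShuffleManager is the
# report's placeholder for a custom remote shuffle manager class.
SHUFFLE_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "false",
    "spark.shuffle.service.enabled": "true",
    "spark.shuffle.manager": "org.apache.spark.SomeShuffleManager",
}


def to_submit_args(conf):
    """Expand a dict of Spark properties into repeated --conf key=value pairs."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def to_command(conf, app="app.py"):
    """Build a shell-quoted spark-submit command line (app.py is hypothetical)."""
    return shlex.join(["spark-submit", *to_submit_args(conf), app])


if __name__ == "__main__":
    print(to_command(SHUFFLE_CONF))
```

Keeping the properties in one dictionary makes it easy to see that `spark.shuffle.service.enabled=true` is set alongside the custom shuffle manager, which is the combination this report is about.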
When running a Spark job with the above configuration, the job failed with the following error:
21/11/19 23:01:51 INFO BlockManager: external shuffle service port = 7337
21/11/19 23:01:51 INFO BlockManager: Registering executor with local external shuffle service.
21/11/19 23:01:51 ERROR BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
java.io.IOException: Failed to connect to /10.1.2.75:7337
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
    at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:294)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
    at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
    at org.apache.spark.executor.Executor.<init>(Executor.scala:118)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.1.2.75:7337
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more
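The trace shows the executor trying to register with the node-local external shuffle service on port 7337 and getting "Connection refused", which suggests nothing was listening on that port on the executor host. A rough way to confirm this from the affected node is a plain TCP probe. The helper name `shuffle_service_reachable` is hypothetical, and this sketch only checks that the port accepts connections, not that a shuffle service is actually behind it.

```python
import socket


def shuffle_service_reachable(host, port=7337, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    7337 is the port reported in the log above. A refused or timed-out
    connection returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 10.1.2.75 is the executor host address from the stack trace.
    print(shuffle_service_reachable("10.1.2.75"))
```

If the probe returns False on hosts that rely only on the remote shuffle service, the failure is consistent with `spark.shuffle.service.enabled=true` causing executors to look for a local external shuffle service that is not deployed.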