SPARK-37394

Skip registering with ESS if a customized shuffle manager is configured


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

    Description

      In order to enable dynamic allocation with a customized remote shuffle service, the following configuration properties are set:

      • spark.dynamicAllocation.enabled=true
      • spark.dynamicAllocation.shuffleTracking.enabled=false
      • spark.shuffle.service.enabled=true
      • spark.shuffle.manager=org.apache.spark.SomeShuffleManager
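
For illustration, the properties above could be passed to `spark-submit` as `--conf key=value` pairs. The snippet below is a minimal sketch in plain Python that only renders those flags. The helper name `to_submit_args` is hypothetical, and `org.apache.spark.SomeShuffleManager` is the placeholder class name used in this issue, not a real class.

```python
# Properties copied from the description above.
# "org.apache.spark.SomeShuffleManager" is the issue's placeholder class name.
conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "false",
    "spark.shuffle.service.enabled": "true",
    "spark.shuffle.manager": "org.apache.spark.SomeShuffleManager",
}


def to_submit_args(conf):
    """Render each property as a `--conf key=value` pair (hypothetical helper)."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(conf)
# Print the command line prefix; the application jar and its arguments are omitted.
print(" ".join(["spark-submit"] + submit_args))
```

The same keys can equally be placed in `spark-defaults.conf`; the sketch only shows one way to supply them.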

      When a Spark job runs with the above configuration, it fails with the following error:

      21/11/19 23:01:51 INFO BlockManager: external shuffle service port = 7337
      21/11/19 23:01:51 INFO BlockManager: Registering executor with local external shuffle service.
      21/11/19 23:01:51 ERROR BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
      java.io.IOException: Failed to connect to /10.1.2.75:7337
      	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
      	at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:201)
      	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
      	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:294)
      	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
      	at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:291)
      	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:265)
      	at org.apache.spark.executor.Executor.<init>(Executor.scala:118)
      	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.1.2.75:7337
      	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
      	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
      	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
      	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
      	... 1 more
      Caused by: java.net.ConnectException: Connection refused
      	... 11 more
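
The trace shows the executor trying to register with the local external shuffle service (ESS) during `BlockManager.initialize`, even though a custom shuffle manager is configured. The check proposed in the title could look roughly like the sketch below. This is plain Python modelling the decision only, not Spark's actual code: the function name `should_register_with_ess` and the rule "any manager other than the built-in sort manager counts as customized" are assumptions made for illustration.

```python
# Names that resolve to Spark's built-in sort-based shuffle manager.
BUILT_IN_SHUFFLE_MANAGERS = {
    "sort",
    "tungsten-sort",
    "org.apache.spark.shuffle.sort.SortShuffleManager",
}


def should_register_with_ess(conf):
    """Hypothetical check: register with the ESS only when the shuffle service
    is enabled AND the built-in shuffle manager is in use."""
    service_enabled = conf.get("spark.shuffle.service.enabled", "false").lower() == "true"
    # Assumption: any manager outside the built-in set is treated as customized.
    manager = conf.get("spark.shuffle.manager", "sort")
    return service_enabled and manager in BUILT_IN_SHUFFLE_MANAGERS


# Configuration from this issue: the ESS flag is on, but the manager is custom.
issue_conf = {
    "spark.shuffle.service.enabled": "true",
    "spark.shuffle.manager": "org.apache.spark.SomeShuffleManager",
}
print(should_register_with_ess(issue_conf))
```

Under this rule the configuration from the description would skip registration, so the executor would not attempt the connection to port 7337 that fails in the log above.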
      

          People

            Assignee: Unassigned
            Reporter: Weiwei Yang (wwei)