Spark / SPARK-19528

external shuffle service registration timeout is very short with heavy workloads when dynamic allocation is enabled


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.6.2, 1.6.3, 2.0.2
    • Fix Version/s: None
    • Component/s: None
    • Environment: Hadoop 2.7.1, Spark 1.6.2, Hive 2.2

    Description

      When dynamic allocation is enabled, the external shuffle service keeps serving shuffle output on behalf of executors, so every executor must register with the service when it starts. The service should therefore not close the connection while an executor still has an outstanding registration request; under heavy workloads, however, the service cannot answer within the hard-coded 5-second registration timeout, the connection is closed with the request still outstanding, and the executor ultimately fails to start (see the logs below).
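
      For context, the "Failed to connect" errors below are thrown from the executor-side retry loop in BlockManager.registerWithExternalShuffleServer, which is visible in the stack trace. The following is a minimal, self-contained Scala sketch of that loop in the affected versions, not the exact source: registerOnce stands in for the real ExternalShuffleClient call, and the two constants are the hard-coded values the log messages reflect.

      // Sketch of BlockManager.registerWithExternalShuffleServer (Spark 1.6/2.0).
      // Neither constant can be changed through configuration.
      object RegistrationSketch {
        val MAX_ATTEMPTS = 3      // total registration attempts
        val SLEEP_TIME_SECS = 5   // wait between attempts

        // Stand-in for ExternalShuffleClient.registerWithShuffleServer, which
        // issues a synchronous RPC with its own hard-coded 5-second timeout
        // (sketched after the container log below).
        def registerOnce(): Unit =
          throw new RuntimeException(
            new java.util.concurrent.TimeoutException("Timeout waiting for task."))

        def registerWithExternalShuffleServer(): Unit = {
          for (i <- 1 to MAX_ATTEMPTS) {
            try {
              registerOnce()
              return
            } catch {
              // The guard excludes the final attempt, so the last failure
              // propagates and the executor never finishes starting up.
              case e: Exception if i < MAX_ATTEMPTS =>
                println(s"Failed to connect to external shuffle server, will retry " +
                  s"${MAX_ATTEMPTS - i} more times after waiting $SLEEP_TIME_SECS seconds...")
                Thread.sleep(SLEEP_TIME_SECS * 1000L)
            }
          }
        }
      }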

      Container log:

      17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.1.1:41867
      17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver
      17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host hsx-node8
      17/02/09 08:30:46 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374.
      17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374
      17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 7337
      17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register BlockManager
      17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager
      17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local external shuffle service.
      17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed
      17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds...
      java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
      	at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
      	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
      	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144)
      	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
      	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
      	at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215)
      	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201)
      	at org.apache.spark.executor.Executor.<init>(Executor.scala:86)
      	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83)
      	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
      	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
      	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
      	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
      	at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
      	at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
      	at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274)
      	... 14 more
      17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 1 more times after waiting 5 seconds...
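
      The "Timeout waiting for task" above originates in the synchronous RPC inside ExternalShuffleClient.registerWithShuffleServer (a Java class in Spark's network-shuffle module); note the 5-second gap between 08:30:46 and 08:30:51 in the log. Below is a rough Scala rendering of the relevant call, with a stub standing in for Spark's TransportClient:

      import java.nio.ByteBuffer
      import java.util.concurrent.TimeoutException

      // Stub for Spark's TransportClient; only the timeout behaviour matters here.
      class TransportClientStub {
        // Blocks for at most timeoutMs and then fails, like the real sendRpcSync.
        def sendRpcSync(message: ByteBuffer, timeoutMs: Long): ByteBuffer =
          throw new RuntimeException(new TimeoutException("Timeout waiting for task."))
        def close(): Unit = ()
      }

      object ClientSketch {
        // Rough rendering of ExternalShuffleClient.registerWithShuffleServer: the
        // 5000 ms literal means a shuffle service that stays busy for more than
        // 5 seconds fails the executor's registration attempt outright.
        def registerWithShuffleServer(client: TransportClientStub, registerMessage: ByteBuffer): Unit = {
          try {
            client.sendRpcSync(registerMessage, 5000 /* hard-coded timeout in ms */)
          } finally {
            client.close()
          }
        }
      }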
      

      NodeManager log (after the executor exhausts its registration retries, its JVM exits, and YARN records the container failing with exit code 1):

      2017-02-09 08:30:48,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1486564603520_0097_01_000005]
      2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1486564603520_0096_01_000071 is : 1
      2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1486564603520_0096_01_000071 and exit code: 1
      ExitCodeException exitCode=1:
              at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
              at org.apache.hadoop.util.Shell.run(Shell.java:456)
              at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
              at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
              at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1486564603520_0096_01_000071
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.run(Shell.java:456)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.lang.Thread.run(Thread.java:745)
      2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
      2017-02-09 08:31:12,122 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1486564603520_0096_01_000071 transitioned from RUNNING to EXITED_WITH_FAILURE
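
      Because both the retry count and the RPC timeout are literals in the affected versions, 1.6.x/2.0.x offer no configuration workaround; raising the limits is presumably what the attached patches do. Later Spark releases made both values configurable, so on versions that expose these keys the registration can be given more headroom, e.g.:

      import org.apache.spark.SparkConf

      // Only meaningful on Spark versions that expose these keys (they were
      // added after this issue was filed); 1.6.x/2.0.x ignore them.
      val conf = new SparkConf()
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.registration.timeout", "30000")   // milliseconds; default 5000
        .set("spark.shuffle.registration.maxAttempts", "5")   // default 3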
      

      Attachments

        1. SPARK-19528.1.spark2.patch
          0.9 kB
          KaiXu
        2. SPARK-19528.1.patch
          0.8 kB
          KaiXu

            People

              Assignee: Unassigned
              Reporter: KaiXu
              Votes: 3
              Watchers: 18
