Spark / SPARK-31898

[K8S] Driver may launch an uncontrolled number of executors if executors can't talk to the driver


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels: None

    Description

      We launched a simple SparkPi job on k8s with dynamic allocation enabled (with shuffle tracking, as introduced in 3.0). For a reason we're still investigating, the Spark executors would start but couldn't talk to the driver; an executor log is attached below. The Spark driver kept requesting executor pods from k8s, since from its perspective the executors never arrived.
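      For reference, a submission matching this description might look like the following. This is a hypothetical sketch, not our exact command: the image name, API server address, namespace, and executor cap are placeholders.

```shell
# Hypothetical spark-submit for a SparkPi job on Kubernetes with dynamic
# allocation and shuffle tracking (Spark 3.0). <api-server> and
# <spark-3.0.0-image> are placeholders, not values from the real run.
spark-submit \
  --master k8s://https://<api-server>:443 \
  --deploy-mode cluster \
  --name sparkpi2 \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.namespace=spark-apps \
  --conf spark.kubernetes.container.image=<spark-3.0.0-image> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=10 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar 10000
```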

      The end result was that we launched an unbounded number of Spark executor pods, all stuck in the Running state but doing nothing. The spark.dynamicAllocation.maxExecutors parameter doesn't help: the Spark app eventually filled up the k8s cluster once it had autoscaled to maximum node pool capacity.
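      A minimal model of the runaway loop described above, assuming (as the symptom suggests, not as confirmed internals) that the driver sizes each new pod request from registered executors only, so executors that launch but never connect back are never counted against the target:

```shell
# Hypothetical model of the allocation loop. The numbers are illustrative:
# target stands in for spark.dynamicAllocation.maxExecutors.
target=5        # executor cap as the driver sees it
registered=0    # no executor ever connects back to the driver
launched=0      # total pods requested from Kubernetes

for round in 1 2 3 4 5 6 7 8 9 10; do
  # Each round the driver sees the full shortfall again, because the
  # already-launched (but unregistered) pods are invisible to it.
  shortfall=$((target - registered))
  launched=$((launched + shortfall))
done

echo "$launched"   # prints 50: ten rounds x 5 pods, despite a cap of 5
```

Under this assumption the pod count grows linearly with allocation rounds, bounded only by the cluster's capacity, which matches what we observed.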

      We may be able to fix this issue and https://issues.apache.org/jira/browse/SPARK-26423 at the same time: in both cases the task is to clean up Spark executors that can't talk to the driver.

       

      Executor log:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dspark.driver.blockManager.port=7079 -Dspark.driver.port=7078 -Xms3g -Xmx3g -cp ':/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078 --executor-id 55 --cores 2 --app-id spark-fc2da45b9e1549edac73739d8132aa2d --hostname 10.0.4.3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/02 18:39:00 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 13@sparkpi2-20200602-183440-y42hy-3392e47276506f56-exec-55
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for TERM
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for HUP
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for INT
20/06/02 18:39:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/02 18:39:01 INFO SecurityManager: Changing view acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing view acls groups to:
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls groups to:
20/06/02 18:39:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:254)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:244)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:227)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:272)
	at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
	at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:270)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	... 4 more
Caused by: java.io.IOException: Failed to connect to sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc
	at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
	at java.net.InetAddress.getByName(InetAddress.java:1077)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:202)
	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:48)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:182)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:168)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
	at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:985)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:505)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:416)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:475)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 more
      

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Jean-Yves STEPHAN (jystephan)
            Votes: 0
            Watchers: 1
