Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.0.0
None
None
Description
We launched a simple SparkPi job on k8s with dynamic allocation enabled (with shuffle tracking, as in 3.0). For a reason we're still investigating, the Spark executors would start but couldn't talk to the driver. I've attached an executor log below; it ends in a java.net.UnknownHostException for the driver service hostname. The Spark driver kept requesting executor pods from k8s because, from the driver's perspective, it never got them.
The end result was that we launched an unbounded number of Spark executor pods, which were stuck in the Running state but doing nothing. The spark.dynamicAllocation.maxExecutors parameter doesn't help: the Spark app eventually filled up the k8s cluster after the cluster had autoscaled to its maximum node pool capacity.
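To make the symptom concrete, here is a toy model of the feedback loop described above. It is not Spark's actual allocator code; the function name `simulate` and the `count_unregistered` switch are hypothetical, and the model assumes (as the report describes) that no pod ever registers with the driver. It only illustrates why a cap applied to registered executors alone cannot bound the number of pods created.

```python
def simulate(rounds, target, count_unregistered):
    """Toy allocation loop; returns the total number of pods created.

    Assumption: every pod starts but never registers with the driver.
    `target` plays the role of an executor cap such as maxExecutors.
    """
    registered = 0  # executors the driver has heard from
    created = 0     # pods actually created in the cluster
    for _ in range(rounds):
        # What the driver believes it already has.
        known = registered + (created if count_unregistered else 0)
        shortfall = max(0, target - known)
        created += shortfall  # driver requests the shortfall from the cluster
        # No pod ever registers, so `registered` stays at 0.
    return created


# Cap checked against registered executors only: grows every round.
print(simulate(rounds=10, target=4, count_unregistered=False))
# Cap checked against registered plus unregistered pods: stays bounded.
print(simulate(rounds=10, target=4, count_unregistered=True))
```

In this model the first call creates 4 new pods per round indefinitely, while the second stops at the cap, which is the behaviour one would expect from maxExecutors.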
We may be able to fix this issue and https://issues.apache.org/jira/browse/SPARK-26423 at the same time, since both cases are about cleaning up Spark executors that can't talk to the driver.
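One possible shape for such a cleanup is a registration timeout: any executor pod that has not registered with the driver within some window is selected for deletion. The sketch below is only an illustration of that idea, not a proposed patch; `PodInfo`, `pods_to_delete`, and `registration_timeout_s` are hypothetical names, and the actual pod deletion call is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodInfo:
    name: str
    created_at: float  # seconds since epoch


def pods_to_delete(pods, registered_names, now, registration_timeout_s):
    """Return names of pods that failed to register within the timeout."""
    return [
        p.name
        for p in pods
        if p.name not in registered_names
        and now - p.created_at > registration_timeout_s
    ]


pods = [
    PodInfo("exec-1", created_at=0.0),    # registered, keep
    PodInfo("exec-2", created_at=0.0),    # never registered, too old
    PodInfo("exec-3", created_at=290.0),  # not registered yet, still young
]
print(pods_to_delete(pods, {"exec-1"}, now=300.0, registration_timeout_s=60.0))
```

The timeout has to be long enough to cover normal pod startup, otherwise healthy executors that are still starting would be removed.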
Executor log:
++ id -u
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ '[' -z root:x:0:0:root:/root:/bin/bash ']'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=(${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp "$SPARK_CLASSPATH" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP)
+ exec /usr/bin/tini -s -- /usr/local/openjdk-8/bin/java -Dspark.driver.blockManager.port=7079 -Dspark.driver.port=7078 -Xms3g -Xmx3g -cp ':/opt/spark/jars/*' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078 --executor-id 55 --cores 2 --app-id spark-fc2da45b9e1549edac73739d8132aa2d --hostname 10.0.4.3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/02 18:39:00 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 13@sparkpi2-20200602-183440-y42hy-3392e47276506f56-exec-55
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for TERM
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for HUP
20/06/02 18:39:00 INFO SignalUtils: Registered signal handler for INT
20/06/02 18:39:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/06/02 18:39:01 INFO SecurityManager: Changing view acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls to: root
20/06/02 18:39:01 INFO SecurityManager: Changing view acls groups to:
20/06/02 18:39:01 INFO SecurityManager: Changing modify acls groups to:
20/06/02 18:39:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:254)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:244)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:227)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$3(CoarseGrainedExecutorBackend.scala:272)
    at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
    at scala.collection.immutable.Range.foreach(Range.scala:158)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$1(CoarseGrainedExecutorBackend.scala:270)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc:7078
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: sparkpi2-20200602-183440-y42hy-3392e47276506f56-driver-svc.spark-apps.svc
    at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
    at java.net.InetAddress.getAllByName(InetAddress.java:1193)
    at java.net.InetAddress.getAllByName(InetAddress.java:1127)
    at java.net.InetAddress.getByName(InetAddress.java:1077)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
    at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
    at java.security.AccessController.doPrivileged(Native Method)
    at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
    at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
    at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
    at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
    at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
    at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:202)
    at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:48)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:182)
    at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:168)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
    at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
    at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
    at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:985)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:505)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:416)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:475)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    ... 1 more
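When triaging many stuck executor pods, it can help to pull the innermost "Caused by:" entry out of each log automatically. The helper below is a small sketch for that purpose; `root_cause` is a hypothetical name, and the parsing assumes the usual Java stack trace layout in which the exception line is followed by "at ..." frames. The sample text is abbreviated from the log above.

```python
import re


def root_cause(log_text):
    """Return (exception_class, message) of the last 'Caused by:' entry, or None."""
    marker = "Caused by: "
    idx = log_text.rfind(marker)
    if idx == -1:
        return None
    tail = log_text[idx + len(marker):]
    # The exception line ends where the first stack frame ("at ...") begins.
    head = re.split(r"\s+at\s", tail, maxsplit=1)[0].strip()
    exc_class, _, message = head.partition(":")
    return exc_class.strip(), message.strip()


sample = (
    'Exception in thread "main" java.lang.reflect.UndeclaredThrowableException\n'
    "    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)\n"
    "Caused by: java.io.IOException: Failed to connect to driver-svc.spark-apps.svc:7078\n"
    "    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)\n"
    "Caused by: java.net.UnknownHostException: driver-svc.spark-apps.svc\n"
    "    at java.net.InetAddress.getAllByName0(InetAddress.java:1281)\n"
)
print(root_cause(sample))
```

For the log in this report the innermost cause is java.net.UnknownHostException, meaning the executor could not resolve the driver service hostname.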