Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 3.0.1
- Fix Version: None
- Component: None
Description
In situations where the executors lose contact with the driver, the Java process does not die. I am looking into what on the Kubernetes cluster could prevent proper clean-up.
The Spark driver is started in its own pod in client mode (a PySpark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (OOM or a kill signal, for instance), the executor complains about the connection being reset by peer and then hangs.
Here's the log from an executor pod that hangs:
20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
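The log ends after "BlockManager stopped": the executor decided to self-exit but the JVM never terminated. The linked SPARK-36532 describes a deadlock in CoarseGrainedExecutorBackend.onDisconnected; the general pattern behind such hangs, sketched here with plain threads and locks rather than Spark's actual code, is two shutdown paths that acquire the same pair of locks in opposite order:

```java
import java.util.concurrent.CountDownLatch;

// Minimal illustration (not Spark code) of a lock-ordering deadlock:
// a "disconnect handler" and a "shutdown hook" each hold one lock and
// wait for the other's, so neither path finishes and the JVM never exits.
public class DeadlockDemo {
    public static boolean demo() throws InterruptedException {
        final Object handlerLock = new Object();
        final Object shutdownLock = new Object();
        CountDownLatch bothHeld = new CountDownLatch(2);

        Thread handler = new Thread(() -> {
            synchronized (handlerLock) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException e) { return; }
                synchronized (shutdownLock) { } // never acquired
            }
        }, "disconnect-handler");
        Thread shutdown = new Thread(() -> {
            synchronized (shutdownLock) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException e) { return; }
                synchronized (handlerLock) { } // never acquired
            }
        }, "shutdown-hook");

        // Daemon threads so the demo JVM itself can still exit.
        handler.setDaemon(true);
        shutdown.setDaemon(true);
        handler.start();
        shutdown.start();

        // If both threads are still alive after the timeout, they are deadlocked.
        handler.join(500);
        shutdown.join(500);
        return handler.isAlive() && shutdown.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked: " + demo());
    }
}
```

If this is the same failure mode, running jstack against the hung executor JVM should report the two stuck threads under "Found one Java-level deadlock".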
When I start a shell in the pod I can see the processes are still running:
UID PID PPID C SZ      RSS    PSR STIME TTY   TIME     CMD
185 125    0 0 5045    3968   2   10:07 pts/0 00:00:00 /bin/bash
185 166  125 0 9019    3364   1   10:39 pts/0 00:00:00 \_ ps -AF --forest
185   1    0 0 1130    768    0   07:34 ?     00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
185  14    1 0 1935527 493976 3   07:34 ?     00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
Here's the full command used to start the executor:
/opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
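Since tini (run with -s) only forwards signals and reaps children, it cannot force the JVM down on its own once the executor wedges. One possible executor-side mitigation, shown purely as a hypothetical sketch (the class name, grace period, and injected terminator are my assumptions, not Spark's API), is a watchdog armed on driver disconnect that force-terminates the process if graceful shutdown does not complete in time. Runtime.getRuntime().halt(...) skips shutdown hooks entirely, so a blocked hook cannot stop it:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical mitigation sketch (not Spark's actual code): after the driver
// disconnects, give graceful shutdown a grace period, then force-terminate.
// The terminator is injected so the logic can be exercised without killing
// the test JVM; in a real executor it would be
// () -> Runtime.getRuntime().halt(1).
class DriverDisconnectWatchdog {
    private final long gracePeriodMillis;
    private final Runnable terminator;
    private final AtomicBoolean shutdownDone = new AtomicBoolean(false);

    DriverDisconnectWatchdog(long gracePeriodMillis, Runnable terminator) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.terminator = terminator;
    }

    // Call when the normal shutdown path finishes; disarms the watchdog.
    void markShutdownComplete() {
        shutdownDone.set(true);
    }

    // Call from the disconnect handler: start a daemon thread that fires
    // the terminator if graceful shutdown has not completed in time.
    void onDriverDisconnected() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(gracePeriodMillis);
            } catch (InterruptedException e) {
                return; // watchdog cancelled
            }
            if (!shutdownDone.get()) {
                terminator.run();
            }
        }, "driver-disconnect-watchdog");
        t.setDaemon(true);
        t.start();
    }
}
```

The daemon flag matters: a non-daemon watchdog thread would itself keep the JVM alive, which is exactly the symptom being debugged here.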
Issue Links
- is fixed by: SPARK-36532 Deadlock in CoarseGrainedExecutorBackend.onDisconnected (Resolved)