Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 3.0.1
- Fix Version: None
- Component: None
Description
In situations where the executors lose contact with the driver, the Java process does not die. I am looking into what on the Kubernetes cluster could prevent proper clean-up.
The Spark driver is started in its own pod in client mode (a PySpark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (OOM or a kill signal, for instance), the executor complains about the connection being reset by peer and then hangs.
Here's the log from an executor pod that hangs:
20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
java.io.IOException: Connection reset by peer
at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
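The log ends after "BlockManager stopped": the executor decided to self-exit but the JVM never terminated. The linked SPARK-36532 describes a deadlock in CoarseGrainedExecutorBackend.onDisconnected; the general pattern behind such hangs, sketched here with plain threads and locks rather than Spark's actual code, is two shutdown paths that acquire the same pair of locks in opposite order:

```java
import java.util.concurrent.CountDownLatch;

// Minimal illustration (not Spark code) of a lock-ordering deadlock:
// a "disconnect handler" and a "shutdown hook" each hold one lock and
// wait for the other's, so neither path finishes and the JVM never exits.
public class DeadlockDemo {
    public static boolean demo() throws InterruptedException {
        final Object handlerLock = new Object();
        final Object shutdownLock = new Object();
        CountDownLatch bothHeld = new CountDownLatch(2);

        Thread handler = new Thread(() -> {
            synchronized (handlerLock) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException e) { return; }
                synchronized (shutdownLock) { } // never acquired
            }
        }, "disconnect-handler");
        Thread shutdown = new Thread(() -> {
            synchronized (shutdownLock) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException e) { return; }
                synchronized (handlerLock) { } // never acquired
            }
        }, "shutdown-hook");

        // Daemon threads so the demo JVM itself can still exit.
        handler.setDaemon(true);
        shutdown.setDaemon(true);
        handler.start();
        shutdown.start();

        // If both threads are still alive after the timeout, they are deadlocked.
        handler.join(500);
        shutdown.join(500);
        return handler.isAlive() && shutdown.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked: " + demo());
    }
}
```

If this is the same failure mode, running jstack against the hung executor JVM should report the two stuck threads under "Found one Java-level deadlock".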
When I start a shell in the pod I can see the processes are still running:
UID PID PPID C SZ      RSS    PSR STIME TTY   TIME     CMD
185 125    0 0 5045    3968   2   10:07 pts/0 00:00:00 /bin/bash
185 166  125 0 9019    3364   1   10:39 pts/0 00:00:00 \_ ps -AF --forest
185   1    0 0 1130    768    0   07:34 ?     00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
185  14    1 0 1935527 493976 3   07:34 ?     00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
Here's the full command used to start the executor:
/opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
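Since tini (run with -s) only forwards signals and reaps children, it cannot force the JVM down on its own once the executor wedges. One possible executor-side mitigation, shown purely as a hypothetical sketch (the class name, grace period, and injected terminator are my assumptions, not Spark's API), is a watchdog armed on driver disconnect that force-terminates the process if graceful shutdown does not complete in time. Runtime.getRuntime().halt(...) skips shutdown hooks entirely, so a blocked hook cannot stop it:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical mitigation sketch (not Spark's actual code): after the driver
// disconnects, give graceful shutdown a grace period, then force-terminate.
// The terminator is injected so the logic can be exercised without killing
// the test JVM; in a real executor it would be
// () -> Runtime.getRuntime().halt(1).
class DriverDisconnectWatchdog {
    private final long gracePeriodMillis;
    private final Runnable terminator;
    private final AtomicBoolean shutdownDone = new AtomicBoolean(false);

    DriverDisconnectWatchdog(long gracePeriodMillis, Runnable terminator) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.terminator = terminator;
    }

    // Call when the normal shutdown path finishes; disarms the watchdog.
    void markShutdownComplete() {
        shutdownDone.set(true);
    }

    // Call from the disconnect handler: start a daemon thread that fires
    // the terminator if graceful shutdown has not completed in time.
    void onDriverDisconnected() {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(gracePeriodMillis);
            } catch (InterruptedException e) {
                return; // watchdog cancelled
            }
            if (!shutdownDone.get()) {
                terminator.run();
            }
        }, "driver-disconnect-watchdog");
        t.setDaemon(true);
        t.start();
    }
}
```

The daemon flag matters: a non-daemon watchdog thread would itself keep the JVM alive, which is exactly the symptom being debugged here.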
Issue Links
- is fixed by: SPARK-36532 Deadlock in CoarseGrainedExecutorBackend.onDisconnected (Resolved)