SPARK-33325: Spark executor pods are not shutting down when losing driver connection


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels: None

    Description

      In situations where the executors lose contact with the driver, the executor's Java process does not die. I am looking into what on the Kubernetes cluster could prevent proper clean-up.

      The Spark driver is started in its own pod in client mode (a PySpark shell started by Jupyter). It works fine most of the time, but if the driver process crashes (OOM or a kill signal, for instance), the executor complains about a connection reset by peer and then hangs.
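
      For reference, the setup described above amounts to roughly the following (a minimal sketch only; the master URL, image name and resource values are placeholders, not the actual configuration):

      # Minimal sketch of a client-mode PySpark session on Kubernetes.
      # All values are placeholders, not the reporter's actual configuration.
      from pyspark.sql import SparkSession

      spark = (
          SparkSession.builder
          .master("k8s://https://kubernetes.default.svc")                   # assumed API server URL
          .config("spark.submit.deployMode", "client")
          .config("spark.driver.host", "10.17.0.152")                       # driver pod IP seen in the log below
          .config("spark.driver.port", "37161")                             # driver port seen in the log below
          .config("spark.executor.instances", "1")
          .config("spark.executor.memory", "4g")                            # matches the -Xms4g/-Xmx4g in the executor command below
          .config("spark.kubernetes.container.image", "<executor-image>")   # placeholder
          .getOrCreate()
      )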

      Here's the log from an executor pod that hangs:

      20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from /10.17.0.152:37161
      java.io.IOException: Connection reset by peer
      	at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method)
      	at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source)
      	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
      	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
      	at java.base/sun.nio.ch.IOUtil.read(Unknown Source)
      	at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source)
      	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253)
      	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133)
      	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
      	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
      	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
      	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
      	at java.base/java.lang.Thread.run(Unknown Source)
      20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.17.0.152:37161 disassociated! Shutting down.
      20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared
      20/11/03 07:35:31 INFO BlockManager: BlockManager stopped
      
      

      When I start a shell in the pod I can see that the processes are still running:

      UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
      185          125       0  0  5045  3968   2 10:07 pts/0    00:00:00 /bin/bash
      185          166     125  0  9019  3364   1 10:39 pts/0    00:00:00  \_ ps -AF --forest
      185            1       0  0  1130   768   0 07:34 ?        00:00:00 /usr/bin/tini -s -- /opt/java/openjdk/
      185           14       1  0 1935527 493976 3 07:34 ?       00:00:21 /opt/java/openjdk/bin/java -Dspark.dri
      

      Here's the full command used to start the executor:

      /opt/java/openjdk/bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --cores 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151
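
      Assuming the JDK tools are present in the executor image, a thread dump of the stuck JVM (PID 14 in the ps output above) might show which non-daemon threads are keeping it alive; something along these lines, with a hypothetical pod name:

      # <executor-pod> is a placeholder for the hanging executor's pod name
      kubectl exec -it <executor-pod> -- jstack 14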
      

       

People

    • Assignee: Unassigned
    • Reporter: Hadrien Kohl
    • Votes: 0
    • Watchers: 4