Spark / SPARK-35625

Spark on k8s zombie executors

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Kubernetes
    • Labels:
      None

      Description

      We are running a POC of Spark on K8s for one of our apps. The app scales up and down quite a lot, and after a while we started seeing quite a few of these logs:

      Error trying to remove broadcast 8805 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
      java.io.IOException: Failed to send RPC RPC 6006709312311899870 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
      
      Error trying to remove RDD 32952 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
      java.io.IOException: Failed to send RPC RPC 7506603739599355778 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
      

       

      All of the errors and warnings relate to attempts to remove (shuffle/broadcast/RDD) files or blocks, which doesn't seem to be harmful at this point, other than spamming our logs.
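To gauge how much of the log spam comes from a single dead executor, the messages can be grouped by block manager. The following is a minimal sketch, not part of Spark: the function name `count_failed_removals` is hypothetical, the regular expression is derived only from the broadcast and RDD lines quoted in this report, and the `shuffle` variant is assumed to follow the same wording.

```python
import re
from collections import Counter

# Pattern based on the driver log lines quoted in this report, e.g.
# "Error trying to remove broadcast 8805 from block manager
#  BlockManagerId(79, 10.244.248.23, 46681, None)".
# ASSUMPTION: shuffle removals are logged with the same wording.
LINE_RE = re.compile(
    r"Error trying to remove (?P<kind>broadcast|RDD|shuffle) (?P<block_id>\d+) "
    r"from block manager BlockManagerId\("
    r"(?P<executor_id>[^,]+), (?P<host>[^,]+), (?P<port>\d+), [^)]*\)"
)


def count_failed_removals(log_lines):
    """Count failed block-removal errors per (executor id, host)."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("executor_id"), match.group("host"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Error trying to remove broadcast 8805 from block manager "
        "BlockManagerId(79, 10.244.248.23, 46681, None)",
        "Error trying to remove RDD 32952 from block manager "
        "BlockManagerId(79, 10.244.248.23, 46681, None)",
    ]
    for (executor_id, host), n in count_failed_removals(sample).items():
        print(f"executor {executor_id} at {host}: {n} failed removals")
```

A high count concentrated on one or two executor IDs would suggest the driver is repeatedly addressing the same stale block managers.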

       

      The interesting part is that kubectl shows the executors are no longer alive (as expected), yet in the Spark UI they still show up as "active" with 0 cores:

      [Screenshots of the Spark UI; see Attachments.]

      All the executors marked above are long dead, but for some reason the driver app still tries to send RPC requests to them.
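The mismatch between the Spark UI and kubectl can be checked mechanically. The sketch below is hypothetical and rests on two assumptions: `executors` is the parsed JSON list returned by Spark's monitoring REST API (the `/api/v1/applications/<app-id>/executors` endpoint, using its `id`, `hostPort` and `isActive` fields), and `live_pod_ips` is a set of pod IPs collected separately, for example from `kubectl get pods -o wide`. The function name `find_zombie_executors` is illustrative only.

```python
def find_zombie_executors(executors, live_pod_ips):
    """Return IDs of executors the driver reports as active
    whose host IP no longer belongs to any live pod."""
    zombies = []
    for ex in executors:
        # Skip the driver entry and executors already marked inactive.
        if ex.get("id") == "driver" or not ex.get("isActive", False):
            continue
        # hostPort has the form "<ip>:<port>"; keep only the IP part.
        host = ex.get("hostPort", "").rsplit(":", 1)[0]
        if host not in live_pod_ips:
            zombies.append(ex["id"])
    return zombies


if __name__ == "__main__":
    # Illustrative data only; executor 79 mirrors the one in this report.
    executors = [
        {"id": "driver", "hostPort": "10.0.0.1:7078", "isActive": True},
        {"id": "79", "hostPort": "10.244.248.23:46681", "isActive": True},
        {"id": "80", "hostPort": "10.244.1.5:40000", "isActive": True},
    ]
    live_pod_ips = {"10.0.0.1", "10.244.1.5"}
    print(find_zombie_executors(executors, live_pod_ips))
```

Matching on IP alone can produce false negatives if a new pod reuses the IP of a dead executor, so this should be treated as a rough diagnostic rather than a definitive check.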

       

      According to our event logs, one of the pods was created on May 21 at 20:11 and killed 9 minutes later, at 20:20, but we are still seeing new log entries for it on Jun 3.

       

       

      Sample of one of the errors:

      Error trying to remove RDD 33178 from block manager BlockManagerId(79, 10.244.248.23, 46681, None)
      java.io.IOException: Failed to send RPC RPC 7684271332363250835 to /10.244.248.23:54004: io.netty.channel.StacklessClosedChannelException
          at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:363)
          at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:340)
          at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
          at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
          at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
          at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
          at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
          at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
          at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:998)
          at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:866)
          at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
          at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
          at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
          at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
          at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
          at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
          at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
          at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
          at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
      Caused by: io.netty.channel.StacklessClosedChannelException
          at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

        Attachments

        1. image-2021-06-03-12-16-12-573.png
          117 kB
          Liran
        2. image-2021-06-03-12-16-03-621.png
          291 kB
          Liran
        3. image-2021-06-03-12-15-57-095.png
          35 kB
          Liran

            People

            • Assignee: Unassigned
            • Reporter: lbuanos Liran
            • Votes: 0
            • Watchers: 3
