Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-29475

Executors aren't marked as lost on OutOfMemoryError in YARN mode

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: DStreams, Spark Core, YARN
    • Labels:

      Description

      My spark-streaming application runs in yarn cluster mode with ``spark.streaming.concurrentJobs`` set to 50. Once i observed that lots of batches were scheduled and application did not make any progress.

      Thread dump showed that all the streaming threads are blocked, infinitely waiting for result from executor on ``ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)``. 

      "streaming-job-executor-11" - Thread t@324
         java.lang.Thread.State: WAITING
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for <7fd76f11> (a scala.concurrent.impl.Promise$CompletionLatch)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
              at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:153)
              at org.apache.spark.util.ThreadUtils$.awaitReady(ThreadUtils.scala:222)
              at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:633)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
              at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)

      My tasks are short running and pretty simple, e.g. read raw data from Kafka, normalize it and write back to Kafka.

      After logs analysis i noticed that there were lots of ``RpcTimeoutException`` on both executor

      driver-heartbeater WARN executor.Executor: Issue communicating with driver in heartbeater
      org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
      	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
      

      and driver sides during context clearing.

      WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout WARN storage.BlockManagerMaster: Failed to remove RDD 583574 - Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeoutorg.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from bi-prod-hadoop-17.va2:25109 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
      

      Also the only error on executors was

      SIGTERM handler ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
      

      no exceptions at all.

      After digging into YARN logs i noticed that executors were killed by AM because of high memory usage. 

      Also there were no any logs on driver side saying that executors were lost. Thus seems that driver wasn't notified about this.

      Unfortunately i can't find a line of code in CoarseGrainedExecutorBackend which logs such message. ``exitExecutor`` method looks similar but it's message should look differently.

      However i believe that driver is notified that executor is lost via async rpc call, but if executor encounters issues with rpc because of high GC pressure this message won't be send.

       

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              choojoyq Nikita Gorbachevski
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: