SPARK-3612

Executor shouldn't quit if heartbeat message fails to reach the driver

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.1, 1.2.0
    • Component/s: Spark Core
    • Labels: None

    Description

      The thread started by Executor.startDriverHeartbeater can actually terminate the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an exception.

      I don't think we should quit the executor this way. At the very least, we should log a more meaningful exception than simply:

      14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
      java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:107)
              at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
              at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
      14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
      java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:107)
              at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
              at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
      14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
      java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:107)
              at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
              at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
      14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main]
      org.apache.spark.SparkException: Error sending message [message = Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
              at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
              at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
      Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
              at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
              at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
              at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
              at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
              at scala.concurrent.Await$.result(package.scala:107)
              at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
              ... 1 more
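
One possible shape for a more tolerant heartbeater is sketched below. This is only an illustration under stated assumptions, not the change that resolved this issue: `HeartbeatSketch`, `heartbeatOnce`, `runHeartbeats`, the `send` and `log` parameters, and the local `HeartbeatResponse` case class are hypothetical stand-ins, with `send` playing the role of the `AkkaUtils.askWithReply[HeartbeatResponse]` call. The idea is to catch non-fatal failures inside the loop body and log them, so that one failed heartbeat never reaches the uncaught exception handler.

```scala
import java.util.concurrent.TimeoutException
import scala.collection.mutable.ArrayBuffer
import scala.util.control.NonFatal

object HeartbeatSketch {
  // Hypothetical stand-in for the driver's reply; not Spark's own class.
  final case class HeartbeatResponse(reregisterBlockManager: Boolean)

  // Sends one heartbeat through `send` (assumed to wrap the ask-with-reply call).
  // A non-fatal failure is logged and reported as None instead of being rethrown,
  // so it cannot escape the heartbeater thread.
  def heartbeatOnce(send: () => HeartbeatResponse,
                    log: String => Unit): Option[HeartbeatResponse] =
    try Some(send())
    catch {
      case NonFatal(e) =>
        log(s"Heartbeat to driver failed, will retry on next interval: $e")
        None
    }

  // Runs `rounds` heartbeats and returns how many succeeded.
  // A real heartbeater would loop until shutdown and sleep between rounds.
  def runHeartbeats(rounds: Int,
                    send: () => HeartbeatResponse,
                    log: String => Unit): Int = {
    var succeeded = 0
    for (_ <- 1 to rounds) {
      if (heartbeatOnce(send, log).isDefined) succeeded += 1
    }
    succeeded
  }

  def main(args: Array[String]): Unit = {
    // Simulated driver link: the first two heartbeats time out, later ones succeed.
    var calls = 0
    val flakySend = () => {
      calls += 1
      if (calls <= 2) throw new TimeoutException("Futures timed out after [30 seconds]")
      HeartbeatResponse(reregisterBlockManager = false)
    }
    val warnings = ArrayBuffer.empty[String]
    val ok = runHeartbeats(4, flakySend, msg => { warnings += msg; () })
    println(s"successful heartbeats: $ok, warnings logged: ${warnings.size}")
  }
}
```

`NonFatal` deliberately does not match `InterruptedException` or JVM errors such as `OutOfMemoryError`, so in this sketch the thread can still be interrupted at shutdown and serious failures still propagate.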
      
      

          People

            sandyr Sandy Ryza
            rxin Reynold Xin