SPARK-24981

ShutdownHook timeout causes a succeeded job to fail when SparkContext.stop() is not called by the user program


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.0
    • Component/s: Spark Core
    • Labels: None

    Description

      When a user program does not stop the SparkContext before it exits, ShutdownHookManager stops the SparkContext instead. However, each shutdown hook is given only 10s to run; after that it is interrupted and cancelled. If stopping the SparkContext takes longer than 10s, an InterruptedException is thrown and the job fails even though it had already succeeded. An example is shown in the log below.

      I think there are a few ways to fix this. These are the two I have so far:

      1. After the user program finishes, we can check whether it stopped the SparkContext. If it did not, we can stop the SparkContext before the userThread finishes. That way, SparkContext.stop() can take as much time as it needs.

      2. We can catch the InterruptedException caused by the ShutdownHookManager timeout while we are stopping the SparkContext, and skip whatever inside the SparkContext has not been stopped yet. Since we are shutting down anyway, I think it is okay to skip those things.
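
The two options above can be sketched roughly as follows. This is only an illustration in plain Java: `FakeContext`, `runUserProgram`, and the `internal` thread are hypothetical stand-ins, not Spark classes or Spark's actual code. `runUserProgram` models option 1 (stop the context in the user thread if the user did not), and the `catch` inside `stop` models option 2 (tolerate the interrupt and skip the rest of the cleanup).

```java
import java.util.ArrayList;
import java.util.List;

public class StopSketch {
    /** Hypothetical stand-in for SparkContext; not the real Spark API. */
    static final class FakeContext {
        private boolean stopped = false;
        final List<String> log = new ArrayList<>();

        boolean isStopped() {
            return stopped;
        }

        /** Option 2: tolerate an interrupt while waiting for an internal thread. */
        void stop(Thread internal) {
            if (stopped) {
                return;
            }
            stopped = true;
            try {
                internal.join(); // may block for a long time
                log.add("clean stop");
            } catch (InterruptedException e) {
                // We are shutting down anyway, so skip the remaining cleanup.
                log.add("interrupted, remaining cleanup skipped");
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
        }
    }

    /** Option 1: after user code returns, stop the context in the same thread if needed. */
    static void runUserProgram(FakeContext ctx, Runnable userMain, Thread internal) {
        userMain.run();
        if (!ctx.isStopped()) {
            ctx.stop(internal); // not subject to any shutdown-hook time limit
        }
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        Thread internal = new Thread(() -> { });
        internal.start();
        runUserProgram(ctx, () -> System.out.println("user code done"), internal);
        System.out.println(ctx.log);
    }
}
```

With option 1 the stop happens before the JVM starts running shutdown hooks, so the hook time limit never applies. Option 2 keeps the hook but treats the interrupt as a reason to give up quietly rather than as a failure.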

       

      18/07/31 17:11:49 ERROR Utils: Uncaught exception in thread pool-4-thread-1
      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at java.lang.Thread.join(Thread.java:1249)
      	at java.lang.Thread.join(Thread.java:1323)
      	at org.apache.spark.scheduler.AsyncEventQueue.stop(AsyncEventQueue.scala:136)
      	at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
      	at org.apache.spark.scheduler.LiveListenerBus$$anonfun$stop$1.apply(LiveListenerBus.scala:219)
      	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
      	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
      	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
      	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
      	at org.apache.spark.scheduler.LiveListenerBus.stop(LiveListenerBus.scala:219)
      	at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1922)
      	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1360)
      	at org.apache.spark.SparkContext.stop(SparkContext.scala:1921)
      	at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:573)
      	at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
      	at scala.util.Try$.apply(Try.scala:192)
      	at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
      	at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      	at java.lang.Thread.run(Thread.java:745)
      18/07/31 17:11:49 WARN ShutdownHookManager: ShutdownHook '$anon$2' timeout, java.util.concurrent.TimeoutException
      java.util.concurrent.TimeoutException
      	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
      	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
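
The trace shows `FutureTask.get` timing out in the hook manager while the hook thread is blocked in `Thread.join`. A minimal Java sketch of that interaction follows, assuming the manager cancels the hook's future with interruption after the timeout. `HookTimeoutDemo`, `runWithTimeout`, and the 200 ms limit are made up for illustration and are not the Hadoop or Spark implementation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class HookTimeoutDemo {
    /** Runs a hook with a time limit; returns true if the hook timed out and was cancelled. */
    static boolean runWithTimeout(Runnable hook, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(hook);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the thread that is running the hook
            return true;
        } finally {
            executor.shutdown();
            executor.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean interrupted = new AtomicBoolean(false);

        // Stands in for an internal thread that takes a long time to finish.
        Thread slowWorker = new Thread(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException ignored) {
                // nothing to clean up in this sketch
            }
        });
        slowWorker.setDaemon(true);
        slowWorker.start();

        // Stands in for a stop() call that joins the slow thread.
        Runnable slowStop = () -> {
            try {
                slowWorker.join();
            } catch (InterruptedException e) {
                interrupted.set(true); // same exception type as at the top of the trace
            }
        };

        boolean timedOut = runWithTimeout(slowStop, 200);
        System.out.println("timedOut=" + timedOut + ", interrupted=" + interrupted.get());
    }
}
```

In this sketch the hook itself did nothing wrong. It was still waiting when the limit expired, and the cancellation surfaced as an `InterruptedException` inside `join`.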
      

          People

            Assignee: hthuynh2 Hieu Tri Huynh
            Reporter: hthuynh2 Hieu Tri Huynh