Spark / SPARK-1175

On shutting down a long-running job, the cluster does not accept new jobs and hangs


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: Spark Core

    Description

      When shutting down a long-running job (24+ hours) that runs periodically on the same context and generates a lot of shuffles (many hundreds of GB), the Spark workers hang for a long while and the cluster does not accept new jobs. The only way to proceed is to kill -9 the workers.
      This is a big problem because when multiple contexts run on the same cluster, one must stop all of them for a simple restart.
      The context is stopped using sc.stop().
      This happens both in standalone mode and under Mesos.
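
      For reference, the driver follows roughly this pattern (a minimal sketch only; the application name, paths and job body are illustrative, not our actual code):

      import org.apache.spark.{SparkConf, SparkContext}

      object LongRunningJob {
        // Illustrative stop flag; in practice the job is asked to stop externally
        @volatile private var stopRequested = false

        def main(args: Array[String]): Unit = {
          // One long-lived SparkContext shared by every periodic run
          val sc = new SparkContext(new SparkConf().setAppName("long-running-job"))
          try {
            while (!stopRequested) {
              // Each run is shuffle-heavy; hundreds of GB accumulate over 24+ hours
              sc.textFile("hdfs:///input/*")
                .map(line => (line.hashCode % 1024, 1L))
                .reduceByKey(_ + _)
                .saveAsTextFile("hdfs:///output/run-" + System.currentTimeMillis)
              Thread.sleep(60 * 60 * 1000L) // wait for the next period
            }
          } finally {
            sc.stop() // stopping the context is what triggers the hang described below
          }
        }
      }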

      We suspect this is caused by the "delete Spark local dirs" shutdown hook thread. A thread dump of the worker is attached; the relevant part may be:

      "SIGTERM handler" - Thread t@41040
      java.lang.Thread.State: BLOCKED
      at java.lang.Shutdown.exit(Shutdown.java:168)

      • waiting to lock <69eab6a3> (a java.lang.Class) owned by "SIGTERM handler" t@41038
        at java.lang.Terminator$1.handle(Terminator.java:35)
        at sun.misc.Signal$1.run(Signal.java:195)
        at java.lang.Thread.run(Thread.java:662)

      Locked ownable synchronizers:

      • None

      "delete Spark local dirs" - Thread t@40
      java.lang.Thread.State: RUNNABLE
      at java.io.UnixFileSystem.delete0(Native Method)
      at java.io.UnixFileSystem.delete(UnixFileSystem.java:251)
      at java.io.File.delete(File.java:904)
      at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:482)
      at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
      at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
      at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
      at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
      at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:479)
      at org.apache.spark.util.Utils$$anonfun$deleteRecursively$1.apply(Utils.scala:478)
      at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
      at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:478)
      at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:141)
      at org.apache.spark.storage.DiskBlockManager$$anon$1$$anonfun$run$2.apply(DiskBlockManager.scala:139)
      at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
      at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
      at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:139)

      Locked ownable synchronizers:
      - None

      "SIGTERM handler" - Thread t@41038
      java.lang.Thread.State: WAITING
      at java.lang.Object.wait(Native Method)

      • waiting on <355c6c8d> (a org.apache.spark.storage.DiskBlockManager$$anon$1)
        at java.lang.Thread.join(Thread.java:1186)
        at java.lang.Thread.join(Thread.java:1239)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
        at java.lang.Shutdown.runHooks(Shutdown.java:79)
        at java.lang.Shutdown.sequence(Shutdown.java:123)
        at java.lang.Shutdown.exit(Shutdown.java:168)
      • locked <69eab6a3> (a java.lang.Class)
        at java.lang.Terminator$1.handle(Terminator.java:35)
        at sun.misc.Signal$1.run(Signal.java:195)
        at java.lang.Thread.run(Thread.java:662)

      Locked ownable synchronizers:

      • None
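
      Our reading of the dump: the "delete Spark local dirs" shutdown hook registered by DiskBlockManager is still deleting the shuffle files when SIGTERM arrives; the first SIGTERM handler takes the Shutdown class lock and joins that hook in ApplicationShutdownHooks.runHooks, and every further SIGTERM blocks on the same lock, so the worker neither exits nor accepts new work until the deletion finishes. The JVM-level behavior can be reproduced outside Spark with a minimal sketch like the one below (not Spark code; the sleep stands in for the recursive delete):

      object SlowShutdownHook {
        def main(args: Array[String]): Unit = {
          Runtime.getRuntime.addShutdownHook(new Thread("delete Spark local dirs") {
            override def run(): Unit = {
              // Stand-in for Utils.deleteRecursively over hundreds of GB of shuffle files
              Thread.sleep(10 * 60 * 1000L)
            }
          })
          // Send SIGTERM to this process: the first signal starts the hook and blocks in
          // Shutdown.exit until the hook finishes; any further SIGTERM blocks waiting for
          // the Shutdown class lock, exactly as in the worker thread dump above.
          Thread.sleep(Long.MaxValue)
        }
      }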

    People

      Assignee: Nan Zhu (codingcat)
      Reporter: Tal Sliwowicz (sliwo)
      Votes: 0
      Watchers: 4
