SPARK-21197

Tricky use case makes dead application struggle for a long duration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.2, 2.1.1
    • Fix Version/s: None
    • Component/s: DStreams, Spark Core
    • Labels: None

    Description

      The use case is in Spark Streaming while the root cause is in DAGScheduler, so I set the component to both DStreams and Spark Core.

      Use case:

      The user has a thread that periodically triggers Spark jobs, and in the same application they receive data through Spark Streaming from an external source. In the streaming logic an exception is thrown, so the whole application is supposed to shut down and let YARN restart it.
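
      A minimal sketch of that pattern (the object name, the socket source, and the intervals are placeholders, not taken from the reporter's application):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object PeriodicJobsWithStreaming {
        def main(args: Array[String]): Unit = {
          // local[4] keeps the sketch runnable; the reported case ran on YARN.
          val ssc = new StreamingContext("local[4]", "repro-sketch", Seconds(1))
          val sc: SparkContext = ssc.sparkContext

          // Background thread that keeps submitting Spark jobs, i.e. keeps
          // pushing JobSubmitted events into DAGScheduler's event queue.
          val submitter = new Thread("periodic-job-submitter") {
            override def run(): Unit = {
              while (true) {
                sc.parallelize(1 to 100).count()
                Thread.sleep(1000)
              }
            }
          }
          submitter.setDaemon(true)
          submitter.start()

          // Streaming logic that fails; the exception is supposed to propagate,
          // trigger SparkContext.stop(), and let YARN restart the application.
          val lines = ssc.socketTextStream("localhost", 9999)
          lines.foreachRDD { _ => throw new RuntimeException("simulated failure") }

          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```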

      The user observed that after the exception was propagated to Spark core and SparkContext.stop() was called, the application was still running 18 hours later.

      The root cause is that when we call DAGScheduler.stop(), we wait for the event loop's thread to finish (see https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1704 and https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L40).

      Since a thread is periodically pushing events into DAGScheduler's event queue, the event loop thread never finishes.
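
      To make the hang concrete, here is a simplified model of the shutdown described above; it is not Spark's actual EventLoop code, just an illustration of a stop() that joins the event thread:

      ```scala
      import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

      // Simplified model, not Spark's EventLoop: stop() blocks until the
      // event thread has drained the queue and exited.
      class DrainingEventLoop[E](name: String)(onReceive: E => Unit) {
        private val queue = new LinkedBlockingQueue[E]()
        @volatile private var stopRequested = false

        private val eventThread = new Thread(name) {
          override def run(): Unit = {
            // Keep processing until stop is requested AND the queue is empty.
            while (!stopRequested || !queue.isEmpty) {
              val event = queue.poll(100, TimeUnit.MILLISECONDS)
              if (event != null) onReceive(event)
            }
          }
        }

        def start(): Unit = eventThread.start()
        def post(event: E): Unit = queue.put(event)

        def stop(): Unit = {
          stopRequested = true
          // If another thread keeps calling post(), the queue never stays
          // empty, the event thread never exits, and this join() never
          // returns; this is what the reporter observed for 18 hours.
          eventThread.join()
        }
      }
      ```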

      A potential solution is to let EventLoop interrupt its thread directly in some cases (e.g. this one), while still allowing a graceful shutdown in others (e.g. the ListenerBus case).
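
      One way to read that proposal, sketched against the simplified loop above (the graceful flag and its semantics are my assumption, not a committed Spark API):

      ```scala
      // Hypothetical two-mode shutdown for the simplified loop above; run()
      // would additionally catch InterruptedException and exit immediately.
      def stop(graceful: Boolean): Unit = {
        stopRequested = true
        if (!graceful) {
          // Forced mode (the DAGScheduler case): interrupt the event thread,
          // so a busy producer cannot keep the application alive.
          eventThread.interrupt()
        }
        // Graceful mode (the ListenerBus case) simply waits for the queued
        // events to drain before the thread exits.
        eventThread.join()
      }
      ```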


          People

            Assignee: Unassigned
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 1
