SPARK-21197

Tricky use case makes dead application struggle for a long duration


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.2, 2.1.1
    • Fix Version/s: None
    • Component/s: DStreams, Spark Core
    • Labels: None

    Description

      The use case is in Spark Streaming while the root cause is in DAGScheduler, so I set the component to both DStreams and Spark Core.

      Use case:

      The user has a thread that periodically triggers Spark jobs, and in the same application they receive data through Spark Streaming from an external source. In the streaming logic an exception is thrown, so the whole application is supposed to shut down and let YARN restart it.
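
      A minimal sketch of that pattern (the object name, the socket source, and the intervals are placeholders, not taken from the reporter's application):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      object PeriodicJobsWithStreaming {
        def main(args: Array[String]): Unit = {
          // local[4] keeps the sketch runnable; the reported case ran on YARN.
          val ssc = new StreamingContext("local[4]", "repro-sketch", Seconds(1))
          val sc: SparkContext = ssc.sparkContext

          // Background thread that keeps submitting Spark jobs, i.e. keeps
          // pushing JobSubmitted events into DAGScheduler's event queue.
          val submitter = new Thread("periodic-job-submitter") {
            override def run(): Unit = {
              while (true) {
                sc.parallelize(1 to 100).count()
                Thread.sleep(1000)
              }
            }
          }
          submitter.setDaemon(true)
          submitter.start()

          // Streaming logic that fails; the exception is supposed to propagate,
          // trigger SparkContext.stop(), and let YARN restart the application.
          val lines = ssc.socketTextStream("localhost", 9999)
          lines.foreachRDD { _ => throw new RuntimeException("simulated failure") }

          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```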

      The user observed that after the exception was propagated to Spark core and SparkContext.stop() was called, the application was still running 18 hours later.

      The root cause is that when we call DAGScheduler.stop(), we wait for the event loop's thread to finish (see https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1704 and https://github.com/apache/spark/blob/03eb6117affcca21798be25706a39e0d5a2f7288/core/src/main/scala/org/apache/spark/util/EventLoop.scala#L40).

      Since a thread is periodically pushing events into DAGScheduler's event queue, the event loop thread never finishes.
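
      To make the hang concrete, here is a simplified model of the shutdown described above; it is not Spark's actual EventLoop code, just an illustration of a stop() that joins the event thread:

      ```scala
      import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

      // Simplified model, not Spark's EventLoop: stop() blocks until the
      // event thread has drained the queue and exited.
      class DrainingEventLoop[E](name: String)(onReceive: E => Unit) {
        private val queue = new LinkedBlockingQueue[E]()
        @volatile private var stopRequested = false

        private val eventThread = new Thread(name) {
          override def run(): Unit = {
            // Keep processing until stop is requested AND the queue is empty.
            while (!stopRequested || !queue.isEmpty) {
              val event = queue.poll(100, TimeUnit.MILLISECONDS)
              if (event != null) onReceive(event)
            }
          }
        }

        def start(): Unit = eventThread.start()
        def post(event: E): Unit = queue.put(event)

        def stop(): Unit = {
          stopRequested = true
          // If another thread keeps calling post(), the queue never stays
          // empty, the event thread never exits, and this join() never
          // returns; this is what the reporter observed for 18 hours.
          eventThread.join()
        }
      }
      ```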

      A potential solution is to let EventLoop interrupt its thread directly in some cases (e.g. this one), while still allowing a graceful shutdown in others (e.g. the ListenerBus case).
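
      One way to read that proposal, sketched against the simplified loop above (the graceful flag and its semantics are my assumption, not a committed Spark API):

      ```scala
      // Hypothetical two-mode shutdown for the simplified loop above; run()
      // would additionally catch InterruptedException and exit immediately.
      def stop(graceful: Boolean): Unit = {
        stopRequested = true
        if (!graceful) {
          // Forced mode (the DAGScheduler case): interrupt the event thread,
          // so a busy producer cannot keep the application alive.
          eventThread.interrupt()
        }
        // Graceful mode (the ListenerBus case) simply waits for the queued
        // events to drain before the thread exits.
        eventThread.join()
      }
      ```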


          People

            Assignee: Unassigned
            Reporter: Nan Zhu (codingcat)
            Votes: 0
            Watchers: 1
