[SPARK-940] Do not directly pass Stage objects to SparkListener - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.8.1, 0.9.0
Component/s: Web UI
Labels:
None

Description

Right now the SparkListener interface directly passes `Stage` objects to listeners. This is a problem because it exposes internal objects to Listeners, which is a semi-public interface. Consumers could, e.g. mutate the stages and break Spark.

I recently found an even bigger reason this is a problem. It causes a bunch of extra pointers to RDD's and other objects to remain live when downstream consumers like the Web UI keep references to them. Even though the UI does its own cleaning, it will retain 1000 stages by default, which can reference a large number of RDD's. In a long running Spark streaming program I was running, these references caused the JVM to run out of memory and begin GC thrashing. A heap analysis later revealed that this was the cause.

To fix this, we should make `StageInfo` not just encapsulate a `Stage` (which it does now) and instead be similar to `TaskInfo` with its own distinct fields.

Attachments

Activity

People

Assignee:: Patrick McFadin

Reporter:: Patrick McFadin

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 20/Oct/13 13:05

Updated:: 07/Nov/13 19:55

Resolved:: 26/Oct/13 14:51