Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-940

Do not directly pass Stage objects to SparkListener

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.8.1, 0.9.0
    • Web UI
    • None

    Description

      Right now the SparkListener interface directly passes `Stage` objects to listeners. This is a problem because it exposes internal objects to Listeners, which is a semi-public interface. Consumers could, e.g. mutate the stages and break Spark.

      I recently found an even bigger reason this is a problem. It causes a bunch of extra pointers to RDD's and other objects to remain live when downstream consumers like the Web UI keep references to them. Even though the UI does its own cleaning, it will retain 1000 stages by default, which can reference a large number of RDD's. In a long running Spark streaming program I was running, these references caused the JVM to run out of memory and begin GC thrashing. A heap analysis later revealed that this was the cause.

      To fix this, we should make `StageInfo` not just encapsulate a `Stage` (which it does now) and instead be similar to `TaskInfo` with its own distinct fields.

      Attachments

        Activity

          People

            patrick Patrick McFadin
            patrick Patrick McFadin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: