Description
Right now the `SparkListener` interface passes `Stage` objects directly to listeners. This is a problem because it exposes internal objects through a semi-public interface. Consumers could, for example, mutate the stages and break Spark.
I recently found an even bigger reason this is a problem: it keeps extra pointers to RDDs and other internal objects live whenever downstream consumers like the Web UI hold references to them. Even though the UI does its own cleaning, it retains 1000 stages by default, each of which can reference a large number of RDDs. In a long-running Spark Streaming program I was running, these references caused the JVM to run out of memory and begin GC thrashing; a heap analysis later confirmed this was the cause.
To fix this, `StageInfo` should no longer encapsulate a `Stage` (as it does now). Instead, like `TaskInfo`, it should carry its own distinct fields, copied from the `Stage` at the time the event is posted.
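A minimal sketch of the idea, with stand-in classes and illustrative field names rather than the real Spark API:

```scala
// Stand-ins for Spark's internal objects (not the actual classes).
class RDD(val name: String)
class Stage(val id: Int, val name: String,
            val numTasks: Int, val rdd: RDD)

// Proposed shape: StageInfo carries flat, immutable fields, like TaskInfo,
// instead of wrapping a Stage.
class StageInfo(val stageId: Int, val name: String,
                val numTasks: Int, val rddName: String)

object StageInfo {
  // Copy primitives out of the Stage when the event is posted. No reference
  // to the Stage or its RDD graph survives, so a listener (or the UI)
  // retaining 1000 StageInfos no longer pins 1000 RDD lineages in memory.
  def fromStage(stage: Stage): StageInfo =
    new StageInfo(stage.id, stage.name, stage.numTasks, stage.rdd.name)
}
```

Listeners would then receive the `StageInfo` snapshot only; the `Stage` itself stays private to the scheduler and becomes garbage-collectable as soon as the scheduler is done with it.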