DAGScheduler sometimes generate incorrect stage graph.
Suppose you have the following DAG (please see this in monospaced font):
Note:  means an RDD, () means a shuffle dependency.
Here, RDD B has a shuffle dependency on RDD A, and RDD C has shuffle dependency on both B and A. The shuffle dependency IDs are numbers in the DAGScheduler, but to make the example easier to understand, let's call the shuffled data from A shuffle dependency ID s_A and the shuffled data from B shuffle dependency ID s_B.
The getAncestorShuffleDependencies method in DAGScheduler (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when getAncestorShuffleDependencies gets called on C (previous of the final RDD), getAncestorShuffleDependencies will return s_A, s_B, s_A (s_A gets added twice: once when the method "visit"s RDD C, and once when the method "visit"s RDD B). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by getAncestorShuffleDependencies, resulting in duplicate map stages that compute the map output from RDD A.
As a result, DAGScheduler generates the following stages and their parents for each shuffle:
|s_B||ShuffleMapStage 1||List(ShuffleMapStage 0)|
|s_C||ShuffleMapStage 3||List(ShuffleMapStage 1, ShuffleMapStage 2)|
|-||ResultStage 4||List(ShuffleMapStage 3)|
The stage for s_A should be ShuffleMapStage 0, but the stage for s_A is generated twice as ShuffleMapStage 2 and ShuffleMapStage 0 is overwritten by ShuffleMapStage 2, and the stage ShuffleMap Stage1 keeps referring the old stage ShuffleMapStage 0.