Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
None
-
None
Description
DAGScheduler sometimes generate incorrect stage graph.
Suppose you have the following DAG (please see this in monospaced font):
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D] \ / <-------------
Note: [] means an RDD, () means a shuffle dependency.
Here, RDD B has a shuffle dependency on RDD A, and RDD C has shuffle dependency on both B and A. The shuffle dependency IDs are numbers in the DAGScheduler, but to make the example easier to understand, let's call the shuffled data from A shuffle dependency ID s_A and the shuffled data from B shuffle dependency ID s_B.
The getAncestorShuffleDependencies method in DAGScheduler (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when getAncestorShuffleDependencies gets called on C (previous of the final RDD), getAncestorShuffleDependencies will return s_A, s_B, s_A (s_A gets added twice: once when the method "visit"s RDD C, and once when the method "visit"s RDD B). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by getAncestorShuffleDependencies, resulting in duplicate map stages that compute the map output from RDD A.
As a result, DAGScheduler generates the following stages and their parents for each shuffle:
stage | parents | |
s_A | ShuffleMapStage 2 | List() |
s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
- | ResultStage 4 | List(ShuffleMapStage 3) |
The stage for s_A should be ShuffleMapStage 0, but the stage for s_A is generated twice as ShuffleMapStage 2 and ShuffleMapStage 0 is overwritten by ShuffleMapStage 2, and the stage ShuffleMap Stage1 keeps referring the old stage ShuffleMapStage 0.
Attachments
Issue Links
- is duplicated by
-
SPARK-10842 Eliminate create duplicate stage while generate job dag
- Resolved
- is related to
-
SPARK-14269 Eliminate unnecessary submitStage() call.
- Resolved
- links to