[SPARK-13902] Make DAGScheduler not to create duplicate stage. - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0.0
Component/s: Scheduler, Spark Core
Labels:
None

Description

DAGScheduler sometimes generate incorrect stage graph.

Suppose you have the following DAG (please see this in monospaced font):

[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
            \                /
              <-------------

Note: [] means an RDD, () means a shuffle dependency.

Here, RDD B has a shuffle dependency on RDD A, and RDD C has shuffle dependency on both B and A. The shuffle dependency IDs are numbers in the DAGScheduler, but to make the example easier to understand, let's call the shuffled data from A shuffle dependency ID s_A and the shuffled data from B shuffle dependency ID s_B.
The getAncestorShuffleDependencies method in DAGScheduler (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when getAncestorShuffleDependencies gets called on C (previous of the final RDD), getAncestorShuffleDependencies will return s_A, s_B, s_A (s_A gets added twice: once when the method "visit"s RDD C, and once when the method "visit"s RDD B). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by getAncestorShuffleDependencies, resulting in duplicate map stages that compute the map output from RDD A.

As a result, DAGScheduler generates the following stages and their parents for each shuffle:

	stage	parents
s_A	ShuffleMapStage 2	List()
s_B	ShuffleMapStage 1	List(ShuffleMapStage 0)
s_C	ShuffleMapStage 3	List(ShuffleMapStage 1, ShuffleMapStage 2)
-	ResultStage 4	List(ShuffleMapStage 3)

The stage for s_A should be ShuffleMapStage 0, but the stage for s_A is generated twice as ShuffleMapStage 2 and ShuffleMapStage 0 is overwritten by ShuffleMapStage 2, and the stage ShuffleMap Stage1 keeps referring the old stage ShuffleMapStage 0.

Attachments

Issue Links

is duplicated by

SPARK-10842 Eliminate create duplicate stage while generate job dag

Resolved

is related to

SPARK-14269 Eliminate unnecessary submitStage() call.

Resolved

links to

[Github] Pull Request #11720 (ueshin)

[Github] Pull Request #12060 (ueshin)

[Github] Pull Request #12655 (ueshin)

Activity

People

Assignee:: Takuya Ueshin

Reporter:: Takuya Ueshin

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 15/Mar/16 08:44

Updated:: 17/May/20 17:48

Resolved:: 12/May/16 19:37