[SPARK-7308] Should there be multiple concurrent attempts for one stage? - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.3.1
Fix Version/s: 1.5.3, 1.6.0
Component/s: Spark Core
Labels: None

Description

Currently, when there is a fetch failure, you can end up with multiple concurrent attempts for the same stage. Is this intended? At best, it leads to some very confusing behavior and makes it hard for the user to make sense of what is going on. At worst, I think this is the cause of some very strange errors we've seen from users, where stages start executing before all of the stages they depend on have completed.

This can happen in the following scenario: there is a fetch failure in attempt 0, so the stage is retried and attempt 1 starts. But tasks from attempt 0 are still running, and some of them can also hit fetch failures after attempt 1 has started. Each of those late failures can cause an additional stage attempt to be launched.
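The scenario can be illustrated with a small toy model. This is only a sketch, not Spark code: `ToyStage` and `on_fetch_failure_stage_only` are hypothetical names, and the model assumes a resubmission happens immediately after a failure, whereas the real scheduler resubmits asynchronously.

```python
from dataclasses import dataclass


@dataclass
class ToyStage:
    """Hypothetical stand-in for a scheduler stage; not Spark's Stage class."""
    stage_id: int
    attempts_started: int = 0
    running: bool = False

    def start_attempt(self) -> int:
        """Start a new attempt and return its id (0, 1, 2, ...)."""
        attempt_id = self.attempts_started
        self.attempts_started += 1
        self.running = True
        return attempt_id


def on_fetch_failure_stage_only(stage: ToyStage) -> None:
    """React to a fetch failure by looking only at whether the stage is running.

    The handler has no idea which attempt the failed task belonged to, so a
    straggler from an old attempt is treated like a failure in the new one.
    """
    if stage.running:
        stage.running = False   # mark the stage as failed
        stage.start_attempt()   # toy assumption: resubmit right away


stage = ToyStage(stage_id=3)
stage.start_attempt()                # attempt 0 starts
on_fetch_failure_stage_only(stage)   # a task in attempt 0 fails: attempt 1 starts
on_fetch_failure_stage_only(stage)   # a straggler from attempt 0 fails: attempt 2 starts
print(stage.attempts_started)
```

In this model a single bad shuffle fetch seen by two tasks of attempt 0 yields three attempts in total, even though attempt 1 itself never failed.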

There is already an attempt to handle this in DAGScheduler.scala: https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105

However, that only checks whether the *stage* is running. It really should check whether that *attempt* is still running, but there isn't enough information available at that point to do so.
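To make the distinction concrete, here is a minimal sketch of an attempt-aware check. It assumes each failure event carries the id of the stage attempt its task belonged to. All names (`ToyStage`, `latest_attempt_id`, `on_fetch_failure`) are hypothetical and do not reflect the DAGScheduler API or the fix that was eventually merged.

```python
from dataclasses import dataclass


@dataclass
class ToyStage:
    """Hypothetical stand-in for a scheduler stage; not Spark's Stage class."""
    stage_id: int
    latest_attempt_id: int = -1
    running: bool = False

    def start_attempt(self) -> int:
        """Start a new attempt and return its id."""
        self.latest_attempt_id += 1
        self.running = True
        return self.latest_attempt_id


def on_fetch_failure(stage: ToyStage, task_attempt_id: int) -> bool:
    """Handle a fetch failure; return True if a new attempt was started.

    Assumption: the event tells us which stage attempt the failed task
    came from, which is the information the issue says is missing.
    """
    if task_attempt_id != stage.latest_attempt_id:
        return False            # stale failure from an older attempt: ignore
    if not stage.running:
        return False            # this attempt was already marked as failed
    stage.running = False
    stage.start_attempt()       # toy assumption: resubmit right away
    return True


stage = ToyStage(stage_id=3)
first = stage.start_attempt()              # attempt 0
retried = on_fetch_failure(stage, first)   # failure in attempt 0: attempt 1 starts
ignored = on_fetch_failure(stage, first)   # straggler from attempt 0: no new attempt
print(retried, ignored, stage.latest_attempt_id)
```

With the attempt id compared against the latest attempt, the straggler's failure no longer starts a third attempt; only a failure inside attempt 1 would.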

I'll also post some info on how to reproduce this.

Attachments

SPARK-7308_discussion.pdf, 21/May/15 19:34, 139 kB, Imran Rashid

Issue Links

blocks
SPARK-5945 Spark should not retry a stage infinitely on a FetchFailedException (Closed)

relates to
SPARK-7829 SortShuffleWriter writes inconsistent data & index files on stage retry (Resolved)

requires
SPARK-8029 ShuffleMapTasks must be robust to concurrent attempts on the same executor (Resolved)
SPARK-8103 DAGScheduler should not launch multiple concurrent attempts for one stage on fetch failures (Resolved)

links to

[Github] Pull Request #5844 (squito)

[Github] Pull Request #5964 (squito)

People

Assignee: Davies Liu
Reporter: Imran Rashid
Votes: 1
Watchers: 12

Dates

Created: 01/May/15 19:27
Updated: 13/Nov/15 22:35
Resolved: 13/Nov/15 22:10