Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: 3.0.0
- Labels: None
Description
This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243
To completely fix that problem, Spark needs to be able to roll back a shuffle map stage and rerun all of its map tasks.
According to https://github.com/apache/spark/pull/9214, Spark doesn't currently support this, because in shuffle writing "first write wins".
Since overwriting shuffle files is hard, we can instead extend the shuffle id to include a "shuffle generation number", so that a reduce task can specify which generation of the shuffle it wants to read. https://github.com/apache/spark/pull/6648 seems to be a step in the right direction.
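Below is a minimal sketch of the idea, not Spark's actual internals: the names ShuffleGenerationId and ShuffleOutputTracker are hypothetical and introduced only to illustrate how a generation number attached to the shuffle id could let a rolled-back map stage re-register its outputs under a new generation, while reducers pin the generation they read.
{code:scala}
import scala.collection.mutable

// Hypothetical sketch only; these types are not part of Spark.
// A shuffle is identified by its original shuffle id plus a generation that
// is bumped every time the map stage is rolled back and rerun.
case class ShuffleGenerationId(shuffleId: Int, generation: Int)

class ShuffleOutputTracker {
  // Latest generation per shuffle id.
  private val currentGeneration = mutable.Map.empty[Int, Int]
  // Map output files registered per (shuffle id, generation).
  private val outputs = mutable.Map.empty[ShuffleGenerationId, Vector[String]]

  // Rolling back a map stage bumps the generation. Old files stay on disk
  // ("first write wins" is never violated); they are simply no longer read.
  def rollback(shuffleId: Int): Int = {
    val next = currentGeneration.getOrElse(shuffleId, 0) + 1
    currentGeneration(shuffleId) = next
    next
  }

  def latestGeneration(shuffleId: Int): Int =
    currentGeneration.getOrElse(shuffleId, 0)

  def registerMapOutput(id: ShuffleGenerationId, file: String): Unit =
    outputs(id) = outputs.getOrElse(id, Vector.empty) :+ file

  // A reduce task asks for an explicit generation, so it never mixes map
  // outputs written before and after a rollback.
  def outputsFor(id: ShuffleGenerationId): Seq[String] =
    outputs.getOrElse(id, Vector.empty)
}

object ShuffleGenerationDemo {
  def main(args: Array[String]): Unit = {
    val tracker = new ShuffleOutputTracker
    tracker.registerMapOutput(ShuffleGenerationId(0, 0), "shuffle_0_map_0.data")

    // A fetch failure triggers a rollback: the map stage reruns under
    // generation 1 and reducers are told to read generation 1 only.
    val gen = tracker.rollback(0)
    tracker.registerMapOutput(ShuffleGenerationId(0, gen), "shuffle_0_gen1_map_0.data")

    println(tracker.outputsFor(ShuffleGenerationId(0, gen)))
    // Vector(shuffle_0_gen1_map_0.data)
  }
}
{code}
Under this sketch's assumptions, stale map outputs never need to be overwritten or deleted eagerly; bumping the generation is enough to make every reducer of the retried stage see a consistent set of freshly regenerated shuffle files.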
Attachments
Issue Links
- blocks
  - SPARK-28845 Enable spark.sql.execution.sortBeforeRepartition only for retried stages (Resolved)
- causes
  - SPARK-32124 [SHS] Failed to parse FetchFailed TaskEndReason from event log produce by Spark 2.4 (Resolved)
- is related to
  - SPARK-25342 Support rolling back a result stage (In Progress)
- relates to
  - SPARK-23243 Shuffle+Repartition on an RDD could lead to incorrect answers (Resolved)
- links to