[SPARK-23243] Shuffle+Repartition on an RDD could lead to incorrect answers - ASF JIRA

RDD repartition also distributes data round-robin, so it can produce incorrect answers for RDD workloads in the same way as described in https://issues.apache.org/jira/browse/SPARK-23207
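To make the failure mode concrete, here is a minimal sketch in plain Python (no Spark involved) of why round-robin assignment is sensitive to input order. The `round_robin` helper, the sample records, and the fixed start position of 0 are illustrative assumptions, not Spark's actual implementation; the sketch only models the idea that a record's target partition depends on its arrival position rather than on its value.

```python
def round_robin(records, num_partitions, start=0):
    """Assign each record to a partition based only on its arrival position.

    Hypothetical helper: a simplified model of round-robin distribution,
    not the code Spark runs.
    """
    partitions = [[] for _ in range(num_partitions)]
    for offset, record in enumerate(records):
        partitions[(start + offset) % num_partitions].append(record)
    return partitions


# The same four records arriving in two different orders, e.g. the first
# attempt of an upstream task and a rerun of it after a failure.
first_attempt = ["a", "b", "c", "d"]
retry_attempt = ["b", "a", "c", "d"]

out_first = round_robin(first_attempt, 2)  # [['a', 'c'], ['b', 'd']]
out_retry = round_robin(retry_attempt, 2)  # [['b', 'c'], ['a', 'd']]

# Suppose output partition 0 was consumed from the first attempt and
# output partition 1 from the rerun: 'a' appears twice and 'b' is lost.
mixed = out_first[0] + out_retry[1]
print(sorted(mixed))  # ['a', 'a', 'c', 'd']
```

In this model the answer is only correct if every downstream reader sees the output of the same attempt, or if the input order is guaranteed to be identical across attempts.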

The approach that fixes DataFrame.repartition() does not apply to the RDD repartition issue, as discussed in https://github.com/apache/spark/pull/20393#issuecomment-360912451

This task tracks alternative solutions for this issue.

is duplicated by

SPARK-25156 Same query returns different result

is related to

SPARK-28699 Cache an indeterminate RDD could lead to incorrect result while stage rerun

SPARK-25342 Support rolling back a result stage

SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files

relates to

SPARK-23207 Shuffle+Repartition on an DataFrame could lead to incorrect answers

SPARK-29042 Sampling-based RDD with unordered input should be INDETERMINATE

links to

[Github] Pull Request #20414 (jiangxb1987)
