Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.4.0
-
None
-
Ubuntu 14.04 LTS
Description
Please see pull request for more information.
Currently, PySpark will run an unnecessary comboOp on each partition, combining zeroValue and the results of mapPartitions. Since the zeroValue used in this comboOp is the same reference as the zeroValue used for mapPartitions in each partition, unexpected behavior can happen if zeroValue is a mutable object.
Instead, RDD.aggregate() should do a reduction on the results of each mapPartitions task. This way, we remove the unnecessary initial comboOp on each partition and also correct the unexpected behavior for mutable zeroValues.
Attachments
Issue Links
- duplicates
-
SPARK-6551 Incorrect aggregate results if seqOp(...) mutates its first argument
- Resolved
- links to