Spark / SPARK-9021

Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions().fold()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.2, 1.5.0
    • Component/s: PySpark
    • Labels: None
    • Environment: Ubuntu 14.04 LTS

    Description

      Please see the pull request for more information.

      Currently, PySpark runs an unnecessary combOp on each partition, combining zeroValue with the results of mapPartitions. Since the zeroValue used in this combOp is the same reference as the zeroValue used by mapPartitions within each partition, unexpected behavior can occur if zeroValue is a mutable object.
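      For illustration, a minimal reproduction sketch (the data, partition count, and accumulator shape are assumptions for the example, not taken from this issue). On an affected version such as 1.4.0, a seqOp that mutates a list zeroValue in place can return an inflated sum; on fixed versions it returns the expected [10]:

          from pyspark import SparkContext

          sc = SparkContext("local[2]", "spark-9021-repro")

          def seq_op(acc, x):
              acc[0] += x      # mutates the zeroValue list in place
              return acc

          def comb_op(a, b):
              a[0] += b[0]
              return a

          # Expected result: [10]. On affected versions, the extra per-partition
          # combOp(zeroValue, partitionResult) operated on the same mutated list,
          # so the sum could be double-counted.
          print(sc.parallelize([1, 2, 3, 4], 2).aggregate([0], seq_op, comb_op))
          sc.stop()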

      Instead, RDD.aggregate() should do a reduction over the results of the mapPartitions tasks. This way, we remove the unnecessary initial combOp on each partition and also correct the unexpected behavior for mutable zeroValues.
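      As a rough sketch of the proposed shape, with plain Python standing in for the RDD machinery (the function name and the per-partition copying are illustrative assumptions, not the actual Spark source):

          import copy
          from functools import reduce

          def aggregate_sketch(partitions, zeroValue, seqOp, combOp):
              # Fold each partition with seqOp, starting from a fresh copy of
              # zeroValue so no partition mutates a shared reference.
              def fold_partition(part):
                  acc = copy.deepcopy(zeroValue)
                  for obj in part:
                      acc = seqOp(acc, obj)
                  return acc

              # reduce(mapPartitions()): a single driver-side reduction over the
              # per-partition results, with no extra combOp(zeroValue, ...) run
              # on each partition.
              return reduce(combOp, map(fold_partition, partitions),
                            copy.deepcopy(zeroValue))

          def seq_op(acc, x):
              acc[0] += x
              return acc

          def comb_op(a, b):
              a[0] += b[0]
              return a

          print(aggregate_sketch([[1, 2], [3, 4]], [0], seq_op, comb_op))  # [10]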


            People

              Assignee: Nicholas Hwang (njhwang)
              Reporter: Nicholas Hwang (njhwang)
              Votes: 0
              Watchers: 2
