[SPARK-8137] Improve treeAggregate to combine all data on one machine first - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 1.4.0
Fix Version/s: None
Component/s: Spark Core
Labels:
- bulk-closed

Description

Currently, when multiple partitions reside on the same machine, treeAggregate shuffles them without first aggregating them locally. Once shuffle locality is supported, we can get this for free by using the executor IDs as the aggregation keys. An example implementation is available at https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/util/Utils.scala#L96
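
To illustrate the idea, here is a minimal pure-Python sketch, not Spark code and not the linked ml-matrix implementation. It assumes each partition is tagged with the id of the executor holding it, folds every partition, merges the partial results that share an executor id, and only then runs the tree-style combine. All function names, the `fan_in` parameter, and the sample executor ids are hypothetical and chosen for illustration only.

```python
from functools import reduce


def fold_partitions(partitions, zero, seq_op):
    """Fold each partition's records into one partial result, keeping its executor id.

    Assumes `zero` is immutable, since it is reused for every partition.
    """
    return [(executor_id, reduce(seq_op, records, zero))
            for executor_id, records in partitions]


def combine_per_executor(partials, comb_op):
    """Merge partial results that share an executor id (the step that needs no shuffle)."""
    merged = {}
    for executor_id, value in partials:
        if executor_id in merged:
            merged[executor_id] = comb_op(merged[executor_id], value)
        else:
            merged[executor_id] = value
    return merged


def tree_combine(values, zero, comb_op, fan_in=2):
    """Combine values level by level, merging up to `fan_in` values per step."""
    level = list(values)
    if not level:
        return zero
    while len(level) > 1:
        level = [reduce(comb_op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


def locality_aware_tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Fold partitions, merge per executor, then tree-combine one value per executor."""
    partials = fold_partitions(partitions, zero, seq_op)
    per_executor = combine_per_executor(partials, comb_op)
    return tree_combine(per_executor.values(), zero, comb_op, fan_in)


# Hypothetical layout: five partitions spread over three executors.
partitions = [
    ("exec-1", [1, 2]), ("exec-1", [3]),
    ("exec-2", [4, 5]), ("exec-2", [6]),
    ("exec-3", [7]),
]
add = lambda a, b: a + b

partials = fold_partitions(partitions, 0, add)
per_executor = combine_per_executor(partials, add)
total = locality_aware_tree_aggregate(partitions, 0, add, add)
print(len(partials), len(per_executor), total)
```

In this toy layout, five partial results collapse to three before the tree combine starts, so fewer values would have to cross the network in the first level of the tree.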

Issue Links

links to

[Github] Pull Request #7461 (kmadhugit)

People

Assignee: Unassigned

Reporter: Shivaram Venkataraman

Shepherd: Josh Rosen

Votes: 1

Watchers: 8

Dates

Created: 06/Jun/15 00:20

Updated: 21/May/19 04:33

Resolved: 21/May/19 04:33