[SPARK-2568] RangePartitioner should go through the data only once - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: Spark Core
Labels: None
Target Version/s: 1.1.0

Description

As of Spark 1.0, RangePartitioner goes through the data twice: once to count the records and once to sample them. As a result, sortByKey makes three passes over the data (one to count, one to sample, and one to sort).

RangePartitioner should go through data only once (remove the count step).
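
To illustrate how the count step can be folded into the sampling step, here is a minimal Python sketch, not the code from the linked pull request. It assumes reservoir sampling per partition, which yields a fixed-size sample and the record count from the same pass; the sampled keys are then weighted by how many records each one stands for and used to pick range bounds. The names `reservoir_sample_and_count` and `range_bounds` are hypothetical and are not Spark APIs.

```python
import random
from typing import Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample_and_count(
    items: Iterable[T], k: int, seed: int = 0
) -> Tuple[List[T], int]:
    """Return up to k uniformly sampled items and the total count, in one pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / count.
            j = rng.randrange(count)
            if j < k:
                reservoir[j] = item
    return reservoir, count


def range_bounds(
    sketches: Sequence[Tuple[List[T], int]], num_partitions: int
) -> List[T]:
    """Pick at most num_partitions - 1 upper bounds from per-partition sketches.

    Each sketch is (sample, count). A sampled key stands for
    count / len(sample) records of its partition.
    """
    weighted = []
    for sample, count in sketches:
        if sample:
            weight = count / len(sample)
            weighted.extend((key, weight) for key in sample)
    weighted.sort(key=lambda kw: kw[0])

    total = sum(w for _, w in weighted)
    step = total / num_partitions
    bounds: List[T] = []
    acc = 0.0
    target = step
    for key, w in weighted:
        acc += w
        if acc >= target and len(bounds) < num_partitions - 1:
            # Skip duplicate keys so the bounds stay strictly increasing.
            if not bounds or key > bounds[-1]:
                bounds.append(key)
                target += step
    return bounds


# Hypothetical input: two "partitions" of very different sizes.
partitions = [range(0, 1000), range(1000, 3000)]
sketches = [
    reservoir_sample_and_count(p, k=50, seed=i) for i, p in enumerate(partitions)
]
bounds = range_bounds(sketches, num_partitions=4)
```

Because the count comes out of the sampling pass, no separate counting pass is needed before the sort. The weighting matters when partitions differ in size: without it, a small partition would be over-represented in the bounds.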

Issue Links

is related to

SPARK-1021 sortByKey() launches a cluster job when it shouldn't

Resolved

links to

[Github] Pull Request #1562 (mengxr)

People

Assignee: Xiangrui Meng

Reporter: Reynold Xin

Votes: 0

Watchers: 3

Dates

Created: 18/Jul/14 01:35

Updated: 30/Jul/14 05:17

Resolved: 30/Jul/14 05:16