[SPARK-1021] sortByKey() launches a cluster job when it shouldn't - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 0.8.0, 0.9.0, 1.0.0, 1.1.0
Fix Version/s: None
Component/s: Spark Core
Labels:
None

Description

The sortByKey() method is listed as a transformation, not an action, in the documentation. But it launches a cluster job regardless.

http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html

Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method.

https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102

Josh Rosen suggests that rangeBounds should be made into a lazy variable:

I wonder whether making RangePartitoner .rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance since it's probably unlikely that the range partitioners are equal by chance.

Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing.

Attachments

Issue Links

is blocked by

SPARK-4514 SparkContext localProperties does not inherit property updates across thread reuse

Resolved

is depended upon by

SPARK-3145 Hive on Spark umbrella

Resolved

is related to

HIVE-9370 SparkJobMonitor timeout as sortByKey would launch extra Spark job before original job get submitted [Spark Branch]

Resolved

SPARK-1852 SparkSQL Queries with Sorts run before the user asks them to

Resolved

relates to

SPARK-9999 Dataset API on top of Catalyst/DataFrame

Resolved

SPARK-2568 RangePartitioner should go through the data only once

Resolved

links to

[Github] Pull Request #1689 (erikerlandson)

[Github] Pull Request #3079 (erikerlandson)

(1 relates to, 2 links to)

Activity

People

Assignee:: Erik Erlandson

Reporter:: Andrew Ash

Votes:: 2 Vote for this issue

Watchers:: 21 Start watching this issue

Dates

Created:: 09/Jan/14 22:47

Updated:: 12/Nov/15 05:19

Resolved:: 03/Nov/15 15:31