[SPARK-2203] PySpark does not infer default numPartitions in same way as Spark - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 1.1.0
Component/s: PySpark
Labels: None

Description

For shuffle-based operators such as rdd.groupBy() or rdd.sortByKey(), PySpark always uses ctx.defaultParallelism as the number of reduce-side partitions. That value is a constant, typically determined by the number of cores in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner uses the same number of reduce partitions as map partitions unless the default parallelism config (spark.default.parallelism) is explicitly set. This tends to be a better default for avoiding OOMs, and PySpark should behave the same way.
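
The difference between the two defaults can be sketched in plain Python. This is only an illustration of the rule as described above, not the PySpark or Spark source; the function names, parameters, and the cluster numbers are hypothetical.

```python
from typing import Optional


def old_pyspark_default(num_map_partitions: int, ctx_default_parallelism: int) -> int:
    """Behavior described in this report: the map-side partition count is
    ignored and ctx.defaultParallelism is always used (hypothetical helper)."""
    return ctx_default_parallelism


def spark_style_default(
    num_map_partitions: int,
    configured_parallelism: Optional[int] = None,
) -> int:
    """Rule this report attributes to Partitioner#defaultPartitioner:
    use the explicitly configured parallelism if there is one,
    otherwise keep the number of map partitions (hypothetical helper)."""
    if configured_parallelism is not None:
        return configured_parallelism
    return num_map_partitions


# Assumed example: 2000 map partitions on a 16-core cluster.
print(old_pyspark_default(2000, 16))  # 16
print(spark_style_default(2000))      # 2000
print(spark_style_default(2000, 64))  # 64
```

Under these assumed numbers, the old default squeezes 2000 map partitions into 16 reduce partitions, so each reduce task would handle roughly 125 times as much data as a map task if the keys are evenly spread. That is the kind of situation in which OOMs become likely.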

Issue Links

links to: GitHub pull request

People

Assignee: Aaron Davidson

Reporter: Aaron Davidson

Votes: 0

Watchers: 1

Dates

Created: 19/Jun/14 19:43

Updated: 20/Jun/14 07:07

Resolved: 20/Jun/14 07:07