Description
PySpark uses batching when serializing and deserializing Python objects. By default, it serializes objects in groups of 1024.
The current batching code causes SparkContext.parallelize() to behave counterintuitively when parallelizing small datasets. It batches the objects first and then parallelizes the batches, so a call to parallelize() with fewer elements than the batch size is unaffected by the requested number of partitions:
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[], [1, 2, 3, 4]]
Instead, parallelize() should first partition the elements and then batch them:
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
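To make the difference concrete, here is a pure-Python sketch of the two orderings. The batch() and split_evenly() helpers are illustrative only (they are not the actual PySpark implementation), assuming the default batch size of 1024 and a simple contiguous partitioner:

def batch(items, batch_size=1024):
    # Group items into lists of at most batch_size elements.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def split_evenly(items, num_partitions):
    # Split a list into num_partitions contiguous chunks.
    n = len(items)
    return [items[(n * i) // num_partitions:(n * (i + 1)) // num_partitions]
            for i in range(num_partitions)]

data = [1, 2, 3, 4]

# Current order: batch first, then partition the (single) batch.
print(split_evenly(batch(data), 2))               # [[], [[1, 2, 3, 4]]]
# After unbatching on the workers, glom() sees [[], [1, 2, 3, 4]].

# Proposed order: partition first, then batch within each partition.
print([batch(p) for p in split_evenly(data, 2)])  # [[[1, 2]], [[3, 4]]]
# After unbatching, glom() sees [[1, 2], [3, 4]].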
Maybe parallelize() should also accept an option to control the batch size (right now, the batch size can only be set when creating the SparkContext).
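As a rough sketch of what that might look like (the batchSize keyword on parallelize() is hypothetical and only illustrates the proposal; only the SparkContext-level batchSize exists today):

from pyspark import SparkContext

# Existing knob: the batch size is fixed for the whole context.
sc = SparkContext("local", "batching-example", batchSize=2)
rdd = sc.parallelize([1, 2, 3, 4], 2)

# Proposed (hypothetical) per-call override:
# rdd = sc.parallelize([1, 2, 3, 4], 2, batchSize=2)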