Description
PySpark uses batching when serializing and deserializing Python objects. By default, it serializes objects in groups of 1024.
The current batching code causes SparkContext.parallelize() to behave counterintuitively when parallelizing small datasets. It batches the objects first and then parallelizes the batches, so a call to parallelize() with fewer elements than the batch size is unaffected by the requested number of partitions:
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[], [1, 2, 3, 4]]
Instead, parallelize() should first partition the elements and then batch them:
>>> rdd = sc.parallelize([1, 2, 3, 4], 2)
>>> rdd.glom().collect()
[[1, 2], [3, 4]]
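To make the difference concrete, here is a pure-Python sketch of the two orderings. The batch() and split_evenly() helpers are illustrative only (they are not the actual PySpark implementation), assuming the default batch size of 1024 and a simple contiguous partitioner:

def batch(items, batch_size=1024):
    # Group items into lists of at most batch_size elements.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def split_evenly(items, num_partitions):
    # Split a list into num_partitions contiguous chunks.
    n = len(items)
    return [items[(n * i) // num_partitions:(n * (i + 1)) // num_partitions]
            for i in range(num_partitions)]

data = [1, 2, 3, 4]

# Current order: batch first, then partition the (single) batch.
print(split_evenly(batch(data), 2))               # [[], [[1, 2, 3, 4]]]
# After unbatching on the workers, glom() sees [[], [1, 2, 3, 4]].

# Proposed order: partition first, then batch within each partition.
print([batch(p) for p in split_evenly(data, 2)])  # [[[1, 2]], [[3, 4]]]
# After unbatching, glom() sees [[1, 2], [3, 4]].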
Maybe parallelize() should also accept an option to control the batch size (right now, the batch size can only be set when creating the SparkContext).
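As a rough sketch of what that might look like (the batchSize keyword on parallelize() is hypothetical and only illustrates the proposal; only the SparkContext-level batchSize exists today):

from pyspark import SparkContext

# Existing knob: the batch size is fixed for the whole context.
sc = SparkContext("local", "batching-example", batchSize=2)
rdd = sc.parallelize([1, 2, 3, 4], 2)

# Proposed (hypothetical) per-call override:
# rdd = sc.parallelize([1, 2, 3, 4], 2, batchSize=2)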