Spark / SPARK-815

PySpark's parallelize() should batch objects after partitioning (instead of before)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.7.0, 0.7.1, 0.7.2, 0.7.3
    • Fix Version/s: 0.7.3, 0.8.0
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark uses batching when serializing and deserializing Python objects. By default, it serializes objects in groups of 1024.
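
      As a rough sketch of what this batching means (this is not PySpark's actual serializer code, and the helper name is made up for illustration), elements are collected into lists of up to batchSize and each list is pickled as a single unit:

      import pickle

      def serialize_in_batches(items, batchSize=1024):
          # Illustrative helper only: collect elements into lists of up to
          # batchSize and pickle each list as one serialized unit.
          batch = []
          for item in items:
              batch.append(item)
              if len(batch) == batchSize:
                  yield pickle.dumps(batch)
                  batch = []
          if batch:
              yield pickle.dumps(batch)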

      The current batching code causes SparkContext.parallelize() to behave counterintuitively when parallelizing small datasets: it batches the objects first and then parallelizes the batches, so calls to parallelize() with small inputs are unaffected by the requested number of partitions:

      >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
      >>> rdd.glom().collect()
      [[], [1, 2, 3, 4]]
      

      Instead, parallelize() should first partition the elements and then batch them:

      >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
      >>> rdd.glom().collect()
      [[1, 2], [3, 4]]
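
      One way to picture this ordering (a minimal sketch, not PySpark's actual parallelize() implementation; the function and argument names are made up) is to split the input into numSlices partitions first and only then group each partition into batches:

      def partition_then_batch(data, numSlices, batchSize=1024):
          # Split the elements into numSlices roughly equal partitions first...
          n = len(data)
          partitions = [data[(i * n) // numSlices:((i + 1) * n) // numSlices]
                        for i in range(numSlices)]
          # ...then group each partition's elements into batches for serialization.
          return [[part[j:j + batchSize] for j in range(0, len(part), batchSize)]
                  for part in partitions]

      With this ordering, a small input is spread across the requested partitions even though each partition still fits in a single batch:

      >>> partition_then_batch([1, 2, 3, 4], numSlices=2)
      [[[1, 2]], [[3, 4]]]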
      

      Maybe parallelize() should accept an option to control the batch size (right now, it can only be set when creating the SparkContext).
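
      For example, the batch size is currently fixed when the SparkContext is constructed, whereas a per-call override might look like the second line below (both batchSize keywords are shown for illustration only; the parallelize() argument does not exist today):

      >>> sc = SparkContext("local", "My app", batchSize=512)    # set once for the whole context
      >>> rdd = sc.parallelize(range(5000), 4, batchSize=512)    # hypothetical per-call override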

    People

      Assignee: Matei Alexandru Zaharia (matei)
      Reporter: Josh Rosen (joshrosen)
      Votes: 1
      Watchers: 2
