Spark / SPARK-9021

Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions().fold()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.2, 1.5.0
    • Component/s: PySpark
    • Labels: None
    • Environment: Ubuntu 14.04 LTS

    Description

      Please see the pull request for more information.

      Currently, PySpark runs an unnecessary combOp on each partition, combining zeroValue with the results of mapPartitions. Since the zeroValue used in this combOp is the same reference as the zeroValue used by mapPartitions within each partition, unexpected behavior can occur if zeroValue is a mutable object.
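      For illustration, a minimal reproduction sketch (the data, partition count, and accumulator shape are assumptions for the example, not taken from this issue). On an affected version such as 1.4.0, a seqOp that mutates a list zeroValue in place can return an inflated sum; on fixed versions it returns the expected [10]:

          from pyspark import SparkContext

          sc = SparkContext("local[2]", "spark-9021-repro")

          def seq_op(acc, x):
              acc[0] += x      # mutates the zeroValue list in place
              return acc

          def comb_op(a, b):
              a[0] += b[0]
              return a

          # Expected result: [10]. On affected versions, the extra per-partition
          # combOp(zeroValue, partitionResult) operated on the same mutated list,
          # so the sum could be double-counted.
          print(sc.parallelize([1, 2, 3, 4], 2).aggregate([0], seq_op, comb_op))
          sc.stop()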

      Instead, RDD.aggregate() should do a reduction over the results of the mapPartitions tasks. This way, we remove the unnecessary initial combOp on each partition and also correct the unexpected behavior for mutable zeroValues.
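      As a rough sketch of the proposed shape, with plain Python standing in for the RDD machinery (the function name and the per-partition copying are illustrative assumptions, not the actual Spark source):

          import copy
          from functools import reduce

          def aggregate_sketch(partitions, zeroValue, seqOp, combOp):
              # Fold each partition with seqOp, starting from a fresh copy of
              # zeroValue so no partition mutates a shared reference.
              def fold_partition(part):
                  acc = copy.deepcopy(zeroValue)
                  for obj in part:
                      acc = seqOp(acc, obj)
                  return acc

              # reduce(mapPartitions()): a single driver-side reduction over the
              # per-partition results, with no extra combOp(zeroValue, ...) run
              # on each partition.
              return reduce(combOp, map(fold_partition, partitions),
                            copy.deepcopy(zeroValue))

          def seq_op(acc, x):
              acc[0] += x
              return acc

          def comb_op(a, b):
              a[0] += b[0]
              return a

          print(aggregate_sketch([[1, 2], [3, 4]], [0], seq_op, comb_op))  # [10]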


            People

              Assignee: Nicholas Hwang (njhwang)
              Reporter: Nicholas Hwang (njhwang)
              Votes: 0
              Watchers: 2
