Spark / SPARK-8137

Improve treeAggregate to combine all data on one machine first

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      Right now, if multiple partitions reside on the same machine, treeAggregate shuffles them without first aggregating them locally. Once we have support for shuffle locality, we can get this for free by using the executorIds as the keys for aggregation. An example implementation is available at https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/util/Utils.scala#L96
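
      The idea can be illustrated without Spark. The following is a minimal pure-Python sketch, not the linked implementation and not Spark's API: it assumes each partition is tagged with the ID of the executor holding it, and the function name `aggregate_by_executor` and the `(executor_id, records)` layout are hypothetical. It folds each partition, merges partial results that share an executor ID, and only then combines across executors, so the number of values crossing machines drops from one per partition to one per executor.

      ```python
      from functools import reduce


      def aggregate_by_executor(partitions, zero, seq_op, comb_op):
          """Simulate executor-local pre-aggregation.

          partitions: list of (executor_id, records) pairs (hypothetical layout).
          zero must be a neutral element for both seq_op and comb_op.
          Returns (result, number of per-partition partials, number of executors).
          """
          # Step 1: fold every partition into one partial result.
          partials = [
              (executor_id, reduce(seq_op, records, zero))
              for executor_id, records in partitions
          ]

          # Step 2: merge partials that share an executor ID, so each
          # machine contributes a single value to the cross-machine step.
          per_executor = {}
          for executor_id, partial in partials:
              if executor_id in per_executor:
                  per_executor[executor_id] = comb_op(per_executor[executor_id], partial)
              else:
                  per_executor[executor_id] = partial

          # Step 3: combine across executors (stand-in for the tree levels).
          result = reduce(comb_op, per_executor.values(), zero)
          return result, len(partials), len(per_executor)


      # Five partitions spread over two executors.
      partitions = [
          ("exec-1", [1, 2]),
          ("exec-1", [3]),
          ("exec-2", [4, 5]),
          ("exec-2", [6]),
          ("exec-2", [7]),
      ]
      total, without_local, with_local = aggregate_by_executor(
          partitions, 0, lambda acc, x: acc + x, lambda a, b: a + b
      )
      print(total, without_local, with_local)  # 28 5 2
      ```

      In this toy run, five partial results would be shuffled without the local step, versus two with it. The saving in a real cluster depends on how many partitions each executor holds.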

          People

            Assignee: Unassigned
            Reporter: Shivaram Venkataraman (shivaram)
            Shepherd: Josh Rosen
            Votes: 1
            Watchers: 8
