Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-8137

Improve treeAggregate to combine all data on one machine first

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels:
      None

      Description

      Right now if we have multiple partitions on the same machine we shuffle the partitions and don't aggregate them first in treeAggregate. Once we have support for shuffle locality, we can get this for free by using the executorIds as the keys for aggregation. https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/util/Utils.scala#L96 has an example implementation

        Issue Links

          Activity

          Hide
          apachespark Apache Spark added a comment -

          User 'kmadhugit' has created a pull request for this issue:
          https://github.com/apache/spark/pull/7461

          Show
          apachespark Apache Spark added a comment - User 'kmadhugit' has created a pull request for this issue: https://github.com/apache/spark/pull/7461

            People

            • Assignee:
              Unassigned
              Reporter:
              shivaram Shivaram Venkataraman
              Shepherd:
              Josh Rosen
            • Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:

                Development