Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-31948

expose mapSideCombine in aggByKey/reduceByKey/foldByKey

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: ML, Spark Core
    • Labels:
      None

      Description

      1. aggregateByKey, reduceByKey and  foldByKey will always perform mapSideCombine;

      However, this can be skiped sometime, specially in ML (RobustScaler):

      vectors.mapPartitions { iter =>
        if (iter.hasNext) {
          val summaries = Array.fill(numFeatures)(
            new QuantileSummaries(QuantileSummaries.defaultCompressThreshold, relativeError))
          while (iter.hasNext) {
            val vec = iter.next
            vec.foreach { (i, v) => if (!v.isNaN) summaries(i) = summaries(i).insert(v) }
          }
          Iterator.tabulate(numFeatures)(i => (i, summaries(i).compress))
        } else Iterator.empty
      }.reduceByKey { case (s1, s2) => s1.merge(s2) } 

       

      This reduceByKey in RobustScaler does not need mapSideCombine at all, similar places exist in KMeans, GMM, etc;
      To my knowledge, we do not need mapSideCombine if the reduction factor isn't high;

       

      2. treeAggregate and treeReduce are based on foldByKey,  the mapSideCombine in the first call of foldByKey can also be avoided.

       

      SPARK-772:

      Map side combine in group by key case does not reduce the amount of data shuffled. Instead, it forces a lot more objects to go into old gen, and leads to worse GC.

       

      So what about:

      1. exposing mapSideCombine in aggByKey/reduceByKey/foldByKey, so that user can disable unnecessary mapSideCombine

      2. disabling the mapSideCombine in the first call of foldByKey in  treeAggregate and treeReduce

      3. disabling the unnecessary mapSideCombine in ML;

      Friendly ping Sean R. Owen Huaxin Gao Weichen Xu Hyukjin Kwon L. C. Hsieh 

       

       

       

       

       

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                podongfeng zhengruifeng
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated: