1. aggregateByKey, reduceByKey and foldByKey will always perform mapSideCombine;
However, this can sometimes be skipped, especially in ML (e.g. RobustScaler):
This reduceByKey in RobustScaler does not need mapSideCombine at all; similar places exist in KMeans, GMM, etc.
To my knowledge, we do not need mapSideCombine if the reduction factor isn't high.
2. treeAggregate and treeReduce are based on foldByKey; the mapSideCombine in the first call of foldByKey can also be avoided.
In the groupByKey case, map-side combine does not reduce the amount of data shuffled. Instead, it forces many more objects into the old generation and leads to worse GC.
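To illustrate the reduction-factor point, here is a minimal pure-Python toy model (not Spark code). The function name reduce_by_key and the two sample datasets are hypothetical, made up for this sketch. It only counts how many records each simulated map task would hand to the shuffle, with and without a map-side combine:

```python
from operator import add


def reduce_by_key(partitions, func, map_side_combine):
    """Toy reduceByKey over a list of partitions (lists of (key, value)).

    Returns (result_dict, number_of_records_shuffled).
    This is an illustrative model only, not how Spark is implemented.
    """
    shuffled = []
    for part in partitions:
        if map_side_combine:
            # Pre-aggregate inside the partition: one record per distinct key.
            combined = {}
            for k, v in part:
                combined[k] = func(combined[k], v) if k in combined else v
            shuffled.extend(combined.items())
        else:
            # Every input record goes to the shuffle unchanged.
            shuffled.extend(part)

    # Reduce side: merge everything that was shuffled.
    result = {}
    for k, v in shuffled:
        result[k] = func(result[k], v) if k in result else v
    return result, len(shuffled)


# Low reduction factor: each key appears once per partition.
low = [[(k, 1) for k in range(100)] for _ in range(4)]
# High reduction factor: only two distinct keys per partition.
high = [[(k % 2, 1) for k in range(100)] for _ in range(4)]

for name, data in (("low", low), ("high", high)):
    _, with_combine = reduce_by_key(data, add, True)
    _, without_combine = reduce_by_key(data, add, False)
    print(name, with_combine, without_combine)
```

In this toy setup the low-reduction dataset shuffles 400 records either way, so the map-side combine only adds work, while the high-reduction dataset drops from 400 records to 8. Real workloads would also need to account for record size and GC behaviour, which this model ignores.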
So what about:
1. exposing mapSideCombine in aggregateByKey/reduceByKey/foldByKey, so that users can disable unnecessary mapSideCombine;
2. disabling the mapSideCombine in the first call of foldByKey in treeAggregate and treeReduce;
3. disabling the unnecessary mapSideCombine in ML.