Description
When I use ML RandomForest to train a classifier on the news20.binary dataset, which has 19,996 training examples and 1,355,191 features, I found that the shuffle write size of findSplitsBySorting (51 GB) is very large compared with the small data size (133.52 MB). I think replacing groupByKey with reduceByKey would be useful for improving shuffle performance.
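
To illustrate why the two operators can differ so much in shuffle volume, here is a minimal pure-Python sketch. It is not Spark code and not the actual findSplitsBySorting implementation. It only simulates the general behavior that groupByKey sends every (key, value) record through the shuffle, while reduceByKey combines values per key inside each partition first. The function names, the toy partitions, and the use of max as the combine function are all assumptions made for illustration.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Simulate groupByKey: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def shuffle_reduce_by_key(partitions, combine):
    """Simulate reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in partitions:
        local = {}  # map-side combine: at most one record per key per partition
        for key, value in part:
            local[key] = combine(local[key], value) if key in local else value
        shuffled.extend(local.items())
    merged = {}  # reduce side: merge the partial results per key
    for key, value in shuffled:
        merged[key] = combine(merged[key], value) if key in merged else value
    return merged, len(shuffled)


# Hypothetical toy data: two partitions of (feature index, feature value) pairs.
partitions = [
    [(0, 1.0), (0, 2.0), (1, 5.0), (0, 3.0)],
    [(0, 4.0), (1, 6.0), (1, 7.0)],
]

grouped, group_records = shuffle_group_by_key(partitions)
reduced, reduce_records = shuffle_reduce_by_key(partitions, max)

print("groupByKey shuffled records: ", group_records)
print("reduceByKey shuffled records:", reduce_records)
```

In this toy case the grouped path shuffles all 7 records, while the combined path shuffles 4 (one per key per partition). The saving only applies when the per-key work can be written as an associative and commutative combine whose partial result stays small; whether split finding can be expressed that way is exactly what this proposal would need to confirm.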