Details
- Type: Improvement
- Status: Closed
- Priority: Minor
- Resolution: Duplicate
- Labels: None
Description
Currently, in GradientDescent, each row of the input data creates its own gradient matrix object, and these per-row objects are then summed up in the reducer.
We found that when the number of features is on the order of thousands, this becomes the bottleneck. The situation was worse when we tested a Newton optimizer, because the dimension of the Hessian matrix is much larger.
In our testing, with hundreds of thousands of features, GC kicks in for every row of input and sometimes brings down the workers.
By aggregating lossSum and gradientSum within mapPartitions, we solved the GC issue and scale better with the number of features. A sketch of the aggregation pattern is shown below.
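For illustration only, here is a minimal sketch of the per-partition aggregation described above. It assumes an RDD of (label, features) pairs and uses a hypothetical addGradient helper (a simple least-squares gradient) standing in for MLlib's Gradient class; the actual GradientDescent implementation may differ.

```scala
import org.apache.spark.rdd.RDD

object PartitionAggregation extends Serializable {

  // Hypothetical per-example gradient for least squares: accumulates into
  // `cumGradient` in place and returns the loss for this example.
  // This stands in for MLlib's Gradient; the real API may differ.
  def addGradient(label: Double, features: Array[Double],
                  weights: Array[Double], cumGradient: Array[Double]): Double = {
    var dot = 0.0
    var i = 0
    while (i < weights.length) { dot += weights(i) * features(i); i += 1 }
    val diff = dot - label
    i = 0
    while (i < features.length) { cumGradient(i) += diff * features(i); i += 1 }
    diff * diff / 2.0
  }

  // One gradient buffer and one loss accumulator per partition, instead of one
  // gradient object per row, so large temporaries are not allocated (and then
  // garbage-collected) for every example.
  def aggregate(data: RDD[(Double, Array[Double])],
                weights: Array[Double]): (Array[Double], Double) = {
    val n = weights.length
    data.mapPartitions { iter =>
      val gradientSum = new Array[Double](n)
      var lossSum = 0.0
      iter.foreach { case (label, features) =>
        lossSum += addGradient(label, features, weights, gradientSum)
      }
      Iterator((gradientSum, lossSum))
    }.reduce { case ((g1, l1), (g2, l2)) =>
      // Merge the per-partition sums into a single gradient and loss.
      var i = 0
      while (i < n) { g1(i) += g2(i); i += 1 }
      (g1, l1 + l2)
    }
  }
}
```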
Issue Links
- Is contained by: SPARK-1212 Support sparse data in MLlib (Resolved)