Description
MLlib's NaiveBayes, SGD, and KMeans accept RDD[LabeledPoint] for training and RDD[Array[Double]] for prediction, where LabeledPoint is a wrapper of (Double, Array[Double]). Dense Array[Double] storage performs well, but sparse data appears quite often in practice. I created this JIRA to discuss the plan for adding sparse data support to MLlib and to track its progress.
The goal is to support sparse data for training and prediction in all existing algorithms in MLlib:
- Gradient Descent
- K-Means
- Naive Bayes
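To make the dense versus sparse distinction concrete, here is a minimal sketch of one possible sparse representation: parallel arrays of indices and values, with a dot product against dense weights (the core operation in gradient descent). `SparseVec`, `dot`, and `toDense` are hypothetical names chosen for illustration; this is not MLlib's API and not necessarily the design this issue settles on.

```scala
// Hypothetical sketch, not MLlib's API: a sparse vector stored as
// parallel arrays of non-zero indices and their values.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must have equal length")

  // Dot product with a dense weight vector, touching only the stored entries.
  def dot(weights: Array[Double]): Double = {
    require(weights.length == size, "dimension mismatch")
    var sum = 0.0
    var i = 0
    while (i < indices.length) {
      sum += values(i) * weights(indices(i))
      i += 1
    }
    sum
  }

  // Expand to the dense Array[Double] form that the current API expects.
  def toDense: Array[Double] = {
    val out = new Array[Double](size)
    var i = 0
    while (i < indices.length) {
      out(indices(i)) = values(i)
      i += 1
    }
    out
  }
}

object SparseVecDemo {
  def main(args: Array[String]): Unit = {
    // A 5-dimensional vector with non-zeros at positions 1 and 4.
    val x = SparseVec(5, Array(1, 4), Array(2.0, 3.0))
    val w = Array(0.5, 1.0, 0.0, 0.0, 2.0)
    println(x.dot(w)) // 2.0 * 1.0 + 3.0 * 2.0 = 8.0
  }
}
```

With this layout the dot product costs time proportional to the number of non-zeros rather than the full dimension, which is the main motivation for sparse support in the algorithms listed above.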
Previous discussions and pull requests:
Issue Links
- contains
  - SPARK-1401 Use mapPartitions instead of map to avoid creating expensive object in GradientDescent optimizer (Closed)