Spark / SPARK-1212

Support sparse data in MLlib

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 1.0.0
    • Component/s: MLlib
    • Labels: None

    Description

      MLlib's NaiveBayes, SGD, and KMeans accept RDD[LabeledPoint] for training and RDD[Array[Double]] for prediction, where LabeledPoint is a wrapper around (Double, Array[Double]). A dense Array[Double] performs well, but sparse data is common in practice, and storing it densely spends memory and computation on zeros. I created this JIRA to discuss the plan for adding sparse data support to MLlib and to track its progress.
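
      To make the dense versus sparse trade-off concrete, here is a minimal sketch in plain Scala. The names Vec, DenseVec, SparseVec, and SparseDemo are hypothetical and exist only for illustration; they are not the types proposed or adopted in MLlib. The sketch assumes the sparse indices are sorted in increasing order.

```scala
// Hypothetical types for illustration only; not MLlib's API.
sealed trait Vec {
  def size: Int
  def apply(i: Int): Double
}

// Dense storage: one Double per dimension, zeros included.
final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
  def apply(i: Int): Double = values(i)
}

// Sparse storage: only nonzero entries are kept.
// Assumption: indices are strictly increasing and lie in [0, size).
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec {
  require(indices.length == values.length, "indices and values must have equal length")

  // Binary search over the sorted indices; entries that are not stored are zero.
  def apply(i: Int): Double = {
    val pos = java.util.Arrays.binarySearch(indices, i)
    if (pos >= 0) values(pos) else 0.0
  }

  // Number of explicitly stored entries (one Int and one Double each).
  def storedEntries: Int = values.length
}

object SparseDemo {
  def main(args: Array[String]): Unit = {
    val dense  = DenseVec(Array(0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.5, 0.0))
    val sparse = SparseVec(8, Array(2, 6), Array(3.0, 1.5))

    // Both representations describe the same 8-dimensional vector.
    val same = (0 until 8).forall(i => dense(i) == sparse(i))
    println(s"same contents: $same")
    println(s"dense stores ${dense.size} values, sparse stores ${sparse.storedEntries}")
  }
}
```

      The saving grows with dimensionality: a vector with a handful of nonzeros out of millions of features stores only those few index/value pairs instead of every zero.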

      The goal is to support sparse data for training and prediction in all existing algorithms in MLlib:

      • Gradient Descent
      • K-Means
      • Naive Bayes
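
      The algorithms listed above lean on a few vector operations, so sparse support largely means making those operations skip zeros. The following is a rough sketch under stated assumptions, not the implementation tracked by this issue: Sparse, dot, axpy, and squaredDistance are hypothetical names, and weights and cluster centers are assumed to stay dense while the data points are sparse.

```scala
object SparseOps {
  // Hypothetical sparse vector: sorted indices plus the matching nonzero values.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) {
    require(indices.length == values.length, "indices and values must have equal length")
  }

  // Dot product of dense weights with a sparse point.
  // Touches only the stored entries; this is the core of linear prediction.
  def dot(w: Array[Double], x: Sparse): Double = {
    require(w.length == x.size, "dimension mismatch")
    var sum = 0.0
    var k = 0
    while (k < x.indices.length) {
      sum += w(x.indices(k)) * x.values(k)
      k += 1
    }
    sum
  }

  // In-place update w += a * x, as in a gradient step.
  // Only coordinates where x is nonzero change.
  def axpy(a: Double, x: Sparse, w: Array[Double]): Unit = {
    require(w.length == x.size, "dimension mismatch")
    var k = 0
    while (k < x.indices.length) {
      w(x.indices(k)) += a * x.values(k)
      k += 1
    }
  }

  // Squared Euclidean distance between a dense center c and a sparse point x,
  // via ||c - x||^2 = ||c||^2 - 2 (c . x) + ||x||^2.
  // cNormSq is ||c||^2, assumed to be precomputed once per center.
  def squaredDistance(c: Array[Double], cNormSq: Double, x: Sparse): Double = {
    val xNormSq = x.values.map(v => v * v).sum
    cNormSq - 2.0 * dot(c, x) + xNormSq
  }
}
```

      Each operation costs time proportional to the number of nonzeros rather than the full dimension. The expanded distance formula can lose precision when the center and the point are nearly equal, so a real implementation may need a fallback for that case.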

      Previous discussions and pull requests:

            People

              Assignee: Xiangrui Meng (mengxr)
              Reporter: Xiangrui Meng (mengxr)
              Votes: 2
              Watchers: 11
