SPARK-17847

Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: ML, MLlib
    • Labels: None

    Description

      Copy the GaussianMixture implementation from mllib to ml so that we can add new features to it.
      Unlike some other algorithms, where the mllib version was changed to wrap the ml implementation, I left the mllib GaussianMixture untouched, for the following reasons:

      • mllib GaussianMixture allows k == 1, but ml does not.
      • mllib GaussianMixture supports setting an initial model, but ml currently does not. (We will definitely add this feature to ml in the future.)

      This task also brings a big performance improvement for GaussianMixture. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we only need to store the upper triangular part of the matrix, which greatly reduces the shuffled data size.
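
The idea of storing only the upper triangle can be illustrated with a minimal sketch in plain Python. This is not the Spark code: `pack_upper` and `unpack_upper` are hypothetical helpers, and the column-by-column ordering of the packed values is an assumption chosen for illustration. For a d x d symmetric matrix, the packed form holds d(d+1)/2 values instead of d².

```python
def pack_upper(matrix):
    """Flatten the upper triangle (diagonal included), column by column.

    Hypothetical helper for illustration; the ordering is an assumption.
    """
    d = len(matrix)
    return [matrix[i][j] for j in range(d) for i in range(j + 1)]


def unpack_upper(packed, d):
    """Rebuild the full symmetric d x d matrix from its packed upper triangle."""
    full = [[0.0] * d for _ in range(d)]
    idx = 0
    for j in range(d):
        for i in range(j + 1):
            # Mirror each stored entry across the diagonal.
            full[i][j] = packed[idx]
            full[j][i] = packed[idx]
            idx += 1
    return full


# Example 3 x 3 symmetric covariance matrix (made-up values).
cov = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]

packed = pack_upper(cov)          # 6 values instead of 9
restored = unpack_upper(packed, 3)
```

In this sketch only the packed list would need to be sent between workers; the full matrix is rebuilt on the receiving side. The saving approaches one half of the covariance entries as the number of features grows.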

          People

            Assignee: Yanbo Liang (yanboliang)
            Reporter: Yanbo Liang (yanboliang)
            Votes: 0
            Watchers: 3
