SPARK-3588

Gaussian Mixture Model clustering

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: MLlib, PySpark
    • Labels: None

    Description

      The Gaussian Mixture Model (GMM) is a popular technique for soft clustering. A GMM models the entire data set as a finite mixture of Gaussian distributions, each parameterized by a mean vector µ, a covariance matrix Σ, and a mixture weight π. In this technique, the probability that each point belongs to each cluster is computed along with the cluster statistics.
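
      For reference, the model described above can be written out as follows. This is the standard textbook formulation, not taken from the attached code; the symbols K (number of components), x_i (the i-th data point), and γ_ik (the probability that point i belongs to cluster k) are notation assumed here for illustration.

```latex
% Mixture density: a weighted sum of K Gaussian components
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Soft assignment of point x_i to cluster k
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
```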

      We have come up with an initial distributed implementation of GMM in PySpark, in which the parameters are estimated using the Expectation-Maximization (EM) algorithm. Our current implementation assumes a diagonal covariance matrix for each component.
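
      To make the estimation procedure concrete, here is a minimal single-machine sketch of EM for a GMM with diagonal covariances. It is not the attached GMMSpark.py; all function and variable names are hypothetical, and it uses only the Python standard library. The per-point statistics accumulated in the loop are plain sums, so in a distributed setting they could presumably be computed per partition and then added together, but that mapping onto Spark is an assumption rather than something this sketch demonstrates.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian whose covariance is diagonal (var holds the diagonal)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def e_step(x, weights, means, variances):
    """Return the responsibilities for one point and that point's log-likelihood."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp trick for numerical stability
    log_norm = top + math.log(sum(math.exp(l - top) for l in logs))
    return [math.exp(l - log_norm) for l in logs], log_norm


def em_iteration(points, weights, means, variances, min_var=1e-6):
    """Run one EM iteration; return new parameters and the log-likelihood
    of the data under the parameters that were passed in."""
    k, d = len(weights), len(points[0])
    n_k = [0.0] * k                      # sum of responsibilities per component
    s1 = [[0.0] * d for _ in range(k)]   # responsibility-weighted sum of x
    s2 = [[0.0] * d for _ in range(k)]   # responsibility-weighted sum of x^2
    total_ll = 0.0
    for x in points:                     # E-step: accumulate sufficient statistics
        resp, ll = e_step(x, weights, means, variances)
        total_ll += ll
        for j, r in enumerate(resp):
            n_k[j] += r
            for t, xt in enumerate(x):
                s1[j][t] += r * xt
                s2[j][t] += r * xt * xt
    # M-step: closed-form updates; floors guard against empty components
    n_k = [max(nj, 1e-12) for nj in n_k]
    new_weights = [nj / len(points) for nj in n_k]
    new_means = [[s / nj for s in row] for row, nj in zip(s1, n_k)]
    new_vars = [[max(q / nj - m * m, min_var) for q, m in zip(qrow, mrow)]
                for qrow, mrow, nj in zip(s2, new_means, n_k)]
    return new_weights, new_means, new_vars, total_ll


# Toy data: two well-separated 2-D clusters (synthetic, for illustration only)
random.seed(0)
data = ([[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(200)] +
        [[random.gauss(5.0, 1.0), random.gauss(5.0, 1.0)] for _ in range(200)])

weights = [0.5, 0.5]
means = [[1.0, 1.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
history = []
for _ in range(20):
    weights, means, variances, ll = em_iteration(data, weights, means, variances)
    history.append(ll)
```

      Restricting each component to a diagonal covariance means only a variance per dimension is tracked, so the statistics kept per component grow linearly with the number of dimensions rather than quadratically.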

      We ran an initial benchmark on a 2-node Spark standalone cluster, where each node has 8 cores and 8 GB RAM, using Spark 1.0.0. We also evaluated the Python version of k-means available in Spark on the same datasets.
      Below are the results of this benchmark. The reported figures are averages over 10 runs. Tests were run on multiple datasets with varying numbers of features and instances.

      Dataset                     Gaussian mixture model                             k-means (Python)
      Instances     Dimensions    Avg time per iteration   Time for 100 iterations   Avg time per iteration   Time for 100 iterations
      0.7 million   13            7 s                      12 min                    13 s                     26 min
      1.8 million   11            17 s                     29 min                    33 s                     53 min
      10 million    16            1.6 min                  2.7 hr                    1.2 min                  2 hr

      Attachments

        1. GMMSpark.py
          5 kB
          Meethu Mathew

            People

              Assignee: Meethu Mathew
              Reporter: Meethu Mathew
              Votes: 0
              Watchers: 5
