Description
The KMeansPlusPlus algorithm is implemented in time O( m k^2), where m is the rounds of the KMeansParallel algorithm and k is the number of clusters.
This can be dramatically improved by maintaining the distance the closest cluster center from round to round and then incrementally updating that value for each point. This incremental update is O(1) time, this reduces the running time for K Means Plus Plus to O( m k ). For large k, this is significant.
Attachments
Issue Links
- is related to
-
SPARK-2138 The KMeans algorithm in the MLlib can lead to the Serialized Task size become bigger and bigger
- Closed
- links to