[SPARK-17389] KMeans speedup with better choice of k-means|| init steps = 2 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.1.0
Component/s: MLlib
Labels:
None

Target Version/s:

2.1.0

Description

As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time spent is in kmeans|| initialization. For example, in this simple example...

import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.clustering.KMeans

val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
data.count()

new KMeans().setK(1000).setMaxIterations(5).run(data)

Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init takes about as long as 24 iterations, which is a typical run, meaning half the time is just in picking cluster centers. This seems excessive.

There are two ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This is already covered by ~~SPARK-11560~~, but just a simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about though.

By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 is plenty, certainly when l=2k as is the case in our implementation. (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 1:41 – much improved over 5:54.

Lastly, small thing, but the code will perform a local k-means++ step to reduce the number of centers to k even if there are already only <= k centers. This can be short-circuited. However this is really the topic of ~~SPARK-3261~~ because this can cause fewer than k clusters to be returned where that would actually be correct, too.

Attachments

Issue Links

relates to

SPARK-3261 KMeans clusterer can return duplicate cluster centers

Resolved

SPARK-11560 Optimize KMeans implementation / remove 'runs' from implementation

Resolved

links to

[Github] Pull Request #14948 (srowen)

[Github] Pull Request #14956 (srowen)

[Github] Pull Request #15050 (yanboliang)

Activity

People

Assignee:: Sean R. Owen

Reporter:: Sean R. Owen

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 03/Sep/16 08:17

Updated:: 11/Sep/16 10:15

Resolved:: 11/Sep/16 07:01