As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time spent is in kmeans|| initialization. For example, in this simple example...
Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init takes about as long as 24 iterations, which is a typical run, meaning half the time is just in picking cluster centers. This seems excessive.
There are two ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This is already covered by
SPARK-11560, but just a simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about though.
By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 is plenty, certainly when l=2k as is the case in our implementation. (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 1:41 – much improved over 5:54.
Lastly, small thing, but the code will perform a local k-means++ step to reduce the number of centers to k even if there are already only <= k centers. This can be short-circuited. However this is really the topic of
SPARK-3261 because this can cause fewer than k clusters to be returned where that would actually be correct, too.