Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-17389

KMeans speedup with better choice of k-means|| init steps = 2



    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 2.0.0
    • 2.1.0
    • MLlib
    • None


      As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820 KMeans can be surprisingly slow, and it's easy to see that most of the time spent is in kmeans|| initialization. For example, in this simple example...

      import org.apache.spark.mllib.random.RandomRDDs
      import org.apache.spark.mllib.clustering.KMeans
      val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
      new KMeans().setK(1000).setMaxIterations(5).run(data)

      Init takes 5:54, and iterations take about 0:15 each, on my laptop. Init takes about as long as 24 iterations, which is a typical run, meaning half the time is just in picking cluster centers. This seems excessive.

      There are two ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This is already covered by SPARK-11560, but just a simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about though.

      By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 is plenty, certainly when l=2k as is the case in our implementation. (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 1:41 – much improved over 5:54.

      Lastly, small thing, but the code will perform a local k-means++ step to reduce the number of centers to k even if there are already only <= k centers. This can be short-circuited. However this is really the topic of SPARK-3261 because this can cause fewer than k clusters to be returned where that would actually be correct, too.


        Issue Links



              srowen Sean R. Owen
              srowen Sean R. Owen
              0 Vote for this issue
              1 Start watching this issue