SPARK-17389: KMeans speedup with better choice of k-means|| init steps = 2

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: MLlib
    • Labels: None

      Description

      As reported in http://stackoverflow.com/questions/39260820/is-sparks-kmeans-unable-to-handle-bigdata#39260820, KMeans can be surprisingly slow, and most of the time is spent in k-means|| initialization. Consider this simple example:

      import org.apache.spark.mllib.random.RandomRDDs
      import org.apache.spark.mllib.clustering.KMeans

      // 1 million random 64-dimensional points, partitioned across the default parallelism
      val data = RandomRDDs.uniformVectorRDD(sc, 1000000, 64, sc.defaultParallelism).cache()
      // Materialize the cache so clustering timings exclude data generation
      data.count()

      // k = 1000; with maxIterations = 5, initialization dominates the runtime
      new KMeans().setK(1000).setMaxIterations(5).run(data)

      On my laptop, init takes 5:54 and each iteration takes about 0:15. Init therefore costs about as much as 24 iterations (354s / 15s ≈ 24), which is a typical run length, meaning half the total time goes just to picking cluster centers. This seems excessive.

      There are two ways to speed this up significantly. First, the implementation has an old "runs" parameter that is always 1 now. It used to allow multiple clusterings to be computed at once. The code can be simplified significantly now that runs=1 always. This is already covered by SPARK-11560, but just a simple refactoring results in about a 13% init speedup, from 5:54 to 5:09 in this example. That's not what this change is about though.

      By default, k-means|| makes 5 passes over the data. The original paper at http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf actually shows that 2 is plenty, certainly when l=2k as is the case in our implementation. (See Figure 5.2/5.3; I believe the default of 5 was taken from Table 6 but it's not suggesting 5 is an optimal value.) Defaulting to 2 brings it down to 1:41 – much improved over 5:54.
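      Note that the number of passes is already configurable via setInitializationSteps, so the same speedup is available today without waiting for the default to change. For example, on the data above (timings will of course vary by machine):

      // Request 2 k-means|| passes explicitly; the default is 5 in 2.0.0.
      // This is the setting that produced the 1:41 init time described above.
      new KMeans()
        .setK(1000)
        .setMaxIterations(5)
        .setInitializationSteps(2)
        .run(data)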

      Lastly, a small thing, but the code will perform a local k-means++ step to reduce the number of candidate centers to k even if there are already only <= k centers. This can be short-circuited, as in the sketch below. However, this is really the topic of SPARK-3261, because it can also cause fewer than k clusters to be returned in cases where that would actually be correct.
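      A minimal illustration of that short-circuit (the names here are hypothetical stand-ins, not Spark's internal API):

      // Hypothetical sketch: only run the costly local k-means++ reduction
      // when k-means|| produced more than k distinct candidate centers.
      // finalizeCenters and localKMeansPlusPlus are illustrative names only.
      def finalizeCenters[T](candidates: Seq[T], k: Int)
                            (localKMeansPlusPlus: Seq[T] => Seq[T]): Seq[T] =
        if (candidates.size <= k) candidates  // already <= k centers: nothing to reduce
        else localKMeansPlusPlus(candidates)  // otherwise condense down to k centers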


          Activity

          apachespark Apache Spark added a comment -

          User 'srowen' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14948

          srowen Sean Owen added a comment -

          Oh, this overlaps with https://issues.apache.org/jira/browse/SPARK-11560. This may become just about the init steps change instead.

          apachespark Apache Spark added a comment -

          User 'srowen' has created a pull request for this issue:
          https://github.com/apache/spark/pull/14956

          srowen Sean Owen added a comment -

          Issue resolved by pull request 14956
          https://github.com/apache/spark/pull/14956

          apachespark Apache Spark added a comment -

          User 'yanboliang' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15050


            People

            • Assignee:
              srowen Sean Owen
            • Reporter:
              srowen Sean Owen
            • Votes: 0
            • Watchers: 1

