Spark / SPARK-3261

KMeans clusterer can return duplicate cluster centers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.0.2
    • Fix Version/s: 2.1.0
    • Component/s: MLlib
    • Labels: None

      Description

      Returning duplicate cluster centers is a bad design choice. I think it is preferable to produce no duplicate cluster centers, so instead of forcing the number of clusters to be exactly K, return at most K distinct clusters.
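      A minimal sketch of the proposed behavior, assuming the RDD-based MLlib API in a spark-shell session (sc is the SparkContext); the data and k values here are illustrative only:

        import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors

        // Four distinct points but k = 10: the model cannot contain 10
        // distinct centers, so deduplicate instead of returning duplicates.
        val data = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
        val model = KMeans.train(data, 10, 20) // k = 10, maxIterations = 20
        // Compare centers by value to count the distinct ones.
        val distinct = model.clusterCenters.map(_.toArray.toSeq).distinct
        println(s"requested 10, got ${distinct.length} distinct centers")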


          Activity

          srowen Sean Owen added a comment -

          Issue resolved by pull request 15450
          https://github.com/apache/spark/pull/15450

          apachespark Apache Spark added a comment -

          User 'srowen' has created a pull request for this issue:
          https://github.com/apache/spark/pull/15450

          derrickburns Derrick Burns added a comment -

          One solution is to run KMeansParallel or KMeansRandom after each Lloyd's round to "replenish" empty clusters.

          I have implemented the former in https://github.com/derrickburns/generalized-kmeans-clustering.

          Performance is reasonable.
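          A hypothetical sketch of that replenishment step (reseedEmptyClusters is an illustrative name, not a Spark API): after a Lloyd's round, any center whose cluster received no points is replaced by a fresh point sampled from the data.

            import org.apache.spark.mllib.linalg.Vector
            import org.apache.spark.rdd.RDD

            // Illustrative only: reseed centers of empty clusters from a
            // sample of the data (k-means|| or plain random sampling could
            // supply the replacements instead).
            def reseedEmptyClusters(
                data: RDD[Vector],
                centers: Array[Vector],
                counts: Array[Long]): Array[Vector] = {
              val empty = counts.indices.filter(counts(_) == 0L)
              if (empty.isEmpty) centers
              else {
                // Sample without replacement so replacements are distinct points.
                val fresh = data.takeSample(withReplacement = false, empty.size, seed = 42L)
                val updated = centers.clone()
                empty.zip(fresh).foreach { case (i, v) => updated(i) = v }
                updated
              }
            }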

          Inspection reveals that the slow part of the KMeansParallel computation is summing the weights of the points in each cluster.

          However, this cost can be reduced by sampling the points and summing the contributions of only the sampled points. For large data sets, this approximation is appropriate.
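          A hypothetical sketch of that sampling approximation (approxClusterWeights and its parameters are illustrative, not Spark APIs): estimate each cluster's total weight from a uniform sample and rescale by the sampling fraction.

            import org.apache.spark.mllib.linalg.Vector
            import org.apache.spark.rdd.RDD

            // Illustrative only: approximate per-cluster weight sums from a
            // uniform sample instead of a full pass over the data.
            def approxClusterWeights(
                points: RDD[(Vector, Double)], // (point, weight) pairs
                assign: Vector => Int,         // maps a point to its nearest center
                fraction: Double): Map[Int, Double] = {
              points.sample(withReplacement = false, fraction)
                .map { case (p, w) => (assign(p), w) }
                .reduceByKey(_ + _)
                .mapValues(_ / fraction) // rescale sampled sums to full-data estimates
                .collect()
                .toMap
            }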

          derrickburns Derrick Burns added a comment -

          Another possible source of duplicate cluster centers is the random initialization algorithm, which samples with replacement. It needs to sample without replacement.
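          A minimal sketch of that fix, assuming the standard RDD takeSample API (initRandomDistinct is an illustrative name): drawing the k seeds without replacement guarantees k distinct points, provided the input itself contains no duplicate points.

            import org.apache.spark.mllib.linalg.Vector
            import org.apache.spark.rdd.RDD

            // Sampling with replacement can pick the same point twice and seed
            // two identical centers; sampling without replacement cannot.
            def initRandomDistinct(data: RDD[Vector], k: Int, seed: Long): Array[Vector] =
              data.takeSample(withReplacement = false, k, seed)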

          apachespark Apache Spark added a comment -

          User 'derrickburns' has created a pull request for this issue:
          https://github.com/apache/spark/pull/2634

          apachespark Apache Spark added a comment -

          User 'derrickburns' has created a pull request for this issue:
          https://github.com/apache/spark/pull/2419

          derrickburns Derrick Burns added a comment -

          This choice also adversely affects performance. I just ran clustering on 1.3M points, asking for 10,000 clusters; the run produced only 1019 unique cluster centers. The original algorithm ran for 4.5 hours, while the algorithm that does not allow duplicate cluster centers completed in 45 minutes, a 6x speedup on this dataset.


            People

            • Assignee:
              srowen Sean Owen
            • Reporter:
              derrickburns Derrick Burns
            • Votes:
              0
            • Watchers:
              6

              Dates

              • Created:
              • Updated:
              • Resolved:
