  1. Spark
  2. SPARK-3218

K-Means clusterer can fail on degenerate data

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Auto Closed
    • Affects Version/s: 1.0.2
    • Fix Version/s: None
    • Component/s: MLlib

    Description

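      The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to the cluster centers already chosen. However, if there are fewer than k DISTINCT points in the data set, this approach will fail: once every distinct point has been chosen, every remaining point is at distance zero from some center, so there is nothing left to sample.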
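
      The failure mode can be illustrated with a small sketch. This is not Spark's code: the helper names `squared_distance` and `seeding_weights` are hypothetical, and squared Euclidean distance is assumed as the weight, as in k-means++ style seeding.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seeding_weights(points, centers):
    """Weight of each point: squared distance to its nearest chosen center."""
    return [min(squared_distance(p, c) for c in centers) for p in points]


# Four points but only two distinct values; suppose k = 3 was requested.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
centers = [(0.0, 0.0), (1.0, 1.0)]  # both distinct points already chosen

weights = seeding_weights(points, centers)
total = sum(weights)
# Every point coincides with a center, so every weight is zero and the
# sampling distribution weights[i] / total is undefined.
```

      With `total` equal to zero, a third center cannot be drawn from the weighted distribution.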

      Further, the recent check-in that works around this problem results in the same point being selected repeatedly as a cluster center.

      The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation.
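
      A minimal sketch of the proposed behavior, assuming squared Euclidean distance and sequential seeding rather than Spark's distributed implementation. The function `choose_centers` and its parameters are hypothetical and do not come from the patch mentioned in this issue.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def choose_centers(points, k, seed=0):
    """Pick up to k distinct centers; return fewer if the data runs out."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the nearest center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(weights) == 0.0:
            # All points coincide with a chosen center: stop early
            # instead of failing or re-selecting an existing center.
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Only two distinct points exist, so asking for k = 3 yields two centers.
data = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
result = choose_centers(data, k=3)
```

      Callers of such a function would need to read the number of centers from the returned list instead of assuming it equals k, which is the kind of change the paragraph above refers to.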

      I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it.

            People

              derrickburns Derrick Burns
              Votes: 0
              Watchers: 5