Details
- Bug
- Status: Resolved
- Minor
- Resolution: Auto Closed
- 1.0.2
- None
Description
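The KMeans parallel (k-means||) implementation selects points to be cluster centers with probability weighted by their distance to the cluster centers already chosen. However, if there are fewer than k DISTINCT points in the data set, this approach fails: once every distinct point has been chosen as a center, all remaining points have zero weight, so no further centers can be selected.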
Further, the recent check-in that works around this problem results in the same point being selected repeatedly as a cluster center.
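To illustrate the failure mode, here is a minimal sketch in plain Python, not the Spark code itself. It assumes the weight of a point is its squared Euclidean distance to the nearest chosen center (the usual k-means++ style weighting); the helper names `squared_distance` and `selection_weights` are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def selection_weights(points, centers):
    """Weight of each point: squared distance to its nearest chosen center.

    Assumption: this mirrors distance-weighted seeding in general,
    not the exact weighting used in the Spark implementation.
    """
    return [min(squared_distance(p, c) for c in centers) for p in points]


# Four points, but only two DISTINCT values.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]

# Both distinct points have already been chosen as centers.
centers = [(0.0, 0.0), (1.0, 1.0)]

weights = selection_weights(points, centers)
total = sum(weights)

# Every weight is zero, so there is no valid distribution
# from which to draw a third center.
print(weights, total)
```

With k = 3 the sampler has nothing left to draw from, which is the situation this issue describes.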
The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation.
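A minimal sketch of the idea, assuming a simplified sequential seeding (k-means++ style) rather than the parallel k-means|| procedure in Spark. The function name `choose_centers` and its parameters are hypothetical and only illustrate stopping early when no distinct point remains.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def choose_centers(points, k, rng):
    """Pick up to k distinct centers by distance-weighted sampling.

    Returns fewer than k centers when the data holds fewer than
    k distinct points, instead of failing or repeating a point.
    """
    if not points or k <= 0:
        return []
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by squared distance to its nearest center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0.0:
            # Every point coincides with a chosen center: stop early.
            break
        # Zero-weight points (duplicates of centers) cannot be drawn.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


data = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0)]
centers = choose_centers(data, 5, random.Random(0))
print(len(centers))  # 2: only two distinct points exist
```

Callers then have to work with `len(centers)` rather than `k`, which is why the change touches several places in the implementation.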
I have a version of the code that addresses this problem and also generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it.
Issue Links
- relates to
  - SPARK-1215 Clustering: Index out of bounds error (Resolved)
  - SPARK-2355 Check for the number of clusters to avoid ArrayIndexOutOfBoundsException (Resolved)