[SPARK-3218] K-Means clusterer can fail on degenerate data - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Auto Closed
Affects Version/s: 1.0.2
Fix Version/s: None
Component/s: MLlib
Labels:
- bulk-closed
- clustering

Description

The KMeans parallel implementation selects points to be cluster centers with probability weighted by their distance to the cluster centers already selected. However, if the data set contains fewer than k distinct points, this approach will fail.
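To illustrate the degenerate case, here is a minimal Python sketch, not the MLlib code. It assumes the sampling weight is the squared distance to the nearest chosen center (as in k-means++ style seeding); `squared_distance` and `seeding_weights` are hypothetical helper names. Once every distinct point is a center, all weights are zero and there is nothing left to sample from.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seeding_weights(points, centers):
    """Assumed weighting: squared distance from each point to its nearest center."""
    return [min(squared_distance(p, c) for c in centers) for p in points]


# Four points, but only two distinct values.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]

# Both distinct points have already been selected as centers.
centers = [(0.0, 0.0), (1.0, 1.0)]

weights = seeding_weights(points, centers)
total = sum(weights)

# Every weight is zero, so a third center cannot be drawn by weighted sampling.
print(weights, total)
```

With k = 3 and this data, any code that assumes a third center can always be drawn has no valid candidate.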

Further, the recent check-in that works around this problem results in the same point being selected repeatedly as a cluster center.

The fix is to allow fewer than k cluster centers to be selected. This requires several changes to the code, as the number of cluster centers is woven into the implementation.
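A rough Python sketch of that idea follows. It is not the code from the pull requests; `choose_centers` is a hypothetical function, and squared-distance weighting is an assumption. Seeding stops early when no point is left at a nonzero distance from the chosen centers, so the result may hold fewer than k centers.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def choose_centers(points, k, rng):
    """Pick up to k centers; return fewer if the data has fewer distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Assumed weighting: squared distance to the nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0.0:
            # Every point coincides with a chosen center: stop instead of
            # failing or selecting a duplicate.
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
centers = choose_centers(points, 5, random.Random(0))
print(len(centers), sorted(centers))
```

Callers then have to treat the number of centers as a property of the result rather than as the requested k, which is why the change touches several parts of an implementation.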

I have a version of the code that addresses this problem, AND generalizes the distance metric. However, I see that there are literally hundreds of outstanding pull requests. If someone will commit to working with me to sponsor the pull request, I will create it.

Issue Links

relates to
- SPARK-1215 Clustering: Index out of bounds error (Resolved)
- SPARK-2355 Check for the number of clusters to avoid ArrayIndexOutOfBoundsException (Resolved)

links to
- [GitHub] Pull Request #2419 (derrickburns)
- [GitHub] Pull Request #2634 (derrickburns)

People

Assignee: Derrick Burns
Reporter: Derrick Burns
Votes: 0
Watchers: 5

Dates

Created: 26/Aug/14 01:25
Updated: 06/Jun/19 13:57
Resolved: 06/Jun/19 13:57