[MATH-1465] Increase the initial sampling speed of KMeansPlusPlusClusterer for large k - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.6.1
Fix Version/s: None
Component/s: None
Labels:
- performance

Description

As the major difference of KMeans++ to classic Kmeans, an initial distance square sampling process is executed, the function is named "chooseInitialCenters" in the current implementation. The time complexity is O(dkn), where d is the dimension, k is the cluster number and n is the number of points.

In paper "Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)", the authors introduced a data structure named "coreset tree" which can reduce the complexity of the initial sampling to O(dn log k). This is useful in the scenario when value of k is large, say 1000, then the run speed of the algorithm will be much faster.

Attachments

Issue Links

is related to

MATH-1515 Enhance clustering API

Open

Activity

People

Assignee:: Unassigned

Reporter:: yu zhang

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 10/Aug/18 18:08

Updated:: 07/Mar/20 15:16

Time Tracking

Estimated:

1,440h

Remaining:

1,440h

Logged:

Not Specified