[MATH-1509] Implement the MiniBatchKMeansClusterer - ASF JIRA

Details

Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

MiniBatchKMeans is a fast clustering algorithm that uses a subset of the points to initialize the cluster centers and a mini-batch of points in each training iteration.
It can cluster millions of data points in a few seconds, and its results differ little from those of KMeans.
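
To make the description concrete, here is a minimal sketch of the mini-batch update loop in plain Java. It is not the code from the pull request: the class name `MiniBatchKMeansSketch`, the method names, and the parameters are all hypothetical, and the initialization is simplified to picking `k` random points (implementations such as sklearn's typically seed the centers from a random subset instead). The per-center learning rate of 1/count follows the usual formulation of the algorithm.

```java
import java.util.Random;

/** Hypothetical illustration class; not part of Apache Commons Math. */
final class MiniBatchKMeansSketch {

    /** Returns k centers fitted with mini-batch updates (squared Euclidean distance). */
    static double[][] fit(double[][] points, int k, int batchSize, int iterations, long seed) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be in [1, points.length]");
        }
        Random rng = new Random(seed);
        int dim = points[0].length;

        // Initialization (simplified assumption): pick k distinct random points
        // as the starting centers via a partial Fisher-Yates shuffle.
        int[] idx = new int[points.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) {
            int swap = j + rng.nextInt(idx.length - j);
            int tmp = idx[j];
            idx[j] = idx[swap];
            idx[swap] = tmp;
            centers[j] = points[idx[j]].clone();
        }

        long[] counts = new long[k];      // points absorbed by each center so far
        int[] batch = new int[batchSize]; // indices of the sampled points
        int[] nearest = new int[batchSize];
        for (int it = 0; it < iterations; it++) {
            // Sample a mini-batch (with replacement) and assign every sampled
            // point to its nearest center before any center is moved.
            for (int b = 0; b < batchSize; b++) {
                batch[b] = rng.nextInt(points.length);
                nearest[b] = nearestCenter(points[batch[b]], centers);
            }
            // Move each center toward its assigned points with step 1 / count.
            for (int b = 0; b < batchSize; b++) {
                int c = nearest[b];
                counts[c]++;
                double eta = 1.0 / counts[c];
                double[] x = points[batch[b]];
                for (int d = 0; d < dim; d++) {
                    centers[c][d] += eta * (x[d] - centers[c][d]);
                }
            }
        }
        return centers;
    }

    /** Index of the center closest to x. */
    static int nearestCenter(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Each iteration touches only `batchSize` points rather than the whole data set, which is where the speed-up over a full KMeans pass comes from. A call might look like `MiniBatchKMeansSketch.fit(points, 8, 100, 200, 42L)`; those values are arbitrary.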

I have implemented it in Kotlin in my own project, and I would like to contribute the code to Apache Commons Math, in Java of course.

My implementation is based on Apache Commons Math3 and follows Python's sklearn.cluster.MiniBatchKMeans.

Through testing, I found that it works well on dense ("intensive") data: performance improves significantly and the results differ little from those of KMeans++. On sparse data, however, the results differ considerably.
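
The issue does not say which metric was used to judge how much the two results differ. One common choice, shown here only as an assumption, is the within-cluster sum of squared distances: compute it for the centers returned by each clusterer on the same points and compare the two numbers. The class `ClusterCostSketch` and its method are hypothetical names for illustration.

```java
/** Hypothetical helper; not part of Apache Commons Math. */
final class ClusterCostSketch {

    /** Sum over all points of the squared distance to the nearest center. */
    static double cost(double[][] points, double[][] centers) {
        double total = 0.0;
        for (double[] x : points) {
            double bestDist = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                double dist = 0.0;
                for (int d = 0; d < x.length; d++) {
                    double diff = x[d] - c[d];
                    dist += diff * diff;
                }
                bestDist = Math.min(bestDist, dist);
            }
            total += bestDist;
        }
        return total;
    }
}
```

A lower cost means tighter clusters, so a small gap between the two costs would indicate that the mini-batch result is close to the KMeans++ one on that data set.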

The attached images compare my implementation with KMeansPlusPlusClusterer.

I have created a pull request at https://github.com/apache/commons-math/pull/117, for reference only.

Attachments

compare.png, 17/Jan/20 03:26, 270 kB, Chen Tao
intensive-data-comparsion.png, 20/Jan/20 02:05, 416 kB, Chen Tao
intensive-data-comparsion-badcase.png, 20/Jan/20 02:06, 396 kB, Chen Tao
random-data-comparison.png, 20/Jan/20 02:03, 894 kB, Chen Tao

Issue Links

Blocked

MATH-1524 "chooseInitialCenters" should move out from KMeansPlusPlusClusterer (Open)

MATH-1525 Make "EmptyClusterStrategy" in KMeansPlusPlusClusterer reusable (Resolved)

is related to

MATH-1515 Enhance clustering API (Open)

links to

GitHub Pull Request #128

GitHub Pull Request #129

GitHub Pull Request #132

People

Assignee: Unassigned

Reporter: Chen Tao

Votes: 0

Watchers: 1

Dates

Created: 17/Jan/20 02:33

Updated: 28/Jul/20 11:45

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1h 40m