Description
In the current implementation, the following lemma is used in KMeans:
Lemma 0: Let x be a point, let c be a center and let o be the origin. Then d(x,c) >= |d(x,o) - d(c,o)| = |norm(x) - norm(c)|.
This can be applied with EuclideanDistance, but not with CosineDistance.
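A minimal numeric sketch of Lemma 0 (the point and center values are made up for illustration):

```python
import math

def euclidean(a, b):
    # plain Euclidean distance between two points
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# hypothetical point x and center c; o is the origin (zero vector)
x = [3.0, 4.0]
c = [6.0, 8.0]
origin = [0.0, 0.0]

norm_x = euclidean(x, origin)       # d(x, o) = norm(x) = 5.0
norm_c = euclidean(c, origin)       # d(c, o) = norm(c) = 10.0

# Lemma 0: d(x,c) >= |norm(x) - norm(c)|, so the norms alone give a
# cheap lower bound that can rule out a center without a full distance
lower_bound = abs(norm_x - norm_c)
actual = euclidean(x, c)
assert actual >= lower_bound
```

Since norms can be precomputed once per point and per center, this bound costs O(1) per (point, center) pair instead of O(dim).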
According to [Using the Triangle Inequality to Accelerate K-Means|https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf], we can go further: there are two more lemmas that can be used:
Lemma 1: Let x be a point, and let b and c be centers. If d(b,c) >= 2 * d(x,b), then d(x,c) >= d(x,b).
This can be applied with EuclideanDistance, but not with CosineDistance.
However, luckily for CosineDistance we can derive a variant in the space of radians/angles.
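A sketch of how Lemma 1 prunes distance computations during assignment (centers, point, and loop structure are illustrative, not the Spark implementation):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# hypothetical centers and point
centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
x = [1.0, 1.0]

# precompute pairwise center-to-center distances once per iteration
pair = [[euclidean(b, c) for c in centers] for b in centers]

best = 0
best_dist = euclidean(x, centers[0])
skipped = 0
for j in range(1, len(centers)):
    # Lemma 1: if d(best, c_j) >= 2 * d(x, best), then
    # d(x, c_j) >= d(x, best), so c_j cannot be closer and the
    # distance computation for c_j is skipped entirely
    if pair[best][j] >= 2.0 * best_dist:
        skipped += 1
        continue
    d = euclidean(x, centers[j])
    if d < best_dist:
        best, best_dist = j, d
```

Here both far-away centers are pruned, so only one full distance is computed for the point.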
Lemma 2: Let x be a point, and let b and c be centers. Then d(x,c) >= max{0, d(x,b) - d(b,c)}.
This can be applied with EuclideanDistance, but not with CosineDistance.
The application of Lemma 2 is a little complex: it needs to cache and update the distances/lower bounds to the previous centers, and thus can only be applied in training, not in prediction.
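A minimal numeric sketch of the Lemma 2 bound itself (values are made up; the caching/updating machinery across iterations is omitted):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x = [0.0, 0.0]   # a point
b = [5.0, 0.0]   # a center whose distance to x is already known
c = [6.0, 0.0]   # another center

d_xb = euclidean(x, b)           # 5.0, assumed cached from earlier work
d_bc = euclidean(b, c)           # 1.0, a center-to-center distance

# Lemma 2: a lower bound on d(x, c) without ever touching x and c together;
# if this bound already exceeds the current best distance, c is skipped
lower = max(0.0, d_xb - d_bc)
actual = euclidean(x, c)
assert actual >= lower
```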
So this ticket is mainly for Lemma 1. Its idea is quite simple: if point x is close enough to center b (closer than a precomputed radius, half the distance from b to its nearest other center), then we can say point x belongs to center b without computing the distances between x and the other centers. It can be used in both training and prediction.
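The radius-based shortcut described above can be sketched as follows (the `radius` precomputation and `predict` helper are hypothetical names, not Spark's API):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

centers = [[0.0, 0.0], [10.0, 0.0], [0.0, 12.0]]

# For each center b, precompute half the distance to its nearest other
# center: by Lemma 1, if d(x, b) is below this radius, no other center
# can be closer to x, so the remaining distances need not be computed.
radius = [min(euclidean(b, c) for c in centers if c is not b) / 2.0
          for b in centers]

def predict(x):
    best = 0
    best_dist = euclidean(x, centers[0])
    if best_dist < radius[best]:      # early exit: provably closest
        return best
    for j in range(1, len(centers)):
        d = euclidean(x, centers[j])
        if d < best_dist:
            best, best_dist = j, d
        if best_dist < radius[best]:  # early exit once a radius is met
            return best
    return best
```

Because the radii depend only on the (fixed) centers, the same precomputation works in both training iterations and prediction.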
Attachments
Issue Links
 relates to

SPARK-36553 KMeans fails with NegativeArraySizeException for K = 50000 after issue #27758 was introduced
 Resolved