Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Component/s: ML

    Description

      The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

      A comparison of KMeans and MiniBatchKMeans in scikit-learn: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

      Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I propose adding it as a new estimator.
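
      As a rough illustration of the idea only (not the code in the PR; the helper name miniBatchStep is made up), one mini-batch iteration on an RDD could look like this:

      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      // Sample a fraction of the data, assign only the sampled points to their closest
      // center, and recompute each center from the points assigned to it in the sample.
      def miniBatchStep(
          data: RDD[Vector],
          centers: Array[Vector],
          fraction: Double,
          seed: Long): Array[Vector] = {
        val batch = data.sample(withReplacement = false, fraction, seed)
        val updated = batch
          .map { p =>
            val best = centers.indices.minBy(i => Vectors.sqdist(p, centers(i)))
            (best, (p.toArray, 1L))
          }
          .reduceByKey { case ((s1, c1), (s2, c2)) =>
            (s1.zip(s2).map { case (a, b) => a + b }, c1 + c2)
          }
          .mapValues { case (sum, count) => Vectors.dense(sum.map(_ / count)) }
          .collectAsMap()
        // Centers that received no sampled points keep their previous value.
        centers.indices.map(i => updated.getOrElse(i, centers(i))).toArray
      }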

      Attachments

        1. MBKM.xlsx
          33 kB
          Ruifeng Zheng

        Issue Links

          Activity

            apachespark Apache Spark added a comment -

            User 'zhengruifeng' has created a pull request for this issue:
            https://github.com/apache/spark/pull/11974

            podongfeng Ruifeng Zheng added a comment -

            There is another sklearn example for MiniBatch KMeans: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
            josephkb Joseph K. Bradley added a comment - - edited

            Quick note: Do you have a use case which requires faster KMeans? And how does it compare when you run BisectingKMeans?

            It would be good to have more info to assess the priority of this change. (I have not heard of too many issues with KMeans yet.)

            Thanks!

            podongfeng Ruifeng Zheng added a comment -

            Faster KMeans is required in many cases:
            high dimensionality (> 1,000,000)
            large K (> 1000)
            large datasets (unlabeled data is much easier to obtain than labeled data, and the dataset is usually much larger than in supervised learning tasks)

            In practice, I have encountered a time-consuming case: I needed to cluster about one billion points with 300,000 dimensions into 1,000 centers. Using MLlib's KMeans on a cluster of 16 servers, it took about one week to finish.

            I will compare it with BisectingKMeans on speed and cost.

            podongfeng Ruifeng Zheng added a comment -

            I have tested BisectingKMeans on the same dataset.

            val bkm = new BisectingKMeans().setK(10).setMaxIterations(10)
            val bkmModel = bkm.run(vecs)
            bkmModel.computeCost(vecs)
            res3: Double = 3.4435759325644255E8

            Training takes about 9050 sec.

            In summary (three tests were done, and the results are similar):

            Algorithm                           Training time    Cost
            KMeans                              2876 sec         3.317029898599564E8
            BisectingKMeans                     9050 sec         3.4435759325644255E8
            MiniBatch KMeans (fraction=0.1)     263 sec          3.3169865959604424E8
            MiniBatch KMeans (fraction=0.01)    90 sec           3.317195831216454E8

            rnowling R J Nowling added a comment -

            This is a dupe of SPARK-2308 but that needs someone to take it over.

            podongfeng Ruifeng Zheng added a comment - - edited

            In addition, mini-batch EM can also be used for mixture models; see part 6 (EM variants) in http://www.cs.ubc.ca/%7Emurphyk/Teaching/CS340-Fall06/reading/mixtureModels.pdf

            podongfeng Ruifeng Zheng added a comment -

            srowen the run changes have already been merged, so can we continue working on this?

            srowen Sean R. Owen added a comment -

            See the comments on SPARK-2308, in particular about sampling not being efficient. There is some prior work too. I think the issues can be overcome and I understand the idea here. If it's a simple extra option in the API, it seems pretty plausible. You're just running k-means on a random sample of the data every iteration right?

            rnowling R J Nowling added a comment - - edited

            I did the initial implementation for SPARK-2308. Re: the random sampling. With Spark's approach to random sampling, a Bernoulli trial is performed for each data point in the RDD. It's not as efficient as the case where random-access indexing is available. That said, if your vectors are quite long, then you save computational time on evaluating distances and such. Thus, when evaluating the performance, don't just look at the case of a large number of vectors – look at the case of vectors with many elements.

            josephkb Joseph K. Bradley added a comment -

            I'm fine with improving KMeans, but I'm still not convinced about the priority. My main issue with the above use cases is that it doesn't make much sense to me to apply KMeans to high-dimensional data. Basic k-means clustering just doesn't perform well in high dimensions AFAIK (and is "provably" useless if you analyze it under reasonable statistical assumptions). I'm sure there are practical applications in which it's still useful, but are there no good alternatives, such as doing dimensionality reduction before clustering?

            mlnick Nicholas Pentreath added a comment -

            The actual fix in the PR is pretty small - essentially just adding an rdd.sample call (similar to the old mllib gradient descent impl). So if we can see some good speed improvements on a relatively large class of input datasets, this seems like an easy win. From the performance tests above it seems like there's a significant win even for low-dimensional vectors. For higher dimensions the improvement may be as large or perhaps larger.

            podongfeng it may be best to add a few different cases to the performance tests to illustrate the behavior for different cases (and if not for certain cases, we should document that):

            1. small dimension, dense
            2. high dimension, dense
            3. small dimension, sparse
            4. high dimension, sparse

            rnowling do you have time to check out the PR here? It seems similar in spirit to what you had done and just uses the built-in RDD sampling (which I think derrickburns mentioned in SPARK-2308).

            chrisfalter Christopher Falter added a comment -

            josephkb rnowling I am working on a use case right now that would benefit greatly from mini-batch k-means: clustering documents. The data can be high-dimensional (large vectors of string-id:normalized-tf-idf scores). Dimensionality reduction techniques tend not to work very well in this scenario. And the document corpus can often be extremely large.

            I haven't looked at the code in the pull request yet, but it would be really useful if the API included a way to pass a user-defined distance function to replace the default (presumably Euclidean distance). Cosine similarity/dissimilarity is more useful than Euclidean distance in many use cases, and something like Levenshtein distance might be useful for genomics use cases...

            Thanks!
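
            For illustration, a cosine distance of the kind being requested here could be written as a plain function over feature vectors (a sketch only, not an existing Spark API; it densifies sparse vectors):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // Cosine distance = 1 - cosine similarity; assumes neither vector is all-zero.
            def cosineDistance(a: Vector, b: Vector): Double = {
              val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
              1.0 - dot / (Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0))
            }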

            mlnick Nicholas Pentreath added a comment -

            podongfeng did you manage to look into some performance experiments as per the above comment?

            podongfeng Ruifeng Zheng added a comment - - edited

            mlnick I ran some performance experiments last week. However, I got a strange result: the mini-batch process only yields a very small speedup, which is quite different from the earlier tests about 14 months ago.

            Env: Spark 2.1.0, --driver-memory 16G --executor-memory 1G --num-executors 16
            Low Dimensionality & Dense
            
            val random = new Random(123)
            val n = 1000000000
            val dim = 10
            
            val rdd = sc.parallelize(1 to n).map(i => Vectors.dense(Array.fill(dim)(random.nextDouble()))).persist()
            
            MiniBatchFraction    Duration per iter    Cost after 3 iters
            1                    125471               7.807840581259936E8
            0.1                  99093                7.807848215462652E8
            0.01                 95800                7.807861719437395E8

            The high-dimension, dense tests produced similar results.

            I'm going to study it further.
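
            For the "high dimension, sparse" case in the test matrix above, the input could be generated along these lines (a sketch only; the dim and nnz values are illustrative, not taken from the attached tests):

            import scala.util.Random
            import org.apache.spark.mllib.linalg.Vectors

            val n = 10000000
            val dim = 100000   // vector size
            val nnz = 100      // non-zeros per vector

            val sparseRdd = sc.parallelize(1 to n).map { i =>
              val rng = new Random(i)   // per-record seed, deterministic across runs
              val indices = Array.fill(nnz)(rng.nextInt(dim)).distinct.sorted
              val values = Array.fill(indices.length)(rng.nextDouble())
              Vectors.sparse(dim, indices, values)
            }.persist()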

            podongfeng Ruifeng Zheng added a comment -

            chrisfalter Agreed that it would be better to support KMeans with other distance/similarity measures: cosine similarity is more suitable for document clustering than Euclidean distance. And I think you should supply two functions: one for assigning points, the other for updating the means. However, it may be better to discuss that in another ticket.
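
            A rough sketch of that two-function shape (hypothetical names, not an existing Spark API):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // One function assigns a point to its closest center, the other recomputes
            // a center from the points assigned to it.
            trait ClusteringDistance extends Serializable {
              def distance(point: Vector, center: Vector): Double
              def updateCenter(points: Iterable[Vector]): Vector
            }

            object SquaredEuclidean extends ClusteringDistance {
              override def distance(point: Vector, center: Vector): Double =
                Vectors.sqdist(point, center)

              // For squared Euclidean distance the optimal center is simply the mean.
              override def updateCenter(points: Iterable[Vector]): Vector = {
                val sum = new Array[Double](points.head.size)
                points.foreach(p => p.foreachActive((i, v) => sum(i) += v))
                Vectors.dense(sum.map(_ / points.size))
              }
            }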

            podongfeng Ruifeng Zheng added a comment -

            mlnick I have attached the latest test results.
            I found that when K=2 the acceleration is not significant, since the share of distance computation is small.
            So I set K=100 in the latest tests, and the results show a good speedup.

            mlnick Nicholas Pentreath added a comment -

            It makes sense. However, I think k=100 is perhaps less common than, say, the range k=5 to k=10 or k=20. Of course it depends on the dimensionality and the problem domain. Would it be possible to update the performance numbers with k=5, k=10, k=20?

            podongfeng Ruifeng Zheng added a comment -

            mlnick OK, I have updated the attached performance tests. In general, a larger computation cost in KMeans.findClosest (larger K, higher dimension, more non-zeros) results in a bigger improvement.

            mlnick Nicholas Pentreath added a comment -

            Yes, as one would expect. But the key result here is that there is still a very good speedup even for relatively low values of k (5, 10, 20, etc.) and low dimension (10s-1000s). Results look good to me, thanks for putting that together! I will take a pass through the PR ASAP.

            podongfeng Ruifeng Zheng added a comment -

            mlnick sethah I am sorry to say that I think the current impl may be incorrect when compared with scikit-learn's impl. So I think you don't need to continue reviewing the PR.

            MiniBatch KMeans in scikit-learn is implemented according to this paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
            and the centers are updated in an online style, which is quite different from this PR as it stands.
            So the current PR lacks theoretical justification, although this impl seems to work well in some cases.

            I will still work on this issue, and I think there are three options:
            1. strictly follow the paper, and sample the mini-batch back to the driver, like https://github.com/apache/spark/pull/1248/files#diff-f1bafa6656211b6909dbccdb443a3003R162
            2. still follow the paper; the sequential update seems parallelizable via foldByKey, with parallelism level k.
            3. follow the update method in the paper, but use the gradient of the mini-batch to update the centers; in this way, we modify the algorithm.

            If I obtain some test results, I will post them here.
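
            For reference, the online update in that paper pulls each sampled point's closest center toward the point with a per-center learning rate; a minimal sketch (illustrative only, not the scikit-learn or PR code):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // Process one mini-batch: counts(i) is the number of points assigned to center i so far.
            def onlineUpdate(batch: Array[Vector], centers: Array[Array[Double]], counts: Array[Long]): Unit = {
              batch.foreach { p =>
                val best = centers.indices.minBy(i => Vectors.sqdist(p, Vectors.dense(centers(i))))
                counts(best) += 1
                val eta = 1.0 / counts(best)      // per-center learning rate
                val c = centers(best)
                var j = 0
                while (j < c.length) {            // c <- (1 - eta) * c + eta * x
                  c(j) = (1.0 - eta) * c(j) + eta * p(j)
                  j += 1
                }
              }
            }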

            apachespark Apache Spark added a comment -

            User 'zhengruifeng' has created a pull request for this issue:
            https://github.com/apache/spark/pull/18389

            podongfeng Ruifeng Zheng added a comment -

            mlnick I have sent a new PR for MiniBatch KMeans, and the corresponding test results are attached.

            podongfeng Ruifeng Zheng added a comment -

            mlnick mengxr josephkb Mini-Batch KMeans is much faster than KMeans; do you have any plan to include it in MLlib? Thanks


            People

              Assignee: Unassigned
              Reporter: Ruifeng Zheng (podongfeng)
              Votes: 2
              Watchers: 7
