Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Component/s: ML

    Description

      The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

      A comparison of KMeans and MiniBatchKMeans in scikit-learn: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

      Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I propose adding it as a new estimator.
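
      As a rough illustration of the idea only (not the code in the PR; the helper name miniBatchStep is made up), one mini-batch iteration on an RDD could look like this:

      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      // Sample a fraction of the data, assign only the sampled points to their closest
      // center, and recompute each center from the points assigned to it in the sample.
      def miniBatchStep(
          data: RDD[Vector],
          centers: Array[Vector],
          fraction: Double,
          seed: Long): Array[Vector] = {
        val batch = data.sample(withReplacement = false, fraction, seed)
        val updated = batch
          .map { p =>
            val best = centers.indices.minBy(i => Vectors.sqdist(p, centers(i)))
            (best, (p.toArray, 1L))
          }
          .reduceByKey { case ((s1, c1), (s2, c2)) =>
            (s1.zip(s2).map { case (a, b) => a + b }, c1 + c2)
          }
          .mapValues { case (sum, count) => Vectors.dense(sum.map(_ / count)) }
          .collectAsMap()
        // Centers that received no sampled points keep their previous value.
        centers.indices.map(i => updated.getOrElse(i, centers(i))).toArray
      }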

      Attachments

        1. MBKM.xlsx
          33 kB
          Ruifeng Zheng

        Issue Links

          Activity

            apachespark Apache Spark added a comment -

            User 'zhengruifeng' has created a pull request for this issue:
            https://github.com/apache/spark/pull/11974

            podongfeng Ruifeng Zheng added a comment -

            There is another sklearn example for MiniBatch KMeans: http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
            josephkb Joseph K. Bradley added a comment - - edited

            Quick note: Do you have a use case which requires faster KMeans? And how does it compare when you run BisectingKMeans?

            It would be good to have more info to assess the priority of this change. (I have not heard of too many issues with KMeans yet.)

            Thanks!

            podongfeng Ruifeng Zheng added a comment -

            Faster KMeans is required in many cases:
            high dimensionality (> 1,000,000)
            large K (> 1000)
            large datasets (unlabeled data is much easier to obtain than labeled data, and the dataset is usually much larger than in supervised learning tasks)

            In practice, I have encountered a time-consuming case: I needed to cluster about one billion points with 300,000 dimensions into 1,000 centers. Using MLlib's KMeans on a cluster of 16 servers, it took about one week to finish.

            I will compare it with BisectingKMeans on speed and cost.

            podongfeng Ruifeng Zheng added a comment -

            I have tested BisectingKMeans on the same dataset.

            val bkm = new BisectingKMeans().setK(10).setMaxIterations(10)
            val bkmModel = bkm.run(vecs)
            bkmModel.computeCost(vecs)
            res3: Double = 3.4435759325644255E8

            Training takes about 9050 sec.

            In summary (three tests were done, and the results are similar):

            Algorithm                           Training time    Cost
            KMeans                              2876 sec         3.317029898599564E8
            BisectingKMeans                     9050 sec         3.4435759325644255E8
            MiniBatch KMeans (fraction=0.1)     263 sec          3.3169865959604424E8
            MiniBatch KMeans (fraction=0.01)    90 sec           3.317195831216454E8

            rnowling R J Nowling added a comment -

            This is a dupe of SPARK-2308 but that needs someone to take it over.

            podongfeng Ruifeng Zheng added a comment - - edited

            In addition, mini-batch EM can also be used for mixture models; see part 6 (EM variants) in http://www.cs.ubc.ca/%7Emurphyk/Teaching/CS340-Fall06/reading/mixtureModels.pdf

            podongfeng Ruifeng Zheng added a comment -

            srowen the run changes have already been merged, so can we continue working on this?

            srowen Sean R. Owen added a comment -

            See the comments on SPARK-2308, in particular about sampling not being efficient. There is some prior work too. I think the issues can be overcome and I understand the idea here. If it's a simple extra option in the API, it seems pretty plausible. You're just running k-means on a random sample of the data every iteration right?

            rnowling R J Nowling added a comment - - edited

            I did the initial implementation for SPARK-2308. Re: the random sampling. With Spark's approach to random sampling, a Bernoulli trial is performed for each data point in the RDD. It's not as efficient as the case where random-access indexing is available. That said, if your vectors are quite long, then you save computational time on evaluating distances and such. Thus, when evaluating the performance, don't just look at the case of a large number of vectors – look at the case of vectors with many elements.

            josephkb Joseph K. Bradley added a comment -

            I'm fine with improving KMeans, but I'm still not convinced about the priority. My main issue with the above use cases is that it doesn't make much sense to me to apply KMeans to high-dimensional data. Basic k-means clustering just doesn't perform well in high dimensions AFAIK (and is "provably" useless if you analyze it under reasonable statistical assumptions). I'm sure there are practical applications in which it's still useful, but are there no good alternatives, such as doing dimensionality reduction before clustering?

            mlnick Nicholas Pentreath added a comment -

            The actual fix in the PR is pretty small - essentially just adding an rdd.sample call (similar to the old mllib gradient descent impl). So if we can see some good speed improvements on a relatively large class of input datasets, this seems like an easy win. From the performance tests above it seems like there's a significant win even for low-dimensional vectors. For higher dimensions the improvement may be as large or perhaps larger.

            podongfeng it may be best to add a few different cases to the performance tests to illustrate the behavior for different cases (and if not for certain cases, we should document that):

            1. small dimension, dense
            2. high dimension, dense
            3. small dimension, sparse
            4. high dimension, sparse

            rnowling do you have time to check out the PR here? It seems similar in spirit to what you had done and just uses the built-in RDD sampling (which I think derrickburns mentioned in SPARK-2308).

            chrisfalter Christopher Falter added a comment -

            josephkb rnowling I am working on a use case right now that would benefit greatly from mini-batch k-means: clustering documents. The data can be high-dimensional (large vectors of string-id:normalized-tf-idf scores). Dimensionality reduction techniques tend not to work very well in this scenario. And the document corpus can often be extremely large.

            I haven't looked at the code in the pull request yet, but it would be really useful if the API included a way to pass a user-defined distance function to replace the default (presumably Euclidean distance). Cosine similarity/dissimilarity is more useful than Euclidean distance in many use cases, and something like Levenshtein distance might be useful for genomics use cases...

            Thanks!
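
            For illustration, a cosine distance of the kind being requested here could be written as a plain function over feature vectors (a sketch only, not an existing Spark API; it densifies sparse vectors):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // Cosine distance = 1 - cosine similarity; assumes neither vector is all-zero.
            def cosineDistance(a: Vector, b: Vector): Double = {
              val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
              1.0 - dot / (Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0))
            }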

            mlnick Nicholas Pentreath added a comment -

            podongfeng did you manage to look into some performance experiments as per the above comment?

            podongfeng Ruifeng Zheng added a comment - - edited

            mlnick I ran some performance experiments last week. However, I got a strange result: the mini-batch process only yields a very small speedup, which is quite different from the earlier tests about 14 months ago.

            Env: Spark 2.1.0, --driver-memory 16G --executor-memory 1G --num-executors 16
            Low Dimensionality & Dense
            
            val random = new Random(123)
            val n = 1000000000
            val dim = 10
            
            val rdd = sc.parallelize(1 to n).map(i => Vectors.dense(Array.fill(dim)(random.nextDouble()))).persist()
            
            MiniBatchFraction    Duration per iter    Cost after 3 iters
            1                    125471               7.807840581259936E8
            0.1                  99093                7.807848215462652E8
            0.01                 95800                7.807861719437395E8

            The high-dimension, dense tests produced similar results.

            I'm going to study it further.
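
            For the "high dimension, sparse" case in the test matrix above, the input could be generated along these lines (a sketch only; the dim and nnz values are illustrative, not taken from the attached tests):

            import scala.util.Random
            import org.apache.spark.mllib.linalg.Vectors

            val n = 10000000
            val dim = 100000   // vector size
            val nnz = 100      // non-zeros per vector

            val sparseRdd = sc.parallelize(1 to n).map { i =>
              val rng = new Random(i)   // per-record seed, deterministic across runs
              val indices = Array.fill(nnz)(rng.nextInt(dim)).distinct.sorted
              val values = Array.fill(indices.length)(rng.nextDouble())
              Vectors.sparse(dim, indices, values)
            }.persist()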

            podongfeng Ruifeng Zheng added a comment -

            chrisfalter Agreed that it would be better to support KMeans with other distance/similarity measures: cosine similarity is more suitable for document clustering than Euclidean distance. And I think you should supply two functions: one for assigning points, the other for updating the means. However, it may be better to discuss that in another ticket.
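
            A rough sketch of that two-function shape (hypothetical names, not an existing Spark API):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // One function assigns a point to its closest center, the other recomputes
            // a center from the points assigned to it.
            trait ClusteringDistance extends Serializable {
              def distance(point: Vector, center: Vector): Double
              def updateCenter(points: Iterable[Vector]): Vector
            }

            object SquaredEuclidean extends ClusteringDistance {
              override def distance(point: Vector, center: Vector): Double =
                Vectors.sqdist(point, center)

              // For squared Euclidean distance the optimal center is simply the mean.
              override def updateCenter(points: Iterable[Vector]): Vector = {
                val sum = new Array[Double](points.head.size)
                points.foreach(p => p.foreachActive((i, v) => sum(i) += v))
                Vectors.dense(sum.map(_ / points.size))
              }
            }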

            podongfeng Ruifeng Zheng added a comment -

            mlnick I have attached the latest test results.
            I found that when K=2 the acceleration is not significant, since the share of distance computation is small.
            So I set K=100 in the latest tests, and the results show a good speedup.

            mlnick Nicholas Pentreath added a comment -

            It makes sense. However, I think k=100 is perhaps less common than, say, the range k=5 to k=10 or k=20. Of course it depends on the dimensionality and the problem domain. Would it be possible to update the performance numbers with k=5, k=10, k=20?

            podongfeng Ruifeng Zheng added a comment -

            mlnick OK, I have updated the attached performance tests. In general, a larger computation cost in KMeans.findClosest (larger K, higher dimension, more non-zeros) results in a bigger improvement.

            mlnick Nicholas Pentreath added a comment -

            Yes, as one would expect. But the key result here is that there is still a very good speedup even for relatively low values of k (5, 10, 20, etc.) and low dimension (10s-1000s). Results look good to me, thanks for putting that together! I will take a pass through the PR ASAP.

            podongfeng Ruifeng Zheng added a comment -

            mlnick sethah I am sorry to say that I think the current impl may be incorrect when compared with scikit-learn's impl. So I think you don't need to continue reviewing the PR.

            MiniBatch KMeans in scikit-learn is implemented according to this paper: https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
            and the centers are updated in an online style, which is quite different from this PR as it stands.
            So the current PR lacks theoretical justification, although this impl seems to work well in some cases.

            I will still work on this issue, and I think there are three options:
            1. strictly follow the paper, and sample the mini-batch back to the driver, like https://github.com/apache/spark/pull/1248/files#diff-f1bafa6656211b6909dbccdb443a3003R162
            2. still follow the paper; the sequential update seems parallelizable via foldByKey, with parallelism level k.
            3. follow the update method in the paper, but use the gradient of the mini-batch to update the centers; in this way, we modify the algorithm.

            If I obtain some test results, I will post them here.
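
            For reference, the online update in that paper pulls each sampled point's closest center toward the point with a per-center learning rate; a minimal sketch (illustrative only, not the scikit-learn or PR code):

            import org.apache.spark.mllib.linalg.{Vector, Vectors}

            // Process one mini-batch: counts(i) is the number of points assigned to center i so far.
            def onlineUpdate(batch: Array[Vector], centers: Array[Array[Double]], counts: Array[Long]): Unit = {
              batch.foreach { p =>
                val best = centers.indices.minBy(i => Vectors.sqdist(p, Vectors.dense(centers(i))))
                counts(best) += 1
                val eta = 1.0 / counts(best)      // per-center learning rate
                val c = centers(best)
                var j = 0
                while (j < c.length) {            // c <- (1 - eta) * c + eta * x
                  c(j) = (1.0 - eta) * c(j) + eta * p(j)
                  j += 1
                }
              }
            }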

            apachespark Apache Spark added a comment -

            User 'zhengruifeng' has created a pull request for this issue:
            https://github.com/apache/spark/pull/18389

            podongfeng Ruifeng Zheng added a comment -

            mlnick I have sent a new PR for MiniBatch KMeans, and the corresponding test results are attached.

            podongfeng Ruifeng Zheng added a comment -

            mlnick mengxr josephkb Mini-Batch KMeans is much faster than KMeans; do you have any plan to include it in MLlib? Thanks


            People

              Assignee: Unassigned
              Reporter: Ruifeng Zheng (podongfeng)
              Votes: 2
              Watchers: 7
