Uploaded image for project: 'Commons Math'
  1. Commons Math
  2. MATH-1465

Increase the initial sampling speed of KMeansPlusPlusClusterer for large k

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 3.6.1
    • None
    • None

    Description

      As the major difference of KMeans++ to classic Kmeans, an initial distance square sampling process is executed, the function is named "chooseInitialCenters" in the current implementation. The time complexity is O(dkn), where d is the dimension, k is the cluster number and n is the number of points.

      In paper "Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the 12th Workshop on Algorithm Engineering and Experiments (ALENEX 2010)", the authors introduced a data structure named "coreset tree" which can reduce the complexity of the initial sampling to O(dn log k). This is useful in the scenario when value of k is large, say 1000, then the run speed of the algorithm will be much faster.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              yuz1988 yu zhang
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated:
                  Original Estimate - 1,440h
                  1,440h
                  Remaining:
                  Remaining Estimate - 1,440h
                  1,440h
                  Logged:
                  Time Spent - Not Specified
                  Not Specified