Commons Math / MATH-1509

Implement the MiniBatchKMeansClusterer

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      MiniBatchKMeans is a fast clustering algorithm that uses a subset of the points to initialize the cluster centers and a mini-batch of points in each training iteration.
      It can cluster millions of data points in a few seconds, and its results differ little from those of KMeans.
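
      For illustration only, the two ideas above (seeding from a subset, then updating centers from small random batches) might look roughly like the Java sketch below. It is not the code from the pull request: the class MiniBatchKMeansSketch, its methods and all parameter names are hypothetical, the k-means++ style draw on the subset is an assumption of this sketch, and it expects 1 <= k <= initSize <= points.length.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative mini-batch k-means sketch (hypothetical names, not the PR code). */
final class MiniBatchKMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Index of the center closest to point p (ties go to the lower index). */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        for (int j = 1; j < centers.length; j++) {
            if (sqDist(centers[j], p) < sqDist(centers[best], p)) {
                best = j;
            }
        }
        return best;
    }

    /** k-means++ style seeding restricted to a random subset of initSize points. */
    static double[][] initCenters(double[][] points, int k, int initSize, Random rng) {
        int n = points.length;
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
        }
        // Partial Fisher-Yates shuffle: the first initSize entries form the subset.
        for (int i = 0; i < initSize; i++) {
            int j = i + rng.nextInt(n - i);
            int t = idx[i];
            idx[i] = idx[j];
            idx[j] = t;
        }
        double[][] centers = new double[k][];
        centers[0] = points[idx[rng.nextInt(initSize)]].clone();
        double[] d2 = new double[initSize];
        Arrays.fill(d2, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            // d2[i] = squared distance from subset point i to its closest chosen center.
            double total = 0;
            for (int i = 0; i < initSize; i++) {
                d2[i] = Math.min(d2[i], sqDist(points[idx[i]], centers[c - 1]));
                total += d2[i];
            }
            // Draw the next center with probability proportional to d2.
            int chosen = rng.nextInt(initSize); // used only if every d2 is zero
            double r = rng.nextDouble() * total;
            double cum = 0;
            for (int i = 0; i < initSize; i++) {
                if (d2[i] > 0) {
                    cum += d2[i];
                    chosen = i;
                    if (cum > r) {
                        break;
                    }
                }
            }
            centers[c] = points[idx[chosen]].clone();
        }
        return centers;
    }

    /** Runs a fixed number of mini-batch iterations and returns the k centers. */
    static double[][] fit(double[][] points, int k, int initSize,
                          int batchSize, int iterations, long seed) {
        Random rng = new Random(seed);
        double[][] centers = initCenters(points, k, initSize, rng);
        int[] counts = new int[k];           // points absorbed so far by each center
        int[] batch = new int[batchSize];    // indices of the sampled points
        int[] assigned = new int[batchSize]; // cached nearest center per batch point
        for (int it = 0; it < iterations; it++) {
            // Sample a mini-batch with replacement and cache the assignments.
            for (int b = 0; b < batchSize; b++) {
                batch[b] = rng.nextInt(points.length);
                assigned[b] = nearest(centers, points[batch[b]]);
            }
            // Move each center toward its batch points with step 1 / count.
            for (int b = 0; b < batchSize; b++) {
                double[] c = centers[assigned[b]];
                double[] x = points[batch[b]];
                double eta = 1.0 / ++counts[assigned[b]];
                for (int d = 0; d < c.length; d++) {
                    c[d] += eta * (x[d] - c[d]);
                }
            }
        }
        return centers;
    }
}
```

      A call such as MiniBatchKMeansSketch.fit(points, 2, points.length, 8, 20, 42L) returns two centers. Because each iteration touches only batchSize points, the cost per iteration does not grow with the size of the data set, which is where the speed-up over a full KMeans pass would come from.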

      I have implemented it in Kotlin in my own project, and I'd like to contribute the code to Apache Commons Math, in Java of course.

      My implementation is based on Apache Commons Math3 and follows Python's sklearn.cluster.MiniBatchKMeans.

      Through testing, I found that it works well on dense ("intensive") data: performance improves significantly and the results differ little from those of KMeans++. On sparse data, however, the results differ considerably.

       

      Below is a comparison of my implementation and KMeansPlusPlusClusterer (see the attached images).

      I have created a pull request at https://github.com/apache/commons-math/pull/117, for reference only.

      Attachments

        1. compare.png
          270 kB
          Chen Tao
        2. random-data-comparison.png
          894 kB
          Chen Tao
        3. intensive-data-comparsion.png
          416 kB
          Chen Tao
        4. intensive-data-comparsion-badcase.png
          396 kB
          Chen Tao

            People

              Assignee: Unassigned
              Reporter: Chen Tao (chentao106)
              Votes: 0
              Watchers: 1

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 40m