  1. Spark
  2. SPARK-6706

kmeans|| hangs for a long time if both k and vector dimension are large

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.2.1, 1.3.0
    • Fix Version/s: None
    • Component/s: MLlib
    • Environment: Windows 64bit, Linux 64bit

    Description

      When running k-means clustering with the "kmeans||" initialization algorithm, which is the default, the algorithm finishes some collect() jobs and then the driver hangs for a long time.

      Settings:

      • k above 100
      • feature dimension about 360
      • total data size is about 100 MB

      The issue was first noticed with Spark 1.2.1, where I tested in both local and cluster mode. On Spark 1.3.0, I can also reproduce the issue in local mode. However, I do not have a 1.3.0 cluster environment to test with.
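
The settings above can be approximated with synthetic data. The following is a minimal sketch, not the reporter's actual test case (that is in the attached kmeans-debug.7z, whose contents are not shown here). It assumes dense vectors of 8-byte doubles and uniformly random values. The helper names `rows_for_target_size`, `generate_rows`, and `run_kmeans` are hypothetical, and it is not confirmed that random data triggers the same hang.

```python
import random


def rows_for_target_size(dim, target_bytes, bytes_per_value=8):
    """Number of dense rows of `dim` values that fit in `target_bytes`.

    Assumes each value is stored as an 8-byte double (an assumption,
    the report only says the data is "about 100 MB").
    """
    return target_bytes // (dim * bytes_per_value)


def generate_rows(n_rows, dim, seed=0):
    """Yield `n_rows` random dense vectors of length `dim`, values in [0, 1)."""
    rng = random.Random(seed)
    for _ in range(n_rows):
        yield [rng.random() for _ in range(dim)]


def run_kmeans(sc, rows, k):
    """Cluster `rows` with MLlib KMeans using the k-means|| initialization.

    `sc` is an existing SparkContext. pyspark is imported lazily so the
    helpers above can be used without a Spark installation.
    """
    from pyspark.mllib.clustering import KMeans

    rdd = sc.parallelize(rows).cache()
    return KMeans.train(rdd, k, maxIterations=20,
                        initializationMode="k-means||")


if __name__ == "__main__":
    DIM = 360                         # feature dimension from the report
    TARGET_BYTES = 100 * 1024 * 1024  # "about 100 MB"
    n_rows = rows_for_target_size(DIM, TARGET_BYTES)
    print(n_rows)
```

Under these assumptions, 100 MB corresponds to roughly 36,000 rows of 360 doubles. Passing `list(generate_rows(n_rows, 360))` and a `k` above 100 to `run_kmeans` gives a setup of the same shape as the one described, which may help when comparing behavior across Spark versions.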

      Attachments

        1. kmeans-debug.7z
          5 kB
          Xi Shen


            People

              Assignee: Xiangrui Meng (mengxr)
              Reporter: Xi Shen (davidshen84)
              Shepherd: Xiangrui Meng
              Votes: 0
              Watchers: 3
