  1. Spark
  2. SPARK-6068

KMeans Parallel test may fail

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.2.1
    • Fix Version/s: None
    • Component/s: MLlib, Tests

    Description

      The test "k-means|| initialization" in KMeansSuite can fail when the random number generator is truly random.

      The test is predicated on the assumption that each round of k-means|| will add at least one new cluster center. The current implementation of k-means|| adds roughly 2*k cluster centers per round in expectation, because each point is sampled independently with probability proportional to its cost. However, there is no deterministic lower bound on the number of cluster centers added: a round can add none at all.
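
      The missing lower bound can be made concrete with a small sketch. It assumes the sampling rule is "keep point i independently with probability min(1, 2*k*cost_i / total_cost)"; the function names and the `oversample` parameter are hypothetical and this is not the MLlib code itself.

```python
import random


def sampling_probabilities(costs, k, oversample=2.0):
    """Per-point selection probability under the assumed rule."""
    total = sum(costs)
    return [min(1.0, oversample * k * c / total) for c in costs]


def prob_round_adds_nothing(costs, k, oversample=2.0):
    """Probability that one round selects no point at all."""
    p = 1.0
    for q in sampling_probabilities(costs, k, oversample):
        p *= 1.0 - q
    return p


def sample_round(costs, k, rng, oversample=2.0):
    """Indices selected in one round; the result may be empty."""
    probs = sampling_probabilities(costs, k, oversample)
    return [i for i, q in enumerate(probs) if rng.random() < q]


if __name__ == "__main__":
    costs = [1.0] * 100  # 100 points with equal cost (illustrative data)
    print(sum(sampling_probabilities(costs, k=1)))  # expected number of picks
    print(prob_round_adds_nothing(costs, k=1))      # chance of an empty round
    print(sample_round(costs, k=1, rng=random.Random()))
```

      With 100 equal-cost points and k = 1, each point is kept with probability 0.02, so the expected number of picks is 2 but an empty round still occurs with probability 0.98^100, about 0.13. Larger k makes an empty round much rarer, yet never impossible.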

      The options are:

      1) Change the k-means|| implementation to keep selecting points until it has satisfied a lower bound on the number of points chosen.

      2) Eliminate the test.

      3) Ignore the problem and rely on the random number generator happening to sample the space favorably.

      Option (1) is most in keeping with the contract that k-means|| should provide a precise number of cluster centers when possible.
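
      Option (1) might look roughly like the following sketch. It assumes the same independent sampling rule (probability min(1, 2*k*cost_i / total_cost) per point); `sample_with_lower_bound`, `min_new`, and `max_attempts` are hypothetical names, and accumulating picks across retries changes the sampling distribution slightly, so this is an illustration rather than a proposed patch.

```python
import random


def sample_with_lower_bound(costs, k, rng, min_new=1,
                            max_attempts=100, oversample=2.0):
    """Repeat the sampling pass until at least `min_new` points are chosen.

    Returns a sorted list of selected indices. The loop is capped at
    `max_attempts` passes so it terminates even if the bound is unreachable.
    """
    total = sum(costs)
    if total <= 0.0:
        return []  # every point already coincides with a center
    chosen = set()
    for _ in range(max_attempts):
        for i, c in enumerate(costs):
            if rng.random() < min(1.0, oversample * k * c / total):
                chosen.add(i)
        if len(chosen) >= min_new:
            break
    return sorted(chosen)


if __name__ == "__main__":
    rng = random.Random()  # unseeded, i.e. "truly random" for this purpose
    print(sample_with_lower_bound([1.0] * 100, k=1, rng=rng))
```

      The cap on attempts is a design choice: without it, a bound larger than the number of points with positive cost would loop forever.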

          People

            Assignee: Unassigned
            Reporter: Derrick Burns (derrickburns)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 24h
                Remaining: 24h
                Logged: Not Specified