  SPARK-21268

Move center calculations to a distributed map in KMeans

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
    • Fix Version/s: 2.3.0
    • Component/s: MLlib

      Description

      As I was monitoring the performance of my algorithm with the Spark UI, I noticed that there was a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:

      https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

      It would work just as well to perform the subsequent "foreach" on the RDD instead, which would slightly improve performance.
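
      For reference, a minimal self-contained sketch of the pattern being described (a simplification using plain Array[Double] vectors, not the actual KMeans.scala code; the function name updateCentersOnDriver is illustrative only): per-cluster sums and counts are aggregated with reduceByKey, collected to the driver with collectAsMap, and the centers are then finished on the driver.

      {code:scala}
      import org.apache.spark.rdd.RDD

      // Simplified stand-in for the current driver-side pattern: aggregate per-cluster
      // sums and counts on the executors, collect them, then finish the centers on the driver.
      def updateCentersOnDriver(
          contribs: RDD[(Int, (Array[Double], Long))]): Map[Int, Array[Double]] = {
        val totalContribs = contribs
          .reduceByKey { case ((sum1, c1), (sum2, c2)) =>
            // Element-wise merge of the two partial sums
            var i = 0
            while (i < sum1.length) { sum1(i) += sum2(i); i += 1 }
            (sum1, c1 + c2)
          }
          .collectAsMap() // ships every (sum, count) pair to the driver

        // Everything below runs on the driver, one cluster at a time
        totalContribs.map { case (j, (sum, count)) =>
          j -> sum.map(_ / count) // the scal(1.0 / count, sum) step
        }.toMap
      }
      {code}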

      Edit:

      Per Sean Owen's recommendation, the scal() call and the VectorWithNorm creation should be computed in a distributed map before the collectAsMap, as sketched below.
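
      A sketch of the proposed version, under the same simplifying assumptions as above (the CenterWithNorm case class is an illustrative stand-in for MLlib's private VectorWithNorm): the division by the count and the norm computation move into a distributed mapValues, so collectAsMap only ships the finished centers to the driver.

      {code:scala}
      import org.apache.spark.rdd.RDD

      // Illustrative stand-in for MLlib's private VectorWithNorm
      case class CenterWithNorm(values: Array[Double], norm: Double)

      def updateCentersDistributed(
          contribs: RDD[(Int, (Array[Double], Long))]): Map[Int, CenterWithNorm] = {
        contribs
          .reduceByKey { case ((sum1, c1), (sum2, c2)) =>
            var i = 0
            while (i < sum1.length) { sum1(i) += sum2(i); i += 1 }
            (sum1, c1 + c2)
          }
          .mapValues { case (sum, count) =>
            // Finish the center on the executors: divide by the count (the scal() step)
            // and precompute the norm (what the VectorWithNorm constructor does)
            val center = sum.map(_ / count)
            CenterWithNorm(center, math.sqrt(center.map(v => v * v).sum))
          }
          .collectAsMap() // only the finished centers reach the driver
          .toMap
      }
      {code}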

    Attachments

    Activity

    People

    • Assignee: panoramix gjgd
    • Reporter: panoramix gjgd
    • Votes: 0
    • Watchers: 2

    Dates

    • Created:
    • Updated:
    • Resolved:

    Time Tracking

    • Estimated: 1h
    • Remaining: 1h
    • Logged: Not Specified