Spark / SPARK-40232

KMeans: high variability in results despite high initSteps parameter value


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: ML, PySpark
    • Labels: None

    Description

      I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the initSteps parameter. My understanding is that the higher the number of steps for the k-means|| initialization mode, the more initialization rounds the algorithm runs, and in the end it selects the best model out of all of them. That is the behavior I observe when running the sklearn implementation with n_init >= 10. However, when running the PySpark implementation, regardless of the number of partitions of the underlying DataFrame (tested with 1, 4, and 8 partitions), and even with initSteps set to 10, 50, or 500, the results I get with different seeds differ, and the trainingCost value I observe is sometimes far from the lowest.

      As a workaround, to force the algorithm to run multiple times and select the best model, I used a loop with a dynamic seed (see the sketch below).
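      For reference, a minimal sketch of that workaround, assuming the same assembled_data DataFrame and 'features' column that are built in the script further down:

      from pyspark.ml.clustering import KMeans

      # Workaround sketch: run KMeans with several seeds and keep the model
      # with the lowest trainingCost. Assumes `assembled_data` with a
      # 'features' column, as built in the script below.
      best_model, best_cost = None, float('inf')
      for seed in range(10):
          candidate = KMeans(featuresCol='features', k=5, maxIter=300, seed=seed).fit(assembled_data)
          cost = candidate.summary.trainingCost
          if cost < best_cost:
              best_model, best_cost = candidate, cost
      print(f'Best trainingCost across seeds: {round(best_cost)}')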

      sklearn gets a cost (inertia) near 276655 in every iteration.

      The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.

      Does the initSteps parameter work as expected? My findings suggest that something might be off here.

      Let me know where I could upload this sample dataset (2 MB).

       

      import pandas as pd
      from sklearn.cluster import KMeans as KMeansSKlearn

      df = pd.read_csv('sample_data.csv')

      # scikit-learn baseline: k-means++ init with n_init=10 restarts per run
      for i in range(1, 10):
          kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
          model = kmeans.fit(df)
          print(f'Sklearn iteration {i}: {round(model.inertia_)}')

      from pyspark.sql import SparkSession

      spark = SparkSession.builder \
          .appName("kmeans-test") \
          .config('spark.driver.memory', '2g') \
          .master("local[2]") \
          .getOrCreate()

      df1 = spark.createDataFrame(df)

      from pyspark.ml.clustering import KMeans
      from pyspark.ml.feature import VectorAssembler

      assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
      assembled_data = assemble.transform(df1)

      # PySpark equivalent: k-means|| init with initSteps=100
      for i in range(1, 10):
          kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
          model = kmeans.fit(assembled_data)
          summary = model.summary
          print(f'PySpark iteration {i}: {round(summary.trainingCost)}')

       

      Attachments

        1. sample_data.csv (2.10 MB), uploaded by Patryk Piekarski

          People

            Assignee: Unassigned
            Reporter: Patryk Piekarski (patryk135)