Description
I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the initSteps parameter. My understanding was that the higher the number of steps for the k-means|| initialization mode, the more initialization rounds the algorithm runs, and that in the end it selects the best model from all of those runs. That is the behavior I observe with the sklearn implementation when _n_init_ >= 10. However, with the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with initSteps set to 10, 50, or 500, different seeds still produce different results, and the trainingCost value I observe is sometimes far from the lowest.
As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed.
sklearn reaches a trainingCost (inertia) near 276655 in every iteration.
The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.
Does the initSteps parameter work as expected? My findings suggest that something might be off here.
Let me know where I can upload the sample dataset (2 MB).
```python
import pandas as pd
from sklearn.cluster import KMeans as KMeansSKlearn

df = pd.read_csv('sample_data.csv')

minimum = 99999999
for i in range(1, 10):
    kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
    model = kmeans.fit(df)
    print(f'Sklearn iteration {i}: {round(model.inertia_)}')
```

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder \
    .appName("kmeans-test") \
    .config('spark.driver.memory', '2g') \
    .master("local[2]") \
    .getOrCreate()

df1 = spark.createDataFrame(df)

assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
assembled_data = assemble.transform(df1)

minimum = 99999999
for i in range(1, 10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
    model = kmeans.fit(assembled_data)
    summary = model.summary
    print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
```
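The "loop with dynamic seed" workaround mentioned above can be sketched in plain Python as a best-of-N restart: fit once per seed and keep the model with the lowest cost. This is a minimal, self-contained sketch; `fit_model` is a hypothetical stand-in for the real `KMeans(..., seed=seed).fit(assembled_data)` call, and the simulated costs are made up for illustration.

```python
import random

def fit_model(seed):
    # Hypothetical placeholder for KMeans(..., seed=seed).fit(assembled_data).
    # Returns a (model, trainingCost) pair; here the cost is simulated so the
    # sketch runs without Spark.
    rng = random.Random(seed)
    return f"model-{seed}", 276655 + rng.randint(0, 50000)

def best_of(n_restarts):
    # Fit once per seed and keep the model with the lowest training cost.
    best_model, best_cost = None, float("inf")
    for seed in range(1, n_restarts + 1):
        model, cost = fit_model(seed)
        if cost < best_cost:
            best_model, best_cost = model, cost
    return best_model, best_cost

model, cost = best_of(10)
```

With the real PySpark call in place of `fit_model`, `cost` would be `model.summary.trainingCost` of the selected run.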