Description
I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the initSteps parameter. My understanding was that the higher the number of steps for the k-means|| initialization mode, the more initialization rounds the algorithm runs, and that in the end it selects the best model from all of those runs. That is the behavior I observe with the sklearn implementation when _n_init_ >= 10. However, with the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with initSteps set to 10, 50, or 500, different seeds still produce different results, and the trainingCost value I observe is sometimes far from the lowest.
As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed.
sklearn reaches a trainingCost (inertia) near 276655 in every iteration.
The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.
Does the initSteps parameter work as expected? My findings suggest that something might be off here.
Let me know where I can upload the sample dataset (2 MB).
```python
import pandas as pd
from sklearn.cluster import KMeans as KMeansSKlearn

df = pd.read_csv('sample_data.csv')

minimum = 99999999
for i in range(1, 10):
    kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
    model = kmeans.fit(df)
    print(f'Sklearn iteration {i}: {round(model.inertia_)}')
```

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder \
    .appName("kmeans-test") \
    .config('spark.driver.memory', '2g') \
    .master("local[2]") \
    .getOrCreate()

df1 = spark.createDataFrame(df)

assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
assembled_data = assemble.transform(df1)

minimum = 99999999
for i in range(1, 10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
    model = kmeans.fit(assembled_data)
    summary = model.summary
    print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
```
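The "loop with dynamic seed" workaround mentioned above can be sketched in plain Python as a best-of-N restart: fit once per seed and keep the model with the lowest cost. This is a minimal, self-contained sketch; `fit_model` is a hypothetical stand-in for the real `KMeans(..., seed=seed).fit(assembled_data)` call, and the simulated costs are made up for illustration.

```python
import random

def fit_model(seed):
    # Hypothetical placeholder for KMeans(..., seed=seed).fit(assembled_data).
    # Returns a (model, trainingCost) pair; here the cost is simulated so the
    # sketch runs without Spark.
    rng = random.Random(seed)
    return f"model-{seed}", 276655 + rng.randint(0, 50000)

def best_of(n_restarts):
    # Fit once per seed and keep the model with the lowest training cost.
    best_model, best_cost = None, float("inf")
    for seed in range(1, n_restarts + 1):
        model, cost = fit_model(seed)
        if cost < best_cost:
            best_model, best_cost = model, cost
    return best_model, best_cost

model, cost = best_of(10)
```

With the real PySpark call in place of `fit_model`, `cost` would be `model.summary.trainingCost` of the selected run.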