Description
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark

spark = SparkSession.builder.getOrCreate()

# The P(positive | model_score):
# 0.6   -> 0.5   (1 out of the 2 labels is positive)
# 0.333 -> 0.333 (1 out of the 3 labels is positive)
# 0.20  -> 0.25  (1 out of the 4 labels is positive)
tc_pd = pd.DataFrame({
    "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
    "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
    "weight": 1,
})

# The fraction of positives per distinct model_score is already monotonically
# increasing, so it is the best isotonic fit. The expected calibrated
# model_scores are therefore:
# [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]

# The sklearn implementation of Isotonic Regression.
tc_regressor_sklearn = IsotonicRegression_sklearn().fit(
    X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
# >> sklearn: [0.5 0.5 0.33333333 0.33333333 0.33333333 0.25 0.25 0.25 0.25]

# The pyspark implementation of Isotonic Regression.
tc_df = spark.createDataFrame(tc_pd)
tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))
isotonic_regressor_pyspark = IsotonicRegression_pyspark(
    featuresCol='model_score', labelCol='label', weightCol='weight')
tc_model = isotonic_regressor_pyspark.fit(tc_df)
tc_pred_pd = tc_model.transform(tc_df).toPandas()
print("pyspark:", tc_pred_pd['prediction'].values)
# >> pyspark: [0.5 0.5 0.33333333 0.33333333 0.33333333 0. 0. 0. 0.]

The pyspark result does not match the expected calibrated model_scores: the four rows with model_score 0.20 are mapped to 0.0 instead of 0.25. Similar small toy examples lead to similarly unexpected results from the pyspark implementation. Strangely enough, for 'large' datasets the difference between the calibrated model_scores produced by the two implementations disappears.
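To make the expected values explicit, here is a minimal sanity check (a sketch reusing the tc_pd training frame from above; since all weights are 1, the plain per-group mean equals the weighted fraction of positives):

# Fraction of positives per distinct model_score; this sequence is already
# non-decreasing in model_score, so it is the isotonic solution itself.
print(tc_pd.groupby("model_score")["label"].mean())
# model_score
# 0.200    0.250000
# 0.333    0.333333
# 0.600    0.500000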
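It may also help to inspect what the fitted pyspark model actually stores. IsotonicRegressionModel exposes the learned piecewise-linear fit through its boundaries and predictions properties:

# The fitted model is a piecewise-linear function defined by parallel
# (boundary, prediction) vectors; the prediction stored for the 0.20
# boundary should reveal where the 0.0 values come from.
print("boundaries: ", tc_model.boundaries)
print("predictions:", tc_model.predictions)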
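Finally, a sketch to check the 'large dataset' observation, reusing the imports and SparkSession from the snippet above. The data-generating process (uniform scores, labels drawn with P(label=1) equal to the score) is an assumption for illustration only, not the data on which the behaviour was originally observed:

import numpy as np

# Generate a larger synthetic calibration set (assumed setup for illustration).
rng = np.random.default_rng(seed=42)
n = 10_000
scores = rng.uniform(0.0, 1.0, size=n)
labels = (rng.uniform(size=n) < scores).astype(int)  # P(label=1) = score
large_pd = pd.DataFrame({"model_score": scores, "label": labels, "weight": 1})

# sklearn fit/predict on the large sample.
sk_pred = IsotonicRegression_sklearn().fit(
    X=large_pd["model_score"], y=large_pd["label"],
    sample_weight=large_pd["weight"],
).predict(large_pd["model_score"])

# pyspark fit/transform on the same rows, in the same order.
large_df = spark.createDataFrame(large_pd)
spark_pred = (
    IsotonicRegression_pyspark(
        featuresCol="model_score", labelCol="label", weightCol="weight")
    .fit(large_df)
    .transform(large_df)
    .toPandas()["prediction"]
    .values
)

# If the observation holds, this difference should be (near) zero.
print("max abs difference:", np.abs(sk_pred - spark_pred).max())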