[SPARK-35142] `OneVsRest` classifier uses incorrect data type for `rawPrediction` column - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0, 3.0.2, 3.1.0, 3.1.1
Fix Version/s: 3.0.3, 3.1.2, 3.2.0
Component/s: ML
Labels: None

Description

`OneVsRest` classifier uses an incorrect data type for the `rawPrediction` column.

Code to reproduce the issue:

from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from sklearn.datasets import load_iris

spark = SparkSession.builder.getOrCreate()

X, y = load_iris(return_X_y=True)
df = spark.createDataFrame(
    [(Vectors.dense(features), int(label)) for features, label in zip(X, y)],
    ["features", "label"],
)
train, test = df.randomSplit([0.8, 0.2])
lor = LogisticRegression(maxIter=5)
ovr = OneVsRest(classifier=lor)
ovrModel = ovr.fit(train)
pred = ovrModel.transform(test)

pred.printSchema()
# This prints out:
# root
#  |-- features: vector (nullable = true)
#  |-- label: long (nullable = true)
#  |-- rawPrediction: string (nullable = true)  # <- should not be string
#  |-- prediction: double (nullable = true)

# pred.show()  # this fails because of the incorrect datatype
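
The schema printout above can also be checked programmatically instead of by eye. The following is a minimal sketch, not part of the original report: the helper names `column_type_name` and `raw_prediction_is_string` are hypothetical, and the only assumption about the input is that it behaves like `pred.schema`, i.e. it can be indexed by column name and each field exposes `dataType.simpleString()`.

```python
def column_type_name(schema, column):
    """Return the short type name of a column, e.g. "vector" or "string".

    `schema` is expected to behave like a DataFrame's `.schema`:
    indexable by column name, with fields exposing `dataType.simpleString()`.
    """
    return schema[column].dataType.simpleString()


def raw_prediction_is_string(schema, column="rawPrediction"):
    """Return True if the column shows the symptom described in this issue."""
    return column_type_name(schema, column) == "string"
```

With the reproduction above, `raw_prediction_is_string(pred.schema)` would be expected to return `True` on the affected versions, matching the `rawPrediction: string` line in the printed schema.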