Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Won't Fix
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
- Labels: Important
Description
When MinMaxScaler is applied to a column that contains only zeros, every value in the output column is 0.5.
This is inconsistent with the scikit-learn implementation, which leaves such a column at 0.
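The 0.5 comes from how Spark's MinMaxScaler handles a degenerate feature: when the observed max equals the observed min, it emits the midpoint of the target range instead of dividing by a zero range. A minimal sketch of that per-feature rule (plain Python, not Spark's actual code):

```python
def spark_minmax(x, e_min, e_max, lo=0.0, hi=1.0):
    """Sketch of MLlib MinMaxScaler's per-feature rule.

    e_min/e_max are the observed min/max of the feature; [lo, hi] is the
    target range (default [0, 1]).
    """
    if e_max == e_min:
        # Constant column: Spark falls back to the midpoint of [lo, hi].
        return 0.5 * (lo + hi)
    return (x - e_min) / (e_max - e_min) * (hi - lo) + lo

print(spark_minmax(0.0, 0.0, 0.0))  # constant-zero column -> 0.5
```

For the default range [0, 1] the midpoint is 0.5, which matches the third column of the output below.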
Steps to reproduce:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, 0.0]),),
    (1, Vectors.dense([2.0, 1.1, 0.0]),),
    (2, Vectors.dense([3.0, 10.1, 0.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(dataFrame)

# Rescale each feature to range [min, max]
scaledData = scalerModel.transform(dataFrame)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaledFeatures").show()
Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
| [1.0,0.1,0.0]| [0.0,0.0,0.5]|
| [2.0,1.1,0.0]| [0.5,0.1,0.5]|
|[3.0,10.1,0.0]| [1.0,1.0,0.5]|
+--------------+--------------+
VS.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler(copy=False)
test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
print(mms.fit_transform(test))
Output:
[[ 0. 0. 0. ]
[ 0.5 0.1 0. ]
[ 1. 1. 0. ]]
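scikit-learn avoids the degenerate case differently: a zero data range is replaced by 1 before dividing (cf. sklearn's internal _handle_zeros_in_scale helper), so a constant column maps to the low end of feature_range rather than the midpoint. A sketch of that behavior in plain NumPy (not sklearn's actual code):

```python
import numpy as np

def sklearn_minmax(X, lo=0.0, hi=1.0):
    """Sketch of scikit-learn MinMaxScaler semantics for constant columns."""
    X = np.asarray(X, dtype=float)
    data_min = X.min(axis=0)
    data_range = X.max(axis=0) - data_min
    # Constant columns: divide by 1 instead of 0, so they map to `lo`.
    data_range[data_range == 0.0] = 1.0
    X_std = (X - data_min) / data_range
    return X_std * (hi - lo) + lo

test = np.array([[1.0, 0.1, 0.0], [2.0, 1.1, 0.0], [3.0, 10.1, 0.0]])
print(sklearn_minmax(test))  # third column stays 0.0
```

Under this convention the all-zero column stays at 0.0, which is the behavior the report argues Spark should match.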