Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Won't Fix
- Affects Version/s: 2.0.0
- Fix Version/s: None
- Component/s: None
- Labels: Important
Description
When MinMaxScaler is applied to a column that contains only zeros, every value in the output column is 0.5.
This is inconsistent with the scikit-learn implementation, which leaves such a column at 0.
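The 0.5 comes from how Spark's MinMaxScaler handles a degenerate feature: when the observed max equals the observed min, it emits the midpoint of the target range instead of dividing by a zero range. A minimal sketch of that per-feature rule (plain Python, not Spark's actual code):

```python
def spark_minmax(x, e_min, e_max, lo=0.0, hi=1.0):
    """Sketch of MLlib MinMaxScaler's per-feature rule.

    e_min/e_max are the observed min/max of the feature; [lo, hi] is the
    target range (default [0, 1]).
    """
    if e_max == e_min:
        # Constant column: Spark falls back to the midpoint of [lo, hi].
        return 0.5 * (lo + hi)
    return (x - e_min) / (e_max - e_min) * (hi - lo) + lo

print(spark_minmax(0.0, 0.0, 0.0))  # constant-zero column -> 0.5
```

For the default range [0, 1] the midpoint is 0.5, which matches the third column of the output below.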
Steps to reproduce:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

dataFrame = spark.createDataFrame([
    (0, Vectors.dense([1.0, 0.1, 0.0]),),
    (1, Vectors.dense([2.0, 1.1, 0.0]),),
    (2, Vectors.dense([3.0, 10.1, 0.0]),)
], ["id", "features"])

scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(dataFrame)

# Rescale each feature to range [min, max]
scaledData = scalerModel.transform(dataFrame)
print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
scaledData.select("features", "scaledFeatures").show()
Features scaled to range: [0.000000, 1.000000]
+--------------+--------------+
|      features|scaledFeatures|
+--------------+--------------+
| [1.0,0.1,0.0]| [0.0,0.0,0.5]|
| [2.0,1.1,0.0]| [0.5,0.1,0.5]|
|[3.0,10.1,0.0]| [1.0,1.0,0.5]|
+--------------+--------------+
VS.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler(copy=False)
test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
print(mms.fit_transform(test))
Output:
[[ 0. 0. 0. ]
[ 0.5 0.1 0. ]
[ 1. 1. 0. ]]
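scikit-learn avoids the degenerate case differently: a zero data range is replaced by 1 before dividing (cf. sklearn's internal _handle_zeros_in_scale helper), so a constant column maps to the low end of feature_range rather than the midpoint. A sketch of that behavior in plain NumPy (not sklearn's actual code):

```python
import numpy as np

def sklearn_minmax(X, lo=0.0, hi=1.0):
    """Sketch of scikit-learn MinMaxScaler semantics for constant columns."""
    X = np.asarray(X, dtype=float)
    data_min = X.min(axis=0)
    data_range = X.max(axis=0) - data_min
    # Constant columns: divide by 1 instead of 0, so they map to `lo`.
    data_range[data_range == 0.0] = 1.0
    X_std = (X - data_min) / data_range
    return X_std * (hi - lo) + lo

test = np.array([[1.0, 0.1, 0.0], [2.0, 1.1, 0.0], [3.0, 10.1, 0.0]])
print(sklearn_minmax(test))  # third column stays 0.0
```

Under this convention the all-zero column stays at 0.0, which is the behavior the report argues Spark should match.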