Spark / SPARK-23535

MinMaxScaler returns 0.5 for an all-zero column


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None
    • Flags: Important

    Description

      When MinMaxScaler is applied to a column that contains only zeros, every output value in that column is 0.5.

      This is inconsistent with the sklearn implementation, which returns 0 for the same column.

       

      Steps to reproduce:

       

       

      from pyspark.ml.feature import MinMaxScaler
      from pyspark.ml.linalg import Vectors
      
      # Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
      # The third feature is 0.0 in every row.
      dataFrame = spark.createDataFrame([
          (0, Vectors.dense([1.0, 0.1, 0.0]),),
          (1, Vectors.dense([2.0, 1.1, 0.0]),),
          (2, Vectors.dense([3.0, 10.1, 0.0]),)
      ], ["id", "features"])
      
      scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
      
      # Compute summary statistics and generate MinMaxScalerModel
      scalerModel = scaler.fit(dataFrame)
      
      # Rescale each feature to the range [min, max].
      scaledData = scalerModel.transform(dataFrame)
      print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
      scaledData.select("features", "scaledFeatures").show()
      

      Features scaled to range: [0.000000, 1.000000]

      +--------------+--------------+
      |      features|scaledFeatures|
      +--------------+--------------+
      | [1.0,0.1,0.0]| [0.0,0.0,0.5]|
      | [2.0,1.1,0.0]| [0.5,0.1,0.5]|
      |[3.0,10.1,0.0]| [1.0,1.0,0.5]|
      +--------------+--------------+

       

      Compare with sklearn:

      import numpy as np
      from sklearn.preprocessing import MinMaxScaler
      
      mms = MinMaxScaler(copy=False)
      test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
      print(mms.fit_transform(test))
      

       

      Output:

      [[ 0.   0.   0. ]
       [ 0.5  0.1  0. ]
       [ 1.   1.   0. ]]
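
      For reference, the two behaviours can be sketched in plain Python, without Spark or sklearn. The function names below are hypothetical and exist only for illustration. The first follows the rule stated in the Spark MinMaxScaler documentation, where a column with E_max == E_min is rescaled to 0.5 * (max + min). The second is an approximation of what sklearn appears to do, on the assumption that it replaces a zero data range with 1 before dividing, so a constant column lands on the lower bound.

```python
from typing import List


def spark_style_rescale(column: List[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Rescale one column using the rule from Spark's MinMaxScaler docs (illustrative only)."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Documented special case: a constant column maps to the midpoint of [lo, hi].
        return [0.5 * (hi + lo) for _ in column]
    return [(x - e_min) / (e_max - e_min) * (hi - lo) + lo for x in column]


def sklearn_style_rescale(column: List[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Rescale one column the way sklearn appears to (assumption, illustrative only)."""
    e_min, e_max = min(column), max(column)
    data_range = e_max - e_min
    if data_range == 0.0:
        # Assumed behaviour: a zero range is treated as 1, so the column maps to lo.
        data_range = 1.0
    return [(x - e_min) / data_range * (hi - lo) + lo for x in column]


all_zero = [0.0, 0.0, 0.0]
print(spark_style_rescale(all_zero))    # [0.5, 0.5, 0.5]
print(sklearn_style_rescale(all_zero))  # [0.0, 0.0, 0.0]
```

      Both sketches agree on any column that is not constant; they differ only in how the zero-range case is handled.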

       

       


          People

            Assignee: Unassigned
            Reporter: Yigal Weinberger (yigalw)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Original Estimate: 24h
                Remaining Estimate: 24h
                Time Spent: Not Specified