SPARK-29144

Binarizer handles sparse vectors incorrectly with negative threshold

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None

    Description

      Binarizer processes sparse vectors incorrectly when threshold < 0:

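      scala> import org.apache.spark.ml.feature.Binarizer
      scala> import org.apache.spark.ml.linalg.Vectors
      
      scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0))))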
      data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0]))
      
      scala> val df = data.toDF("id", "feature")
      df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]
      
      scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
      binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8
      
      scala> binarizer.transform(df).show()
      +---+-------------+-----------------+
      | id|      feature|binarized_feature|
      +---+-------------+-----------------+
      |  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
      |  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
      +---+-------------+-----------------+
      

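      The two input vectors hold the same values, so their outputs should be identical. Every entry (0.0 or 0.5) is greater than the threshold of -0.5, so both rows should binarize to [1.0,1.0,1.0]. In the sparse row, the implicit zero entries are left at 0.0 instead.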
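The mismatch can be reproduced outside Spark with a small pure-Python model. This is only an illustrative sketch: the function names `binarize_dense` and `binarize_active_only` are hypothetical, and the second one assumes (based on the output in the table, not on Spark's source) that only the stored entries of a sparse vector are compared against the threshold.

```python
# Hypothetical pure-Python model of the reported behaviour; not Spark's code.

def binarize_dense(values, threshold):
    """Map each entry to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def binarize_active_only(size, indices, values, threshold):
    """Binarize only the stored (active) entries of a sparse vector.

    Implicit zeros are never compared against the threshold, which is the
    assumed cause of the mismatch when threshold < 0.
    """
    out = [0.0] * size
    for i, v in zip(indices, values):
        if v > threshold:
            out[i] = 1.0
    return out


# Same data as the report: (3,[1],[0.5]) and [0.0,0.5,0.0], threshold -0.5.
sparse_result = binarize_active_only(3, [1], [0.5], -0.5)
dense_result = binarize_dense([0.0, 0.5, 0.0], -0.5)

print(sparse_result)  # [0.0, 1.0, 0.0]
print(dense_result)   # [1.0, 1.0, 1.0]
```

With a threshold of 0 or more the two functions agree, because an implicit zero can never exceed a non-negative threshold.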

       

      To handle sparse vectors when threshold < 0, there are two options:

      1. Return 1.0 for inactive (implicit zero) entries. This converts sparse vectors into dense ones.

      2. Throw an exception, as scikit-learn's Binarizer does:

      import numpy as np
      from scipy.sparse import csr_matrix
      from sklearn.preprocessing import Binarizer
      
      row = np.array([0, 0, 1, 2, 2, 2])
      col = np.array([0, 2, 2, 0, 1, 2])
      data = np.array([1, 2, 3, 4, 5, 6])
      a = csr_matrix((data, (row, col)), shape=(3, 3))
      binarizer = Binarizer(threshold=-1.0)
      binarizer.transform(a)
      Traceback (most recent call last):
        File "<ipython-input-24-7e12ab26b3ed>", line 1, in <module>
          binarizer.transform(a)
        File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
          return binarize(X, threshold=self.threshold, copy=copy)
        File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
          raise ValueError('Cannot binarize a sparse matrix with threshold '
      ValueError: Cannot binarize a sparse matrix with threshold < 0
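The two options can be compared side by side in a minimal pure-Python sketch. The names `binarize_densify` and `binarize_or_raise` are hypothetical, a sparse vector is modelled as a `(size, indices, values)` triple, and neither function is taken from Spark or scikit-learn.

```python
# Hypothetical sketch of the two proposed behaviours; not Spark's implementation.

def binarize_densify(size, indices, values, threshold):
    """Option 1: always produce a dense result.

    Implicit zeros are binarized like any other entry, so they become 1.0
    whenever the threshold is negative.
    """
    default = 1.0 if 0.0 > threshold else 0.0
    out = [default] * size
    for i, v in zip(indices, values):
        out[i] = 1.0 if v > threshold else 0.0
    return out


def binarize_or_raise(size, indices, values, threshold):
    """Option 2: refuse negative thresholds so the result can stay sparse."""
    if threshold < 0:
        raise ValueError("Cannot binarize a sparse vector with threshold < 0")
    kept = [i for i, v in zip(indices, values) if v > threshold]
    return (size, kept, [1.0] * len(kept))


print(binarize_densify(3, [1], [0.5], -0.5))   # [1.0, 1.0, 1.0]
print(binarize_or_raise(3, [1], [0.5], 0.2))   # (3, [1], [1.0])
```

Option 1 gives sparse and dense inputs the same result at the cost of materializing every entry; option 2 preserves sparsity but rejects inputs that the dense code path accepts.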

       

            People

              Assignee: podongfeng Ruifeng Zheng
              Reporter: podongfeng Ruifeng Zheng