Description
Binarizer handles sparse vectors incorrectly when the threshold is < 0:
scala> val data = Seq((0, Vectors.sparse(3, Array(1), Array(0.5))), (1, Vectors.dense(Array(0.0, 0.5, 0.0))))
data: Seq[(Int, org.apache.spark.ml.linalg.Vector)] = List((0,(3,[1],[0.5])), (1,[0.0,0.5,0.0]))

scala> val df = data.toDF("id", "feature")
df: org.apache.spark.sql.DataFrame = [id: int, feature: vector]

scala> val binarizer: Binarizer = new Binarizer().setInputCol("feature").setOutputCol("binarized_feature").setThreshold(-0.5)
binarizer: org.apache.spark.ml.feature.Binarizer = binarizer_1c07ac2ae3c8

scala> binarizer.transform(df).show()
+---+-------------+-----------------+
| id|      feature|binarized_feature|
+---+-------------+-----------------+
|  0|(3,[1],[0.5])|    [0.0,1.0,0.0]|
|  1|[0.0,0.5,0.0]|    [1.0,1.0,1.0]|
+---+-------------+-----------------+
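The two input vectors above represent the same values, so their outputs should be identical. With a threshold of -0.5, the implicit zeros of the sparse vector are greater than the threshold and should also map to 1.0.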
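One way such a mismatch can arise is sketched below in plain Python. This is an illustrative model only, not Spark's actual code: the helpers `binarize_dense` and `binarize_sparse_naive` and the `(size, indices, values)` layout are assumptions made for the example. The naive sparse path visits only the active (stored) entries, so implicit zeros are never compared against the negative threshold.

```python
def binarize_dense(values, threshold):
    """Map each entry to 1.0 if it is strictly greater than threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def binarize_sparse_naive(size, indices, values, threshold):
    """Hypothetical sparse path that inspects only the stored entries.

    Implicit zeros are left at 0.0, which is only correct when threshold >= 0.
    """
    out = [0.0] * size
    for i, v in zip(indices, values):
        if v > threshold:
            out[i] = 1.0
    return out


# Same logical vector [0.0, 0.5, 0.0], once dense and once sparse.
dense_result = binarize_dense([0.0, 0.5, 0.0], -0.5)
sparse_result = binarize_sparse_naive(3, [1], [0.5], -0.5)

print(dense_result)   # [1.0, 1.0, 1.0]
print(sparse_result)  # [0.0, 1.0, 0.0]
```

Under this assumption the two results differ in exactly the positions that hold implicit zeros, which matches the output reported above.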
To deal with sparse vectors with threshold < 0, we have two options:
1. Return 1 for non-active (implicit zero) entries, but this converts sparse vectors to dense ones.
2. Throw an exception, as Scikit-Learn's Binarizer does:
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import Binarizer

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
a = csr_matrix((data, (row, col)), shape=(3, 3))
binarizer = Binarizer(threshold=-1.0)
binarizer.transform(a)

Traceback (most recent call last):
  File "<ipython-input-24-7e12ab26b3ed>", line 1, in <module>
    binarizer.transform(a)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1874, in transform
    return binarize(X, threshold=self.threshold, copy=copy)
  File "/home/zrf/Applications/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/data.py", line 1774, in binarize
    raise ValueError('Cannot binarize a sparse matrix with threshold '
ValueError: Cannot binarize a sparse matrix with threshold < 0
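The two options can be compared side by side in a small plain-Python sketch. Everything here is hypothetical and for illustration only: the function `binarize_sparse`, its `on_negative` parameter, and the `(size, indices, values)` tuple are not part of Spark or Scikit-Learn.

```python
def binarize_sparse(size, indices, values, threshold, on_negative="densify"):
    """Binarize a sparse vector given as (size, indices, values).

    on_negative is a hypothetical switch between the two options:
      "densify": check every position, implicit zeros included (option 1)
      "raise":   reject a negative threshold outright (option 2)
    """
    if threshold < 0:
        if on_negative == "raise":
            raise ValueError("Cannot binarize a sparse vector with threshold < 0")
        # Option 1: implicit zeros exceed a negative threshold, so they become
        # stored ones and the result is effectively dense.
        active = dict(zip(indices, values))
        ones = [i for i in range(size) if active.get(i, 0.0) > threshold]
    else:
        # threshold >= 0: implicit zeros map to 0, so sparsity is preserved.
        ones = [i for i, v in zip(indices, values) if v > threshold]
    return size, ones, [1.0] * len(ones)


# Option 1 on the vector from the report: all three positions are stored.
print(binarize_sparse(3, [1], [0.5], -0.5))

# Option 2 on the same input.
try:
    binarize_sparse(3, [1], [0.5], -0.5, on_negative="raise")
except ValueError as err:
    print(err)
```

The trade-off is visible in the first call: the output is correct, but every position is now stored, so memory use grows with the vector size rather than with the number of active entries.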