Details
-
New Feature
-
Status: Resolved
-
Major
-
Resolution: Won't Fix
-
2.3.1
-
None
-
None
Description
Currently, CountVectorizer cannot output term frequency (TF). I hope an option for this can be added.
TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf
Example:
>>> from pyspark.ml.feature import CountVectorizer
>>> df = spark.createDataFrame(
...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
...     ["label", "raw"])
>>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
>>> model = cv.fit(df)
>>> model.transform(df).limit(1).show(truncate=False)
label raw vectors
0 [a, b, c] (3,[0,1,2],[1.0,1.0,1.0])
Instead, I want:
0 [a, b, c] (3,[0,1,2],[0.33,0.33,0.33]) # i.e., each vector divided by its sum (here 3), so that the new vector sums to 1 for every row (document)
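The requested normalization can be sketched in plain Python, outside Spark, to make the arithmetic concrete. This is only an illustration: the `(size, indices, values)` tuple mimics the printed form of the sparse vectors above, and `to_term_frequency` is a hypothetical helper name, not part of any Spark API.

```python
from typing import List, Tuple

# (size, indices, values), mirroring how the sparse vectors are printed above.
SparseVec = Tuple[int, List[int], List[float]]


def to_term_frequency(vec: SparseVec) -> SparseVec:
    """Divide each count by the row total so the values sum to 1."""
    size, indices, counts = vec
    total = sum(counts)
    if total == 0:
        # Empty document: return the counts unchanged to avoid dividing by zero.
        return size, list(indices), list(counts)
    return size, list(indices), [c / total for c in counts]


# Count vectors for the two example documents:
# row 0 is ["a", "b", "c"], row 1 is ["a", "b", "b", "c", "a"].
row0 = (3, [0, 1, 2], [1.0, 1.0, 1.0])
row1 = (3, [0, 1, 2], [2.0, 2.0, 1.0])

tf0 = to_term_frequency(row0)
tf1 = to_term_frequency(row1)
print(tf0)
print(tf1)
```

Within Spark itself, applying `pyspark.ml.feature.Normalizer` with `p=1.0` to the `vectors` column should give the same per-row result, since the L1 norm of a non-negative count vector equals its sum.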