Description
If we take a row from a DataFrame and try to extract a vector element by index, it is returned as a tuple:
from pyspark.ml.feature import HashingTF

df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
transformed = transformer.transform(df)
row = transformed.first()

row.vec  # As expected
## SparseVector(5, {4: 2.0})

row[1]  # Returns tuple
## (0, 5, [4], [2.0])
The problem cannot be reproduced if we create and access a Row directly:
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row

row = Row(vec=Vectors.sparse(3, [(0, 1)]))

row.vec
## SparseVector(3, {0: 1.0})

row[0]
## SparseVector(3, {0: 1.0})
but the tuple comes back if we use the Row above to create a DataFrame and extract the field by index:
df = sqlContext.createDataFrame([row], ("vec", ))
df.first()[0]
## (0, 3, [0], [1.0])
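The tuples in the output match the serialized form of a SparseVector UDT (type marker, size, indices, values). A plausible reading is that the Row holds the serialized representation internally, and only attribute access runs the UDT deserializer while index access returns the raw value. The sketch below is a simplified, hypothetical model of that asymmetry using stdlib-only stand-ins (the `SparseVector` and `Row` classes here are illustrative, not PySpark's actual implementation):

```python
# Hypothetical model of the reported bug: the Row stores the UDT's
# serialized tuple, and only attribute access deserializes it.
class SparseVector:
    """Minimal stand-in for pyspark.mllib.linalg.SparseVector."""
    def __init__(self, size, indices, values):
        self.size, self.indices, self.values = size, list(indices), list(values)

    def __repr__(self):
        pairs = ", ".join("%d: %s" % (i, v)
                          for i, v in zip(self.indices, self.values))
        return "SparseVector(%d, {%s})" % (self.size, pairs)


class Row(tuple):
    """Stores raw (serialized) values; deserializes on attribute access only."""
    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)
        row.__dict__["_fields"] = fields
        return row

    def __getattr__(self, name):
        # Attribute access looks up the field and deserializes UDT values.
        raw = tuple.__getitem__(self, self._fields.index(name))
        if isinstance(raw, tuple) and raw and raw[0] == 0:  # sparse marker
            _, size, indices, values = raw
            return SparseVector(size, indices, values)
        return raw
    # __getitem__ is inherited from tuple, so row[i] returns the raw value.


row = Row(("keys", "vec"), (["foo", "bar"], (0, 5, [4], [2.0])))
print(row.vec)  # deserialized: SparseVector(5, {4: 2.0})
print(row[1])   # raw serialized tuple: (0, 5, [4], [2.0])
```

Under this model the direct-Row case behaves differently simply because the value was never serialized in the first place, which matches the observed output.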
Issue Links
- duplicates SPARK-9116: python UDT in __main__ cannot be serialized by PySpark (Resolved)