Working with ML often means working with DataFrames that have vector columns. You can't save these DataFrames to storage (edit: at least as ORC) without converting the vector columns to array columns, and there doesn't appear to be an easy way to make that conversion.
This is a common enough problem that it is documented on Stack Overflow. The current workarounds for converting a vector column to an array column are:
- Convert the DataFrame to an RDD and back
- Use a UDF
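For reference, the UDF workaround usually looks something like this (a sketch; the column and DataFrame names are just for illustration):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# Unpack each MLlib vector into a plain Python list so Spark
# can represent the column as array<double>.
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

df = df.withColumn("features_arr", to_array("features"))
```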
Both approaches work fine, but it really seems like you should be able to do something like this instead:
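Something along these lines (this is the desired syntax, not working code; today this cast raises an error):

```python
from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, DoubleType

# Desired: cast a vector column directly to an array column.
df = df.withColumn("features_arr", col("features").cast(ArrayType(DoubleType())))
```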
We already have an ArrayType in pyspark.sql.types, but it appears that cast() doesn't support this conversion.
Would this be an appropriate thing to add?