[SPARK-19217] Offer easy cast from vector to array - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Later
Affects Version/s: 2.1.0
Fix Version/s: None
Component/s: ML, PySpark, SQL
Labels: None

Description

Working with ML often means working with DataFrames that have vector columns. You can't save these DataFrames to storage (edit: at least as ORC) without converting the vector columns to array columns, and there doesn't appear to be an easy way to make that conversion.

This is a common enough problem that it is documented on Stack Overflow. The current ways to convert a vector column to an array column are:

Convert the DataFrame to an RDD and back
Use a UDF
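For reference, the UDF workaround can be sketched roughly as follows. This is a minimal sketch, not the approach from any particular Stack Overflow answer: the helper names `vector_to_list` and `with_array_column` are hypothetical, and it assumes the column holds `pyspark.ml.linalg` vectors, which expose `toArray()`.

```python
from typing import List, Optional


def vector_to_list(vec) -> Optional[List[float]]:
    """Convert an ML vector (dense or sparse) to a plain list of floats."""
    # Hypothetical helper: relies only on toArray(), which both DenseVector
    # and SparseVector provide. Nulls are passed through unchanged.
    if vec is None:
        return None
    return [float(x) for x in vec.toArray()]


def with_array_column(df, vector_col: str, array_col: str):
    """Return df with array_col holding vector_col as array<double>."""
    # Hypothetical helper. PySpark is imported here so that vector_to_list
    # stays usable (and testable) without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    to_array = udf(vector_to_list, ArrayType(DoubleType()))
    return df.withColumn(array_col, to_array(col(vector_col)))
```

With the DataFrame from this ticket, `with_array_column(le_data, 'features', 'features')` would replace the vector column in place. Every row goes through a Python UDF, which is part of why a built-in cast would be preferable.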

Both approaches work fine, but it really seems like you should be able to do something like this instead:

(le_data
    .select(
        col('features').cast('array').alias('features')
    ))

We already have an ArrayType in pyspark.sql.types, but it appears that cast() doesn't support this conversion.

Would this be an appropriate thing to add?

Issue Links

is related to

SPARK-19653 `Vector` Type Should Be A First-Class Citizen In Spark SQL (Resolved)

People

Assignee: Unassigned
Reporter: Nicholas Chammas
Votes: 1
Watchers: 7

Dates

Created: 13/Jan/17 18:47
Updated: 12/Dec/22 18:11
Resolved: 04/Jan/19 10:41