[SPARK-29040] Support pyspark.createDataFrame from a pyarrow.Table - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.1.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels:
None

Description

PySpark createDataFrame currently supports creating a spark DataFrame from Pandas, using Arrow if enabled. This could be extended to accept a pyarrow.Table which has the added benefit of being able to efficiently use columns with nested struct types.

It is possible to convert a pyarrow.Table with nested columns into a pandas.DataFrame, but the data becomes dictionaries, and is not a performant way to parallelize the data.

Time/Date columns would need to be handled specially, since pyspark currently uses pandas to convert Arrow data of these types to the required Spark internal format.

This follows from a mailing list discussion at http://apache-spark-user-list.1001560.n3.nabble.com/question-about-pyarrow-Table-to-pyspark-DataFrame-conversion-td36110.html

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: Bryan Cutler

Votes:: 5 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 10/Sep/19 19:13

Updated:: 01/Jun/23 05:20