[SPARK-32846] Support createDataFrame from an RDD of pd.DataFrames - ASF JIRA

Details

Type: New Feature
Status: In Progress
Priority: Minor
Resolution: Unresolved
Affects Version/s: 3.0.1
Fix Version/s: None
Component/s: PySpark, SQL
Labels:

Flags: Patch

Description

Add support in createDataFrame for a distributed collection of pandas.DataFrames: convert the RDD of pd.DataFrames to an RDD of Arrow record batches, then create the Spark DataFrame directly from it.

Performance is significantly better (vectorized) than creating a Spark DataFrame by converting each pd.DataFrame to a list of rows, similar to the improvement in SPARK-20791.

Initial example and benchmark for older Spark versions: https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

I'm currently working on a PR and will post it soon.

Extends the work done in:

https://issues.apache.org/jira/browse/SPARK-20791

https://issues.apache.org/jira/browse/SPARK-23030